Parameter Estimation¶
This course is derived from the course Introduction to Data Analysis in the Biological Sciences by Justin Bois, 2024 at Caltech. The original course material has been changed by Matthias Fuegger and Thomas Nowak.
Maximum likelihood estimation¶
Once we have written down our statistical model, we would like to get estimates for the parameters of the model. By far the most often used method for parameter estimation is maximum likelihood estimation. To understand how this works, we first need to define what a likelihood is. Note: Here we are specifically talking about the likelihood function in a frequentist setting, not the likelihood in a Bayesian context.
The likelihood function¶
Say we have some data $\mathbf{y}$ that we are modeling parametrically. The generative distribution is parametrized by a set of parameters $\theta$. Thus, the PDF for the generative distribution of all of the data is $f(\mathbf{y};\theta)$.
We define the likelihood function $L(\theta;\mathbf{y})$ as
\begin{equation} L(\theta;\mathbf{y}) = f(\mathbf{y}; \theta). \end{equation}
This may look odd, but it is best understood as the PDF of the generative distribution, viewed as a function of the parameters $\theta$ rather than of the data $\mathbf{y}$. The likelihood is therefore neither a probability nor a probability density.
We can define $\ell(\theta;\mathbf{y}) = \ln L(\theta;\mathbf{y})$ as the log-likelihood function, which we will find is often useful to work with.
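To make this concrete, here is a minimal sketch (using hypothetical Normally distributed measurements, not data from this course) of evaluating a log-likelihood: the data are held fixed, and the function is evaluated at different parameter values.
# Minimal sketch: the log-likelihood of i.i.d. Normal measurements,
# viewed as a function of the parameters (mu, sigma). Hypothetical data.
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3252)
y = rng.normal(loc=10.0, scale=2.0, size=50)  # hypothetical measurements

def log_like(mu, sigma, y):
    """ℓ(mu, sigma; y): sum of log-PDFs evaluated at the fixed data y."""
    return np.sum(st.norm.logpdf(y, loc=mu, scale=sigma))

# Same data, different parameter values give different likelihoods
print(log_like(9.0, 2.0, y), log_like(10.0, 2.0, y))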
Maximum likelihood estimate¶
The set of parameter values $\theta^*$ for which the likelihood function (and therefore also the log-likelihood function) is maximal is called the maximum likelihood estimate, or MLE. That is,
\begin{equation} \theta^* = \mathrm{arg\;max}_\theta\; L(\theta;\mathbf{y}) = \mathrm{arg\;max}_\theta \;\ell(\theta;\mathbf{y}). \end{equation}
The MLE is not the true value of the parameter. Instead, it is an estimate of the true parameter acquired by the procedure of finding the maximizer of the likelihood function.
The MLE has some nice properties, which is probably why it is so widely used to estimate parameters. We will not prove these properties here, but will instead state them.
- As the number of measurements gets very large, the MLE converges to the true parameter value. This is called consistency.
- Also as the number of measurements gets large, the MLE has the smallest variance among all well-behaved estimators for parameters. This is called asymptotic optimality.
- If $\theta^*$ is an MLE for the parameters $\theta$, then $g(\theta^*)$ is the MLE for the parameters $g(\theta)$, where $g$ is an arbitrary function. This is called equivariance.
Confidence intervals for MLEs¶
The MLE is not the true value of the parameter, but it is rather an estimate of it computed from data. Therefore, the MLE is a random variable. It can vary meaningfully from experiment to experiment, so we can compute confidence intervals for the MLE using bootstrapping. In essence, we can resample the data set to get a bootstrap sample. We use this bootstrap sample in our procedure for computing the MLE. The MLE computed from the bootstrap sample is a replicate. We gather many replicates and can then compute a bootstrap confidence interval as we have done in the nonparametric setting.
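As a minimal sketch of this procedure (using hypothetical Exponentially distributed data, for which the MLE works out to $1/\bar{t}$, as derived below), we can compute a bootstrap confidence interval for an MLE with nothing more than NumPy:
# Sketch: bootstrap confidence interval for an MLE (hypothetical data).
# The MLE here is 1/mean, the MLE for the rate of an Exponential model.
import numpy as np

rng = np.random.default_rng(3252)
t = rng.exponential(scale=2.0, size=150)  # hypothetical measurements

def mle(data):
    return 1.0 / np.mean(data)

# Each bootstrap sample (resampled with replacement) gives one MLE replicate
bs_reps = np.array(
    [mle(rng.choice(t, size=len(t), replace=True)) for _ in range(10_000)]
)

# 95% bootstrap confidence interval from percentiles of the replicates
print(np.percentile(bs_reps, [2.5, 97.5]))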
Computing an MLE¶
Computing an MLE involves writing down the likelihood function (or, more often the log-likelihood function) and finding its maximizer. For all but the simplest of models, this is done numerically.
However, it is often quite useful to make analytical progress in characterizing or computing MLEs. The numerical calculation can be difficult for many reasons, including high-dimensionality of the likelihood function, or multiple local maxima. In some cases, we can directly compute the MLE analytically, which can save numerical headaches.
MLE for rate of arrival of a Poisson process¶
We can model the time to catastrophe as Exponentially distributed if we consider it to be the result of the arrival of a single Poisson process. So, imagine we have a data set containing $n$ times to catastrophe. Assuming that these times are i.i.d. and Exponentially distributed, the joint PDF is
\begin{equation} f(\mathbf{t};\beta) = \prod_{i=1}^n \beta\,\mathrm{e}^{-\beta t_i} = \beta^n\,\exp\left[-\beta\sum_{i=1}^n t_i\right]. \end{equation}
The log-likelihood function is then
\begin{equation} \ell(\beta; \mathbf{t}) = n\ln \beta - \beta\sum_{i=1}^n t_i = n\ln \beta - n \, \beta\,\bar{t}, \end{equation}
where $\bar{t}$ is the arithmetic mean of the measured catastrophe times. The log-likelihood is maximal when its first derivative vanishes, or
\begin{equation} \frac{\mathrm{d}}{\mathrm{d}\beta}\,\ell(\beta; \mathbf{t}) = \frac{n}{\beta^*} - n \bar{t} = 0. \end{equation}
This tells us that the MLE is $\beta^* = 1/\bar{t}$.
Recall that the mean of an Exponential distribution is $1/\beta$, and that the plug-in estimate for the mean is $\bar{t}$. By the equivariance property of the MLE, the MLE of the mean $1/\beta$ is $1/\beta^* = \bar{t}$. That is, for the Exponential distribution, the plug-in estimate for the mean is also the maximum likelihood estimate of the mean.
Note that this may not be a good model. We assume the model is true, and under that assumption, we find the parameters that maximize the likelihood function. Within the context of the model, we get the (possibly scientifically irrelevant if the model is wrong) maximum likelihood estimate.
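As a quick numerical check of the result $\beta^* = 1/\bar{t}$ (a sketch with hypothetical, simulated data), we can minimize the negative log-likelihood with scipy and compare the optimum to $1/\bar{t}$:
# Sketch: verify numerically that the Exponential MLE is β* = 1/t̄.
import numpy as np
import scipy.optimize
import scipy.stats as st

rng = np.random.default_rng(3252)
t = rng.exponential(scale=2.0, size=200)  # hypothetical times, true β = 0.5

def neg_log_like(params, t):
    """Negative log-likelihood for i.i.d. Exponential measurements."""
    beta = params[0]
    if beta <= 0:
        return np.inf
    return -np.sum(st.expon.logpdf(t, scale=1 / beta))

res = scipy.optimize.minimize(
    neg_log_like, x0=np.array([1.0]), args=(t,), method="Powell"
)

print("numerical MLE:", res.x[0])
print("1 / t̄:", 1 / np.mean(t))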
MLE for the mean and variance of a Normal distribution¶
As a second example, consider the lengths of C. elegans eggs, modeled as i.i.d. and Normally distributed. In this case, the generative joint PDF, as we have previously written down, is
\begin{equation} f(\mathbf{y}; \mu, \sigma) = \left(\frac{1}{2\pi \sigma^2}\right)^{n/2}\,\exp\left( -\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu)^2 \right). \end{equation}
After some algebraic grunge, it can be shown that
\begin{equation} \sum_{i=1}^n(y_i-\mu)^2 = n(\bar{y} - \mu)^2 + n\hat{\sigma}^2, \end{equation}
where
\begin{equation} \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2 \end{equation}
is the plug-in estimate for the variance. Thus, the joint PDF is
\begin{equation} f(\mathbf{y}; \mu, \sigma) = \left(\frac{1}{2\pi \sigma^2}\right)^{n/2}\,\exp\left[-\frac{n(\bar{y} - \mu)^2 + n\hat{\sigma}^2}{2\sigma^2}\right]. \end{equation}
The log-likelihood function is then
\begin{equation} \ell(\mu, \sigma; \mathbf{y}) = \text{constant} -n\ln \sigma - \frac{n(\bar{y} - \mu)^2 + n\hat{\sigma}^2}{2\sigma^2}. \end{equation}
The log-likelihood is maximal when
\begin{align} \frac{\partial\ell}{\partial \mu} &= 0 ,\\ \frac{\partial\ell}{\partial \sigma} &= 0. \end{align}
Evaluating the first expression,
\begin{equation} \frac{\partial\ell}{\partial \mu} = -\frac{n}{(\sigma^*)^2}(\bar{y}-\mu^*) = 0. \end{equation}
The only solution is $\mu^* = \bar{y}$. So, as was the case with the Exponential distribution, the plug-in estimate for the mean gives the MLE for the mean of the Normal distribution.
Next, to compute the MLE for $\sigma$, we consider the second equation.
\begin{equation} \frac{\partial\ell}{\partial \sigma} = - \frac{n}{\sigma^*} + \frac{n(\bar{y} - \mu^*)^2 + n\hat{\sigma}^2}{(\sigma^*)^3} = 0. \end{equation}
We already found, though, that $\mu^* = \bar{y}$, so we have
\begin{equation} - \frac{n}{\sigma^*} + \frac{n\hat{\sigma}^2}{(\sigma^*)^3} = 0. \end{equation}
Solving, we get that $(\sigma^*)^2 = \hat{\sigma}^2$, which says that the MLE of the variance is also given by its plug-in estimate. This result reveals that the MLE can have bias, as did the plug-in estimate.
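As a quick numerical check of these results (a sketch with hypothetical data), scipy's Normal fit, which performs maximum likelihood estimation, agrees with the plug-in estimates:
# Sketch: for Normal data, the numerical MLE equals the plug-in estimates
# (mean, and variance with 1/n normalization). Hypothetical data.
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3252)
y = rng.normal(loc=1100.0, scale=40.0, size=100)  # hypothetical egg lengths

mu_mle, sigma_mle = st.norm.fit(y)  # MLE of location and scale

print(mu_mle, np.mean(y))        # μ* vs plug-in estimate of the mean
print(sigma_mle**2, np.var(y))   # (σ*)² vs plug-in estimate of the variance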
For all three parameters we estimated (the rate parameter of the Exponential, and the location and scale parameters of the Normal), the MLE was given by the plug-in estimate of the statistical functional whose value is given by the parameter. Note, though, that this is not generally the case.
Method of moments¶
There is another widely used method for obtaining parameter estimates that we have not yet covered: the method of moments. Parameter estimates from the method of moments do not have the same desirable properties as maximum likelihood estimates, so they are not as useful in practice for parametric models. They are, however, useful for getting initial guesses for numerical maximum likelihood estimation. For this reason, and because the method of moments is still fairly widely used (though far less than MLE), you should be familiar with it, and we discuss it here.
We first introduce the method of moments formally, and then demonstrate with examples.
Definition of method of moments¶
For ease of discussion, we will consider a generative model of a set of $n$ i.i.d. measurements drawn from a generative distribution with probability density function $f(x;\theta)$, where $\theta$ are the parameters. The $m$th moment of the distribution is
\begin{equation} \langle x^m \rangle = \int x^m f(x;\theta)\ \mathrm{d}x. \end{equation}
In the case where $x$ is discrete, the $m$th moment is
\begin{equation} \langle x^m \rangle = \sum_x\,x^m f(x;\theta). \end{equation}
The moments will in general be functions of the parameters $\theta$.
Recalling our discussion of plug-in estimates of linear statistical functionals, the plug-in estimate for the $m$th moment is
\begin{equation} \widehat{\langle x^m \rangle} = \frac{1}{n}\sum_{i=1}^n\,x_i^m, \end{equation}
where $x_i$ is one of the $n$ empirical measurements.
In the method of moments, we equate the plug-in estimates for moments with the analytical expressions for the moments derived from the model generative distribution and then solve for $\theta$ to get our estimates. This is best seen in practice.
Examples of method of moments¶
We will demonstrate the method of moments with a few examples.
Estimates for an Exponential distribution¶
For an Exponential distribution, the probability density function is
\begin{equation} f(t;\beta) = \beta\,\mathrm{e}^{-\beta t}. \end{equation}
There is a single parameter, $\beta$. The first moment of the Exponential distribution is $\langle t \rangle = 1/\beta$, which may be computed by evaluating the integral
\begin{equation} \int_0^\infty t\,f(t;\beta)\ \mathrm{d}t = \int_0^\infty t\,\beta\,\mathrm{e}^{-\beta t}\ \mathrm{d}t = 1/\beta, \end{equation}
but is more easily looked up in the Distribution Explorer. The plug-in estimate for the first moment is
\begin{equation} \widehat{\langle t \rangle} = \frac{1}{n}\sum_{i=1}^n\,t_i = \bar{t}. \end{equation}
Equating the moment from the model generative distribution and the plug-in estimate gives $\beta = 1/\bar{t}$, which is our estimate by method of moments for the parameter $\beta$.
Estimates for a Normal distribution¶
We now turn to the Normal distribution. We will again use the first moment, but we need to use a second moment to get a second equation, since the Normal distribution has two parameters, $\mu$ and $\sigma$. It is often the case that it is easier to compare the variance and the plug-in estimate for the variance instead of the second moment. Note that the variance can be expressed in terms of the first and second moments,
\begin{equation} \sigma^2 = \langle x^2 \rangle - \langle x \rangle^2, \end{equation}
so that even though we are comparing the variance and the plug-in estimate for the variance, we are still doing method of moments.
The first moment and variance of a Normal distribution are $\mu$ and $\sigma^2$, respectively. Equating them with the plug-in estimates gives our method-of-moments estimates for the parameters, $\mu = \bar{x}$ and $\sigma^2 = \hat{\sigma}^2$, where $\hat{\sigma}^2$ is the plug-in estimate for the variance.
Estimates for a Gamma distribution¶
You may have noticed that for the Exponential and Normal distributions, the method of moments gives the MLE. This is not generally the case. As a counter example, consider a Gamma distribution, which has probability density function
\begin{equation} f(x;\alpha, \beta) = \frac{1}{\Gamma(\alpha)}\,\frac{(\beta x)^\alpha}{x}\,\mathrm{e}^{-\beta x}, \end{equation}
with two parameters, $\alpha$ and $\beta$. Equating the first moment and variance with their respective plug-in estimates gives
\begin{aligned} &\frac{\alpha}{\beta} = \bar{x},\\[1em] &\frac{\alpha}{\beta^2} = \hat{\sigma}^2. \end{aligned}
Solving for $\alpha$ and $\beta$ yields our method-of-moments estimates for the parameters,
\begin{aligned} &\alpha = \frac{\bar{x}^2}{\hat{\sigma}^2},\\[1em] &\beta = \frac{\bar{x}}{\hat{\sigma}^2}. \end{aligned}
This is not the same result as given by maximum likelihood estimation. In fact, the MLE cannot be written analytically.
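To see the difference concretely, here is a sketch (with hypothetical data) comparing the method-of-moments estimates above with a numerical MLE, using scipy's Gamma fit with the location parameter fixed at zero (scipy parametrizes the Gamma with a scale parameter equal to $1/\beta$):
# Sketch: method of moments vs. numerical MLE for a Gamma distribution.
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3252)
x = rng.gamma(shape=2.5, scale=1 / 0.8, size=500)  # hypothetical data: α=2.5, β=0.8

# Method of moments: α = x̄²/σ̂², β = x̄/σ̂² (plug-in variance, 1/n normalization)
xbar, var_hat = np.mean(x), np.var(x)
alpha_mom = xbar**2 / var_hat
beta_mom = xbar / var_hat

# Numerical MLE; fix loc=0 so the fit has only shape α and scale 1/β
alpha_mle, _, scale_mle = st.gamma.fit(x, floc=0)
beta_mle = 1 / scale_mle

print("MoM:", alpha_mom, beta_mom)
print("MLE:", alpha_mle, beta_mle)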
Numerical maximum likelihood estimation¶
import itertools
import os
import warnings
import numpy as np
import polars as pl
import scipy.optimize
import scipy.stats as st
import iqplot
import bebi103
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
Maximum likelihood estimates for parameters may sometimes be computed analytically, but often they cannot. In those cases, we need to resort to numerical methods. We will demonstrate some numerical methods to perform maximum likelihood estimation of the parameters of a Negative Binomial distribution.
The data set¶
The data come from the Elowitz lab, published in Singer et al., Dynamic Heterogeneity and DNA Methylation in Embryonic Stem Cells, Molec. Cell, 55, 319-331, 2014, available here.
In this paper, the authors investigated cell populations of embryonic stem cells using RNA single molecule fluorescence in situ hybridization (smFISH), a technique that enables them to count the number of mRNA transcripts in a cell for a given gene. They were able to measure four different genes in the same cells. So, for one experiment, they get the counts of four different genes in a collection of cells.
The authors focused on genes that code for pluripotency-associated regulators to study cell differentiation. Indeed, differing gene expression levels are a hallmark of differentiated cells. The authors do not just look at counts in a given cell at a given time. The temporal nature of gene expression is also important. While the authors do not directly look at temporal data using smFISH (since the technique requires fixing the cells), they did look at time lapse fluorescence movies of other regulators. We will not focus on these experiments here, but will discuss how the distribution of mRNA counts acquired via smFISH can serve to provide some insight about the dynamics of gene expression.
The data set we are analyzing now comes from an experiment where smFISH was performed in 279 cells for the genes rex1, rest, nanog, and prdm14. The data set may be downloaded at https://s3.amazonaws.com/bebi103.caltech.edu/data/singer_transcript_counts.csv.
Exploratory data analysis¶
We first load in the data set and generate ECDFs for the mRNA counts for each of the four genes.
# Load DataFrame
df = pl.read_csv(os.path.join('data', 'singer_transcript_counts.csv'), comment_prefix='#')
# Take a look
df.head()
Rex1 | Rest | Nanog | Prdm14 | |
---|---|---|---|---|
i64 | i64 | i64 | i64 | str |
11 | 34 | 39 | 0 | null |
172 | 91 | 33 | 5 | null |
261 | 70 | 68 | 0 | null |
178 | 54 | 88 | 1 | null |
129 | 54 | 41 | 0 | null |
Each of the 279 rows has the mRNA counts for each of the four genes. There may be multiple cell types present, and we do not know how many there are. Our aim here is not to find how many cell types there are, but to demonstrate how MLE works. Nonetheless, we should have some idea of the properties of the data set we are exploring. We can start by plotting ECDFs for each of the four genes. It is useful to have the gene names around.
genes = ["Nanog", "Prdm14", "Rest", "Rex1"]
plots = [
iqplot.ecdf(
data=df.get_column(gene),
q=gene,
x_axis_label="mRNA count",
title=gene,
frame_height=150,
frame_width=200,
)
for gene in genes
]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))
Note the difference in the x-axis scales. Clearly, prdm14 has far fewer mRNA copies than the other genes. The presence of two inflection points in the Rex1 ECDF implies bimodality, leading us to suspect that there may be two cell types, or at least more than one cell type. We can plot all pairwise combinations of gene counts to explore further.
For visualizing the data, it will be useful to have labels for the cells so we can compare expression levels of all four genes in the same cell at once. We will label them according to the Rex1 levels.
# Add cell label, ranked lowest to highest in Rex1 expression
df = df.with_columns(pl.col('Rex1').rank(method='dense').alias('cell'))
# Add colors for plotting, using quantitative to color conversion in bebi103 package
df = df.with_columns(pl.Series(name='color', values=bebi103.viz.q_to_color(df["cell"], bokeh.palettes.Viridis256)))
We will now look at all of the pair-wise plots of mRNA expression. We color the dots according to cell ID, with cell 1 having the lowest count of Rex1 mRNA and the last cell having the highest.
def pairwise_plot(df, gene1, gene2):
p = bokeh.plotting.figure(
frame_height=150,
frame_width=150,
x_axis_label=gene1,
y_axis_label=gene2,
)
p.scatter(source=df.to_dict(), x=gene1, y=gene2, color="color", size=2)
return p
plots = [
pairwise_plot(df, gene1, gene2) for gene1, gene2 in itertools.combinations(genes, 2)
]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=3))
It appears as though there is a correlation between Rest, Nanog, and Rex1. They tend to be high or low together. Prdm14, on the other hand, shows less correlation.
Model for mRNA levels¶
We will now model gene expression of each of the four genes separately, though they are connected by which cell is being measured. We will discuss that later. For now, we develop a model for the mRNA counts for a given gene.
If gene expression is a purely Poisson process, we might expect a Poisson distribution. Or, if the copy number is itself somehow tightly regulated, we might expect a Normal distribution.
Study of gene expression dynamics, largely through fluorescence imaging, has led to a different story. Expression of many important genes can be bursty, which means that the promoter is on for a period of time in which transcripts are made, and then it is off for a while. The "on" periods are called "bursts" and are themselves well-modeled as a Poisson process. That is to say that the amount of time that a promoter is on is Exponentially distributed. Thus, we can think of a burst as a series of Bernoulli trials. A "failure" is production of an mRNA molecule, and a "success" is a switch to an off state. The number of "successes" we get is equal to the number of bursts we get per decay time of the mRNA. We can define the number of bursts before degradation of the mRNA as $\alpha$. This is the so-called burst frequency. So, we have a series of Bernoulli trials and we wait for $\alpha$ successes. Then, $n$, the total number of failures (which is the number of mRNA transcripts), is Negative Binomially distributed, since this matches the Negative Binomial story. Referring to the parametrization used in the Distribution Explorer,
\begin{align} n \sim \text{NBinom}(\alpha, \beta), \end{align}
where $\beta$ is related to the probability $\theta$ of a success of a Bernoulli trial by $\theta = \beta/(1+\beta)$.
The meaning of the parameter $\beta$, and the related quantity $\theta$, can be a little mystical here. We would like to relate it to the typical burst size, i.e., the typical number of transcripts made per burst. The size of a single given burst (that is, the number of transcripts made in a burst) is geometrically distributed (since it matches that story), so
\begin{align} f(n_\mathrm{burst} ; \theta) = (1-\theta)^{n_\mathrm{burst}}\,\theta. \end{align}
The mean number of transcripts $b$ in a burst is
\begin{align} b \equiv \left\langle n_\mathrm{burst}\right\rangle &= \sum_{n_\mathrm{burst}=0}^\infty n_\mathrm{burst}(1-\theta)^{n_\mathrm{burst}}\theta\\[1em] &= \theta \sum_{n_\mathrm{burst}=0}^\infty n_\mathrm{burst}(1-\theta)^{n_\mathrm{burst}} \\[1em] &= \theta(1-\theta)\, \frac{\mathrm{d}}{\mathrm{d}(1-\theta)}\sum_{n_\mathrm{burst}=0}^\infty(1-\theta)^{n_\mathrm{burst}} \\[1em] &= \theta(1-\theta)\, \frac{\mathrm{d}}{\mathrm{d}(1-\theta)}\,\frac{1}{\theta}\\[1em] &= -\theta(1-\theta)\, \frac{\mathrm{d}}{\mathrm{d}\theta}\,\frac{1}{\theta} \\[1em] &= \frac{1-\theta}{\theta} \\[1em] &= \frac{1}{\beta}. \end{align}
So we now see that $1/\beta$ is the typical burst size. Because mRNA copy numbers of bursty gene expression are Negative Binomially distributed, we can characterize the expression levels of a given cell type by the two parameters of the Negative Binomial, the burst frequency $\alpha$ and the burst size $b = 1/\beta$. These are the two parameters we would like to infer from transcript count data. The conclusion of all this is that we have our likelihood.
\begin{align} &n \sim \text{NBinom}(\alpha, \beta),\\[1em] &b = 1/\beta. \end{align}
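As a small sanity check of this parametrization (with hypothetical parameter values), a Negative Binomial with success probability $1/(1+b)$, the form we will pass to scipy below, indeed has mean $\alpha b$ and variance $\alpha b(1+b)$:
# Sketch: with success probability 1/(1+b), the Negative Binomial has
# mean αb and variance αb(1+b). Hypothetical parameter values.
import scipy.stats as st

alpha, b = 2.0, 30.0  # hypothetical burst frequency and burst size
dist = st.nbinom(alpha, 1 / (1 + b))

print(dist.mean(), alpha * b)           # should agree
print(dist.var(), alpha * b * (1 + b))  # should agree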
Maximum likelihood estimation by numerical optimization¶
To compute the MLE for the two parameters, the burst frequency $\alpha$ and the burst size $b$, we need to define the likelihood function. We make the assumption that the number of transcripts in each cell is i.i.d., giving a statistical model of
\begin{align} n_i \sim \text{NBinom}(\alpha,\beta)\;\forall i. \end{align}
Referring to the PMF of the Negative Binomial distribution and making the change of variables $b=1/\beta$, the likelihood function is
\begin{align} L(\alpha, b;\mathbf{n}) = \prod_i\frac{\Gamma(n_i+\alpha)}{\Gamma(\alpha)\,n_i!}\left(\frac{1}{1+b}\right)^\alpha\left(\frac{b}{1+b}\right)^{n_i}, \end{align}
and the log-likelihood is
\begin{align} \ell(\alpha, b;\mathbf{n}) = \ln L(\alpha, b;\mathbf{n}) = \sum_i \ln \left(\frac{\Gamma(n_i+\alpha)}{\Gamma(\alpha)\,n_i!}\left(\frac{1}{1+b}\right)^\alpha\left(\frac{b}{1+b}\right)^{n_i}\right). \end{align}
To find the MLE, we need to find the values of $\alpha$ and $b$ that satisfy
\begin{align} \frac{\partial \ell}{\partial \alpha} = \frac{\partial \ell}{\partial b} = 0. \end{align}
Unfortunately, no closed form solution exists for this. We therefore need to resort to numerical optimization to find the MLE $\alpha^*$ and $b^*$.
Numerical optimization¶
Numerical optimization routines are typically implemented to find minimizers of a function rather than maximizers. The function being minimized is called an objective function. This is not a problem for maximum likelihood estimation; we simply use the negative log-likelihood as our objective function.
Sometimes, we have constraints on the allowed values of the parameters. In our case, both $\alpha$ and $b$ must be positive. So, the statement of the optimization problem to find the MLE is
\begin{align} \text{minimize } (-\ell(\alpha, b;\mathbf{n})) \text{ s.t. } \alpha, b > 0, \end{align}
where "s.t." is read "subject to." If we explicitly consider the constraints, we are solving a constrained optimization problem. Constrained optimization is often considerably more challenging than unconstrained optimization. There are, however, ways around simple positivity constraints such as the ones here. We can instead define new variables $\xi_\alpha = \ln \alpha$ and $\xi_b = \ln b$ and write the log-likelihood in terms of these variables. We then find the values of $\xi_\alpha$ and $\xi_b$ that minimize the objective function and convert them back to $\alpha$ and $b$ by exponentiation.
Numerical optimization is implemented in the scipy.optimize submodule (docs). Most of the functionality you need is in the scipy.optimize.minimize() function. To use the function to find minimizers of an objective function, the standard call signature is
scipy.optimize.minimize(fun, x0, args=(), method='BFGS')
The fun argument is a function with call signature fun(x, *args), where x is an array of the variables used in the optimization. In the case of MLE, the function is the negative log-likelihood function, x is an array of the parameter values we are trying to estimate, and the remaining arguments are additional arguments passed into the likelihood function, which always include the measured data. Importantly, we have to provide a guess as to which values of the parameters are optimal. This is passed as the array x0. The kwarg args specifies which additional arguments are to be passed to fun(); note that args must be a tuple. Finally, the method keyword argument specifies which numerical optimization method to use, the default being the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. This is a good algorithm, but it does compute derivatives, so it is only useful if the parameter values can take on any real value.
We have omitted the bounds keyword argument here because we will not usually use it, as we will either use the logarithm trick above or use Powell's method, which does not require calculation of derivatives (so we may have discontinuities in the objective function and can set the value of the objective function to infinity for disallowed parameter values).
Guess at optimal parameters¶
We will use the method of moments to guess the optimal parameters. Referring to the Distribution Explorer for the first moment and variance, we can equate them with the plug-in estimates.
\begin{align} &\langle n \rangle = \frac{\alpha}{\beta} = \alpha b = \bar{n},\\[1em] &\sigma^2 = \frac{\alpha(1+\beta)}{\beta^2} = \alpha b(1+b) = \hat{\sigma}^2. \end{align}
Solving gives our method-of-moments estimates,
\begin{align} &\alpha_\mathrm{mom} = \frac{\bar{n}}{b_\mathrm{mom}} = \frac{\bar{n}^2}{\hat{\sigma}^2 - \bar{n}},\\[1em] &b_\mathrm{mom} = \frac{\hat{\sigma}^2}{\bar{n}} - 1. \end{align}
We can compute the estimates from the plug-in values.
# Extract the values for Nanog
n = df.get_column('Nanog').to_numpy()
# Compute
b_mom = np.var(n) / np.mean(n) - 1
alpha_mom = np.mean(n) / b_mom
# Take a look
print(f"α_mom = {alpha_mom}\nb_mom = {b_mom}")
α_mom = 1.7700110004078886
b_mom = 49.569381904553985
Note that the method of moments is not the only way to come up with guesses. As an example, if the data are not actually generated from a Negative Binomial distribution (or even if they are and there are not too many data points), then we may have $\hat{\sigma}^2 < \bar{n}$, which would result in negative estimates for both $\alpha$ and $b$, which are not even allowed values of the parameters. It turns out that in this case, the optimization works just fine if we arbitrarily pick something like $\alpha \approx b \approx 3$.
Solving with the BFGS algorithm¶
We will now solve the minimization problem using the BFGS algorithm, specifying the parameters using logarithms to make sure that the problem is completely unconstrained. First, we have to write a function for the log-likelihood matching the required function signature of the input fun to scipy.optimize.minimize(). Note that we do not have to hand-code the log-likelihood. The scipy.stats module has functions to compute the log-PDF/log-PMF for many distributions. We just need to check the Distribution Explorer to ensure we use the parametrization that the scipy.stats module requires. In this case, it expects the parameters alpha and 1/(1+b).
def log_like_iid_nbinom_log_params(log_params, n):
"""Log likelihood for i.i.d. NBinom measurements with
input being logarithm of parameters.
Parameters
----------
log_params : array
Logarithm of the parameters alpha and b.
n : array
Array of counts.
Returns
-------
output : float
Log-likelihood.
"""
log_alpha, log_b = log_params
# Convert from log parameters
alpha = np.exp(log_alpha)
b = np.exp(log_b)
return np.sum(st.nbinom.logpmf(n, alpha, 1/(1+b)))
With the log-likelihood specified, we simply use -log_like_iid_nbinom_log_params() as our objective function, which we can succinctly code up as an anonymous (lambda) function. Let's perform the optimization for the nanog gene and look at the result.
# Solve, making sure to use log parameters
res = scipy.optimize.minimize(
fun=lambda log_params, n: -log_like_iid_nbinom_log_params(log_params, n),
x0=np.array([np.log(alpha_mom), np.log(b_mom)]),
args=(n,),
method='BFGS'
)
res
  message: Desired error not necessarily achieved due to precision loss.
  success: False
   status: 2
      fun: 1524.9284357729396
        x: [ 2.338e-01  4.241e+00]
      nit: 11
      jac: [ 3.204e-04  1.068e-04]
 hess_inv: [[ 8.545e-09  2.968e-07]
            [ 2.968e-07  1.043e-05]]
     nfev: 308
     njev: 96
The result returned by scipy.optimize.minimize() is an OptimizeResult object that has several attributes about how the optimization calculation went, including if it was successful. Importantly, the optimal log-parameter values are in the array x. We can extract them and exponentiate them to get the MLE.
alpha_mle, b_mle = np.exp(res.x)
print("α: ", alpha_mle)
print("b: ", b_mle)
α: 1.263433985420747
b: 69.44437003467432
So, the MLE for the burst frequency is about 1.26 inverse degradation times. The MLE for the burst size is about 70 transcripts per burst.
Solving using Powell's method¶
As an alternative to the BFGS method, we can use Powell's method. This has the advantage that we do not have to use derivatives in the optimization, so we do not have to use logarithms of the parameters. We do, however, need to specify that the log-likelihood is minus infinity for disallowed parameter values.
def log_like_iid_nbinom(params, n):
"""Log likelihood for i.i.d. NBinom measurements."""
alpha, b = params
if alpha <= 0 or b <= 0:
return -np.inf
return np.sum(st.nbinom.logpmf(n, alpha, 1/(1+b)))
We take a similar approach to solving with Powell's method. This time, we will catch warnings because the solver will stumble into regions where the log-likelihood is minus infinity. We know this to be the case, as we designed it that way, so we will suppress the warnings to keep our notebook clean. We will also tighten the tolerance of the solver using the tol keyword argument, insisting on very small changes in the parameter values and objective function when finding the minimum. This results in a more accurate MLE at the cost of more computational effort.
with warnings.catch_warnings():
warnings.simplefilter("ignore")
res = scipy.optimize.minimize(
fun=lambda params, n: -log_like_iid_nbinom(params, n),
x0=np.array([alpha_mom, b_mom]),
args=(n,),
method='Powell',
tol=1e-6,
)
if res.success:
alpha_mle, b_mle = res.x
else:
raise RuntimeError('Convergence failed with message', res.message)
print("α: ", alpha_mle)
print("b: ", b_mle)
α: 1.2634329776733069
b: 69.44440337333538
This differs from the result we got with BFGS only in the fourth decimal place and beyond, in part because of the reparametrization in terms of logarithms; the difference is not big.
The likelihood function¶
To help give a picture of what the likelihood function looks like, and what the optimizer is doing, we can plot it. In this case, we have two parameters, so we can make a contour plot. We first compute the log-likelihood for various values of $\alpha$ and $b$.
# alpha and b values for plotting
alpha = np.linspace(1, 1.5, 100)
b = np.linspace(50, 90, 100)
# Compute log-likelihood for each value
log_like = np.empty((100, 100))
for j, alpha_val in enumerate(alpha):
for i, b_val in enumerate(b):
log_like[i, j] = log_like_iid_nbinom((alpha_val, b_val), n)
Remember that the likelihood function is not a probability distribution, so it is not normalized. When we exponentiate the log-likelihood, we may get values close to zero, or very large. It is therefore a good idea to first subtract the maximal value of all computed log-likelihoods. This has the effect of multiplying the likelihood function by a constant.
like = np.exp(log_like - log_like.max())
Now, we can make a contour plot using the bebi103.viz.contour() function.
p = bebi103.viz.contour(alpha, b, like, overlaid=True, x_axis_label="α*", y_axis_label="b*")
bokeh.io.show(p)
Graphically, we can see that we appropriately found the maximum.
A quick visualization¶
We can do a quick visualization of our MLE to see if the model holds up. We can conveniently use the scipy.stats module to generate the CDF. It is probably overkill, since we have such a wide range of mRNA counts, but we will take care to make sure we plot the theoretical CDF as a staircase, since it is for a discrete distribution.
p = iqplot.ecdf(data=df['Nanog'], q='Nanog', conf_int=True)
n_theor = np.arange(0, df['Nanog'].max()+1)
cdf_theor = st.nbinom.cdf(n_theor, alpha_mle, 1/(1+b_mle))
# Weave together to make staircase for discrete distribution
n_plot = np.empty(2 * len(n_theor))
cdf_plot = np.empty(2 * len(n_theor))
cdf_plot[0] = 0
cdf_plot[1::2] = cdf_theor
cdf_plot[2::2] = cdf_theor[:-1]
n_plot[::2] = n_theor
n_plot[1::2] = n_theor
p.line(n_plot, cdf_plot, line_color='orange', line_width=2)
bokeh.io.show(p)
The MLE curve deviates from the nonparametric ECDF 95% confidence interval. This suggests we may be missing something in our model.
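Finally, as described earlier in this lesson, we could attach bootstrap confidence intervals to these MLEs by resampling the counts and recomputing the MLE for each bootstrap sample. A minimal sketch, reusing the data array n, the method-of-moments guesses, and log_like_iid_nbinom() defined above (with a modest number of replicates to keep the run time down):
# Sketch: bootstrap confidence intervals for the MLE of (α, b), reusing
# n, alpha_mom, b_mom, and log_like_iid_nbinom() from the cells above.
rng = np.random.default_rng(3252)

def mle_nbinom(counts):
    """MLE of (α, b) for i.i.d. NBinom counts, via Powell's method."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        res = scipy.optimize.minimize(
            fun=lambda params, counts: -log_like_iid_nbinom(params, counts),
            x0=np.array([alpha_mom, b_mom]),
            args=(counts,),
            method="Powell",
        )
    return res.x

# Resample the counts with replacement; each resample gives one MLE replicate
bs_reps = np.array(
    [mle_nbinom(rng.choice(n, size=len(n))) for _ in range(1000)]
)

# 95% bootstrap confidence intervals for α* and b*
print("α*:", np.percentile(bs_reps[:, 0], [2.5, 97.5]))
print("b*:", np.percentile(bs_reps[:, 1], [2.5, 97.5]))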