11  Likelihood

Let’s consider a sequence of random variables $X_n=(X_1,\dots,X_n)$ with known parametric joint density, defined by a vector of parameters $\theta_0$ of dimension $p$. Then, $f$ can be seen as a function that takes as input a vector in $\mathbb{R}^{n+p}$ and gives as output a scalar in $\mathbb{R}^+$. In fact, $n$ arguments come from a possible realized sample $x_n=(x_1,\dots,x_n)$, while $p$ arguments come from the vector of parameters $\theta=(\theta_1,\dots,\theta_p)\in\Theta$, where $\Theta$ is the admissible parameter space, i.e.
$$f_{X_1,\dots,X_n;\theta_1,\dots,\theta_p}:\mathbb{R}^n\times\mathbb{R}^p\to\mathbb{R}^+.$$
In general, when the set of parameters $\theta$ is considered fixed and the input of the function is only a vector of random variables (or scalars) in $\mathbb{R}^n$, then $f$ is called the density function, i.e.
$$f_{X_n}(X_n\mid\theta):\mathbb{R}^n\to\mathbb{R}^+.$$
On the other hand, when we fix the sample to a particular value $X_n=x_n$ and we let the vector of parameters be the input of the function, we obtain the observed likelihood, i.e.
$$f_{X_n}(\theta\mid X_n=x_n):\mathbb{R}^p\to\mathbb{R}^+,$$
which represents the probability (density) of observing the realized sample $x_n=(x_1,\dots,x_n)$, given a vector of parameters $\theta$. Usually, the likelihood is denoted as
$$L(\theta\mid X_n=x_n)=L(\theta\mid x_1,\dots,x_n).\tag{11.1}$$
In practice, for a given value of $\theta$, the likelihood expresses how likely it is that the data are generated under the distribution implied by $f_{X_1,\dots,X_n}$. The observed log-likelihood function is computed by taking the logarithm of the likelihood (11.1), i.e.
$$\ell(\theta\mid X_n=x_n)=\log\big(L(\theta\mid X_n=x_n)\big).\tag{11.2}$$
When the log-likelihood is differentiable, we can define the observed gradient of the log-likelihood with respect to the parameter vector $\theta$, computed on a realized sample $x_n$, i.e.
$$\nabla\ell(\theta\mid x_n)=\begin{pmatrix}\dfrac{\partial}{\partial\theta_1}\ell(\theta\mid x_n)\\ \dfrac{\partial}{\partial\theta_2}\ell(\theta\mid x_n)\\ \vdots\\ \dfrac{\partial}{\partial\theta_p}\ell(\theta\mid x_n)\end{pmatrix},$$
and the observed Jacobian of the gradient, i.e.
$$J(\theta\mid x_n)=\begin{pmatrix}\dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_1^2} & \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_1\partial\theta_2} & \cdots & \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_1\partial\theta_p}\\ \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_2\partial\theta_1} & \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_2^2} & \cdots & \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_2\partial\theta_p}\\ \vdots & \vdots & \ddots & \vdots\\ \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_p\partial\theta_1} & \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_p\partial\theta_2} & \cdots & \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_p^2}\end{pmatrix}.$$
If we consider the gradient as a function of the random variables, then the population Score is the expected value of the gradient, i.e.
$$S(\theta)=\mathbb{E}\{\nabla\ell(\theta\mid X_n)\}.\tag{11.3}$$
Notably, when the joint density function is correctly specified, the Score computed at the true population parameter $\theta_0$ is equal to zero, i.e.
$$S(\theta_0)=0.$$
Similarly, we define the population Hessian as the expected value of the Jacobian of the gradient, i.e.
$$H(\theta)=\mathbb{E}\{J(\theta\mid X_n)\}.\tag{11.4}$$
Notably, the relation
$$\mathbb{E}\{J(\theta\mid X_n)\}=-\mathbb{E}\{\nabla\ell(\theta\mid X_n)\,\nabla\ell(\theta\mid X_n)^\top\}$$
holds only if the model is correctly specified; under misspecification it does not hold anymore. The Fisher information is related to the population Hessian matrix as follows:
$$I(\theta)=-H(\theta).\tag{11.5}$$
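To make the two readings of $f$ concrete, here is a minimal R sketch (with an assumed standard Normal model and a hypothetical observation $x=1.3$): the same dnorm call acts as a density when the parameters are fixed and as a likelihood when the observation is fixed.

# density view: the parameters are fixed, the argument is the data point x
dens_at_x <- function(x) dnorm(x, mean = 0, sd = 1)
dens_at_x(1.3)                       # f(x | mu = 0, sigma = 1)
# likelihood view: the data point is fixed, the argument is the parameter mu
lik_at_mu <- function(mu) dnorm(1.3, mean = mu, sd = 1)
lik_at_mu(c(-1, 0, 1, 1.3))          # L(mu | x = 1.3) for several values of mu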

11.1 ML Estimator

In statistics, a standard method for estimating the unknown true vector of parameters $\theta_0$ is Maximum Likelihood (ML).

Under the assumption that the observed sample comes from a known parametric density function $f_{X_n}$, the maximum likelihood estimator is obtained by maximizing the likelihood function, so that, under the assumed distribution, the observed data are the most probable.

More precisely, considering a realized sample $X_n=x_n$ with $x_n=(x_1,\dots,x_n)$, the Maximum Likelihood Estimator (MLE), denoted as $\hat\theta^{ML}$, maximizes the likelihood (11.1), i.e.
$$\hat\theta_n^{ML}=\arg\max_{\theta\in\Theta}\,L(\theta\mid x_n).$$
Since the logarithm is a monotone function, instead of maximizing the likelihood directly, it is usually preferable to maximize the log-likelihood (11.2) or, equivalently, to minimize the negative log-likelihood, i.e.
$$\hat\theta_n^{ML}=\arg\max_{\theta\in\Theta}\,\ell(\theta\mid x_n)=\arg\min_{\theta\in\Theta}\,\big[-\ell(\theta\mid x_n)\big].$$
In general, the true population Score (11.3) is not directly observable; therefore, given a realized sample $x_n$, the maximum likelihood estimator solves the system of observed Score equations set equal to zero. More precisely, in the vector case the first order conditions (FOC) for the occurrence of a maximum (or a minimum) are
$$\nabla\ell(\hat\theta_n^{ML}\mid x_n)=0\iff\begin{cases}\dfrac{\partial}{\partial\theta_1}\ell(\hat\theta_n^{ML}\mid x_n)=0\\ \dfrac{\partial}{\partial\theta_2}\ell(\hat\theta_n^{ML}\mid x_n)=0\\ \quad\vdots\\ \dfrac{\partial}{\partial\theta_p}\ell(\hat\theta_n^{ML}\mid x_n)=0\end{cases}\tag{11.6}$$
In general, these equations may not have a closed-form solution, so the estimate is obtained by numerical methods.
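As a minimal sketch of the numerical route, consider (as an assumed example) an IID Exponential sample with rate $\lambda$, whose observed score is $n/\lambda-\sum_i x_i$: the score equation can be solved with a root finder and compared with the closed-form MLE $\hat\lambda=1/\bar{x}$.

set.seed(123)
x <- rexp(200, rate = 2)                 # simulated IID sample, true lambda = 2
# observed score of an IID Exponential sample: n / lambda - sum(x)
score_exp <- function(lambda, x) length(x) / lambda - sum(x)
# solve the score equation numerically on a bracketing interval
lambda_hat_num <- uniroot(score_exp, interval = c(1e-6, 100), x = x)$root
# closed-form MLE for comparison: 1 / sample mean
lambda_hat_cf <- 1 / mean(x)
c(numerical = lambda_hat_num, closed_form = lambda_hat_cf)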

The observed Jacobian is the matrix of second-order partial and cross-partial derivatives computed at the MLE estimate,
$$J(\hat\theta_n^{ML}\mid x_n)=\begin{pmatrix}\dfrac{\partial^2\ell(\hat\theta_n^{ML}\mid x_n)}{\partial\theta_1^2} & \cdots & \dfrac{\partial^2\ell(\hat\theta_n^{ML}\mid x_n)}{\partial\theta_1\partial\theta_p}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial^2\ell(\hat\theta_n^{ML}\mid x_n)}{\partial\theta_p\partial\theta_1} & \cdots & \dfrac{\partial^2\ell(\hat\theta_n^{ML}\mid x_n)}{\partial\theta_p^2}\end{pmatrix},$$
and represents the actual curvature of the log-likelihood at the observed sample. To ensure a strict local maximum, the observed Jacobian must be negative definite when computed at the MLE estimate, i.e. when $\theta=\hat\theta_n^{ML}$.

Global vs. local maximum

In general, if the Hessian is positive definite (all eigenvalues positive) at $\hat\theta^{ML}$, then it corresponds to a local minimum. If the Hessian is negative definite (all eigenvalues negative), then it corresponds to a local maximum. If the Hessian has both positive and negative eigenvalues, then it corresponds to a saddle point (neither a maximum nor a minimum).

If the entire log-likelihood function is globally concave in $\theta$ (i.e. $J(\theta)$ is negative semi-definite everywhere), then any local maximum is automatically a global maximum. But if $\ell$ is not globally concave, negative definiteness at $\hat\theta^{ML}$ only guarantees a local maximum. Many classical families of distributions have log-concave likelihoods: for example, exponential-family distributions (Normal with known variance, Poisson, Bernoulli, Exponential, etc.) have log-likelihoods that are globally concave in their natural parameters. In those cases, negative definiteness at a stationary point implies that this stationary point is the unique global maximizer. On the other hand, mixture models (e.g. Gaussian mixtures) have non-concave likelihoods and may admit multiple local maxima: the MLE can then be non-unique, and a numerical optimizer may converge to a local maximum. A quick numerical check of the curvature at the optimum is sketched below.
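The following sketch (assuming a simulated Normal sample, similar to Example 11.1 below) asks optim for the Hessian of the negative log-likelihood at the optimum and inspects its eigenvalues: if they are all positive, the observed Jacobian of the log-likelihood is negative definite, i.e. the optimizer found a strict local maximum.

set.seed(1)
x <- rnorm(500, mean = 1, sd = 2)
# negative log-likelihood of a Normal sample, params = c(mean, sd)
negloglik <- function(params, x) -sum(dnorm(x, mean = params[1], sd = params[2], log = TRUE))
# ask optim for the Hessian of the negative log-likelihood at the optimum
opt <- optim(c(0, 1), negloglik, x = x, hessian = TRUE)
eigen(opt$hessian)$values   # all positive => J(theta_hat | x_n) is negative definite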

Invariance property of the MLE

A property of the MLE is invariance under one-to-one (monotone) transformations: if $\hat\theta_n^{ML}$ is the MLE of $\theta$, then for any one-to-one transformation $\phi=g(\theta)$, the MLE of $\phi$ is $\hat\phi^{ML}=g(\hat\theta_n^{ML})$.
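A quick numerical illustration of the invariance property, as a sketch under the assumption of a Normal sample with known zero mean: the MLE of $\sigma$ obtained by transforming the MLE of $\sigma^2$ coincides with the one obtained by maximizing directly over $\sigma$.

set.seed(42)
x <- rnorm(1000, mean = 0, sd = 3)
# MLE of sigma^2 for a Normal with known mean 0 (closed form)
sigma2_hat <- mean(x^2)
# by invariance, the MLE of sigma = g(sigma^2) = sqrt(sigma^2)
sqrt(sigma2_hat)
# direct numerical maximization over sigma gives the same value
negloglik_sigma <- function(sigma, x) -sum(dnorm(x, mean = 0, sd = sigma, log = TRUE))
optimize(negloglik_sigma, interval = c(0.01, 10), x = x)$minimum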

11.1.1 Independent sample

In the special case in which the sample is composed of independent observations, the joint density factorizes into the product of the single densities, i.e.
$$f_{X_n}(x_1,\dots,x_n\mid\theta)=\prod_{i=1}^n f_{X_i}(x_i\mid\theta).$$
Therefore, the log-likelihood (11.2) also simplifies into the sum of the per-observation log-likelihoods, where $f_{X_i}$ can be different for each random variable $X_i$, i.e.
$$\ell(\theta\mid x_n)=\sum_{i=1}^n\ell_i(\theta\mid x_i),$$
where the log-likelihood of the $i$-th observation depends on the density of $X_i$ and reads
$$\ell_i(\theta\mid x_i)=\log\big[f_{X_i}(x_i\mid\theta)\big],$$
and the true population Score per observation reads
$$S_i(\theta)=\mathbb{E}\{\nabla\ell_i(\theta\mid X_i)\}.$$
Hence, the total gradient vector is obtained as the sum of the per-observation gradients, i.e.
$$\nabla\ell(\theta\mid x_n)=\sum_{i=1}^n\nabla\ell_i(\theta\mid x_i)=\begin{pmatrix}\sum_{i=1}^n\dfrac{\partial\ell_i(\theta\mid x_i)}{\partial\theta_1}\\ \sum_{i=1}^n\dfrac{\partial\ell_i(\theta\mid x_i)}{\partial\theta_2}\\ \vdots\\ \sum_{i=1}^n\dfrac{\partial\ell_i(\theta\mid x_i)}{\partial\theta_p}\end{pmatrix},$$
and the true population Score is the sum of the per-observation scores, i.e.
$$S(\theta)=\sum_{i=1}^n S_i(\theta).$$
Similarly, the Jacobian of the $i$-th observation reads
$$J_i(\theta\mid x_i)=\begin{pmatrix}\dfrac{\partial^2\ell_i(\theta\mid x_i)}{\partial\theta_1^2} & \cdots & \dfrac{\partial^2\ell_i(\theta\mid x_i)}{\partial\theta_1\partial\theta_p}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial^2\ell_i(\theta\mid x_i)}{\partial\theta_p\partial\theta_1} & \cdots & \dfrac{\partial^2\ell_i(\theta\mid x_i)}{\partial\theta_p^2}\end{pmatrix}.$$
For independent samples, the Hessian per observation is
$$H_i(\theta)=\mathbb{E}\{J_i(\theta\mid X_i)\},$$
and the Fisher information per observation is
$$I_i(\theta)=-H_i(\theta).$$
Therefore, the total population Hessian reads
$$H(\theta)=\sum_{i=1}^n H_i(\theta),$$
and the total Fisher information
$$I(\theta)=\sum_{i=1}^n I_i(\theta).$$
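As an assumed concrete example of an independent but not identically distributed sample, consider Normal observations sharing the mean $\mu$ but with known, observation-specific variances $\sigma_i^2$: the total log-likelihood is the sum of heterogeneous per-observation terms, and solving the score equation $\sum_i(x_i-\mu)/\sigma_i^2=0$ gives a precision-weighted mean.

set.seed(7)
n <- 300
sigma2_i <- runif(n, min = 0.5, max = 4)   # known, observation-specific variances
x <- rnorm(n, mean = 2, sd = sqrt(sigma2_i))
# total log-likelihood: sum of per-observation log-densities (a different f_{X_i} for each i)
loglik_mu <- function(mu) sum(dnorm(x, mean = mu, sd = sqrt(sigma2_i), log = TRUE))
# numerical maximizer of the summed log-likelihood
mu_hat_num <- optimize(function(mu) -loglik_mu(mu), interval = c(-10, 10))$minimum
# solving the score equation sum_i (x_i - mu) / sigma_i^2 = 0 gives the precision-weighted mean
mu_hat_wm <- sum(x / sigma2_i) / sum(1 / sigma2_i)
c(numerical = mu_hat_num, weighted_mean = mu_hat_wm)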

11.1.2 IID sample

In the special case in which the sample is composed of independent and identically distributed (IID) observations, the joint density factorizes as
$$f_{X_n}(x_1,\dots,x_n\mid\theta)=\prod_{i=1}^n f_{X_1}(x_i\mid\theta).$$
Therefore, the log-likelihood (11.2) simplifies to
$$\ell(\theta\mid x_n)=\sum_{i=1}^n\ell(\theta\mid x_i),$$
where
$$\ell(\theta\mid x_i)=\log\big[f_{X_1}(x_i\mid\theta)\big].$$
In this case, the gradient of the log-likelihood depends only on the different values assumed by $x_i$, i.e.
$$\nabla\ell(\theta\mid x_i)=\begin{pmatrix}\dfrac{\partial\ell(\theta\mid x_i)}{\partial\theta_1}\\ \dfrac{\partial\ell(\theta\mid x_i)}{\partial\theta_2}\\ \vdots\\ \dfrac{\partial\ell(\theta\mid x_i)}{\partial\theta_p}\end{pmatrix},$$
and the total gradient reads
$$\nabla\ell(\theta\mid x_n)=\sum_{i=1}^n\nabla\ell(\theta\mid x_i).$$
Therefore, taking the expected value of the gradient, one recovers the population Score (11.3), i.e.
$$S(\theta)=n\,\mathbb{E}\{\nabla\ell(\theta\mid X_1)\}.$$
Similarly, the matrix of second derivatives is
$$J(\theta\mid x_i)=\begin{pmatrix}\dfrac{\partial^2\ell(\theta\mid x_i)}{\partial\theta_1^2} & \cdots & \dfrac{\partial^2\ell(\theta\mid x_i)}{\partial\theta_1\partial\theta_p}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial^2\ell(\theta\mid x_i)}{\partial\theta_p\partial\theta_1} & \cdots & \dfrac{\partial^2\ell(\theta\mid x_i)}{\partial\theta_p^2}\end{pmatrix}.$$
Hence, the Hessian per observation is the same for each $X_i$, i.e.
$$H_i(\theta)=\mathbb{E}\{J(\theta\mid X_1)\},$$
and the Fisher information per observation is
$$I_i(\theta)=-\mathbb{E}\{J(\theta\mid X_1)\}.$$
Therefore, the total population Hessian is
$$H(\theta)=n\,\mathbb{E}\{J(\theta\mid X_1)\},$$
and the total Fisher information
$$I(\theta)=-n\,\mathbb{E}\{J(\theta\mid X_1)\}.$$

Exercise 11.1 Let’s consider an IID sample $x_n=(x_1,\dots,x_n)$, where each $x_i$ is drawn from a Normal distribution $X_i\sim N(\mu,\sigma^2)$ with known variance $\sigma^2$. The likelihood, given the realized sample and the known variance, reads
$$L(\mu\mid x_n,\sigma^2)=\prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right).\tag{11.7}$$
Compute the Fisher information according to (11.5).

Solution 11.1. Let’s consider the given joint density; then, by definition (11.2), the log-likelihood of the $x_i$ observation, given $\sigma^2$, reads
$$\ell(\mu\mid x_i,\sigma^2)=-\frac{1}{2}\log(2\pi)-\frac{1}{2}\left[\log(\sigma^2)+\left(\frac{x_i-\mu}{\sigma}\right)^2\right].$$
Therefore, the total log-likelihood is the sum of the per-observation log-likelihoods,
$$\ell(\mu\mid x_n,\sigma^2)=\sum_{i=1}^n\ell(\mu\mid x_i,\sigma^2)=-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2.$$
The first derivative of the log-likelihood with respect to the mean parameter $\mu$, given the observation $x_i$, reads
$$\frac{\partial\ell(\mu\mid x_i,\sigma^2)}{\partial\mu}=\frac{x_i-\mu}{\sigma^2},$$
and the second derivative
$$\frac{\partial^2\ell(\mu\mid x_i,\sigma^2)}{\partial\mu^2}=-\frac{1}{\sigma^2}.$$
Therefore, the observed score with respect to $\mu$ is the sum of the per-observation scores, i.e.
$$\frac{\partial\ell(\mu\mid x_n,\sigma^2)}{\partial\mu}=\frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu),$$
and similarly
$$\frac{\partial^2\ell(\mu\mid x_n,\sigma^2)}{\partial\mu^2}=-\frac{n}{\sigma^2}.$$
Hence, the Fisher information (11.5) reads
$$I(\mu)=-n\,\mathbb{E}\left\{\frac{\partial^2\ell(\mu\mid X_1,\sigma^2)}{\partial\mu^2}\right\}=\frac{n}{\sigma^2}.$$
Intuitively, since for a Normal random variable a greater $\sigma^2$ implies a more dispersed distribution around the center $\mu$, the information about the mean parameter contained in an observed sample of $n$ observations decreases as $\sigma^2$ increases. Moreover, it is interesting to note that as the number of observations $n$ increases, the impact of any fixed variance $0<\sigma^2<\infty$ becomes negligible.
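A small Monte Carlo sketch (with assumed values $n=50$, $\mu=1$, $\sigma^2=4$) illustrates the result: the variance of the MLE $\hat\mu=\bar{x}$ across replications is close to the inverse Fisher information $1/I(\mu)=\sigma^2/n$.

set.seed(11)
n <- 50; mu <- 1; sigma2 <- 4
# with known variance the MLE of mu is the sample mean; replicate it over many samples
mu_hat <- replicate(10000, mean(rnorm(n, mean = mu, sd = sqrt(sigma2))))
var(mu_hat)    # Monte Carlo variance of the MLE
sigma2 / n     # inverse Fisher information: 1 / I(mu) = sigma^2 / n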

11.1.3 Properties of the MLE

Let $\hat\theta_n^{ML}$ be the MLE estimate of some unknown true population parameter $\theta_0$. Then, under the main regularity conditions (i.e. identifiability, differentiability, non-singularity of the Fisher information), the MLE has two fundamental properties:

  1. Consistency: the MLE is consistent for the true population parameter $\theta_0$, in the sense that, as the number of observations $n$ increases, it converges in probability to the true population parameter, i.e. $\hat\theta_n^{ML}\xrightarrow{\;p\;}\theta_0$ as $n\to\infty$.
  2. Asymptotic Normality: the asymptotic joint distribution of the MLE is multivariate Normal of dimension $p$, where $p$ is the number of parameters, i.e. $\hat\theta_n^{ML}\xrightarrow{\;d\;}MVN_p\!\left(\theta_0,\tfrac{1}{n}I^{-1}(\theta_0)\right)$, where here $I(\theta_0)$ denotes the Fisher information of a single observation. Equivalently, the distribution of the rescaled estimation error converges to a multivariate Normal with mean zero and covariance matrix given by the inverse Fisher information, i.e. $\sqrt{n}\,(\hat\theta_n^{ML}-\theta_0)\xrightarrow{\;d\;}MVN_p\!\left(0,I^{-1}(\theta_0)\right)$. A small simulation check of this property is sketched below.
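As a simulation sketch of the asymptotic Normality property (under an assumed IID Exponential sample with rate $\lambda_0=2$, for which the per-observation Fisher information is $1/\lambda^2$), the Monte Carlo distribution of $\sqrt{n}\,(\hat\lambda_n^{ML}-\lambda_0)$ can be compared with its theoretical Normal limit $N(0,\lambda_0^2)$.

set.seed(22)
n <- 500; lambda0 <- 2
# MLE of an Exponential rate: lambda_hat = 1 / sample mean, replicated over many samples
lambda_hat <- replicate(5000, 1 / mean(rexp(n, rate = lambda0)))
z <- sqrt(n) * (lambda_hat - lambda0)
c(mc_sd = sd(z), theory_sd = lambda0)   # asymptotic variance I(lambda0)^{-1} = lambda0^2
# graphical normality check of the rescaled estimation error
qqnorm(z); qqline(z)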

11.1.4 MLE in the Gaussian case

Let’s consider a sample of $n$ independent and identically distributed random variables $x_n=\{x_1,x_2,\dots,x_n\}$. All the observations are drawn from the same probability distribution, a Normal with unknown true population parameters $\mu$ and $\sigma^2$. Specifically, the joint density of $n$ IID Normal random variables reads as in (11.7). Thus, the likelihood is a function of the unknown parameters given the observed data, i.e.
$$L(\mu,\sigma^2\mid x_n)=\prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right).$$
The log-likelihood function (11.2) is obtained by taking the natural logarithm of the likelihood, i.e.
$$\begin{aligned}\ell(\mu,\sigma^2\mid x_n)&=\ln L(\mu,\sigma^2\mid x_n)\\&=\sum_{i=1}^n\ln\left(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\right)\\&=\sum_{i=1}^n\left(-\frac{1}{2}\ln(2\pi)-\frac{1}{2}\ln(\sigma^2)-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\\&=-\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln(\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2.\end{aligned}$$

  1. Partial derivatives with respect to the parameters $\theta=(\mu,\sigma^2)$. The partial derivative of the log-likelihood with respect to the mean parameter for the $x_i$ observation reads
$$\frac{\partial\ell_i(\mu,\sigma^2\mid x_i)}{\partial\mu}=\frac{x_i-\mu}{\sigma^2},$$
and the total partial derivative reads
$$\frac{\partial\ell(\mu,\sigma^2\mid x_n)}{\partial\mu}=\frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu).$$
Similarly, the first derivative of the log-likelihood with respect to the variance parameter reads
$$\frac{\partial\ell_i(\mu,\sigma^2\mid x_i)}{\partial\sigma^2}=-\frac{1}{2\sigma^2}+\frac{(x_i-\mu)^2}{2\sigma^4},$$
and the total partial derivative reads
$$\frac{\partial\ell(\mu,\sigma^2\mid x_n)}{\partial\sigma^2}=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^n(x_i-\mu)^2.$$

  2. Score vector: the complete score vector reads
$$\nabla\ell(\mu,\sigma^2\mid x_n)=\begin{pmatrix}\dfrac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu)\\ -\dfrac{n}{2\sigma^2}+\dfrac{1}{2\sigma^4}\sum_{i=1}^n(x_i-\mu)^2\end{pmatrix}.$$
Notably, if the $X_i$ are truly Normal, then the expected value of the gradient, i.e. the Score, is zero when evaluated at the true parameters $\mu$ and $\sigma^2$:
$$\mathbb{E}\{\nabla\ell(\mu,\sigma^2\mid X_n)\}=\begin{pmatrix}\dfrac{1}{\sigma^2}\sum_{i=1}^n(\mathbb{E}\{X_i\}-\mu)\\ -\dfrac{n}{2\sigma^2}+\dfrac{1}{2\sigma^4}\sum_{i=1}^n\mathbb{E}\{(X_i-\mu)^2\}\end{pmatrix}=\begin{pmatrix}\dfrac{1}{\sigma^2}\sum_{i=1}^n(\mu-\mu)\\ -\dfrac{n}{2\sigma^2}+\dfrac{1}{2\sigma^4}\sum_{i=1}^n\sigma^2\end{pmatrix}=\begin{pmatrix}0\\0\end{pmatrix},$$
which is equivalent to saying that the model is correctly specified.

To find the maximum likelihood estimates of the unknown parameters, one has to search for the values of $\mu$ and $\sigma^2$ such that the observed Score equations are equal to zero. In general, this requires solving the system of first order conditions (11.6), i.e.
$$\nabla\ell(\mu,\sigma^2\mid x_n)=0\iff\begin{cases}\dfrac{\partial\ell(\mu,\sigma^2\mid x_n)}{\partial\mu}=0\\ \dfrac{\partial\ell(\mu,\sigma^2\mid x_n)}{\partial\sigma^2}=0\end{cases}$$
Hence, one obtains a system of two equations in two unknowns. From the first equation, which depends only on $\mu$, the solution is exactly the sample mean, i.e.
$$\frac{\partial\ell(\mu,\sigma^2\mid x_n)}{\partial\mu}=0\iff\hat\mu_{ML}=\frac{1}{n}\sum_{i=1}^n x_i.$$
Then, substituting $\hat\mu_{ML}$ in the second equation gives
$$\frac{\partial\ell(\mu,\sigma^2\mid x_n)}{\partial\sigma^2}=0\iff\hat\sigma^2_{ML}=\frac{1}{n}\sum_{i=1}^n(x_i-\hat\mu_{ML})^2.$$
To compute the Hessian matrix (11.4), one has to make the second and cross derivatives explicit, i.e.
$$\frac{\partial^2\ell(\mu,\sigma^2\mid x_n)}{\partial\mu^2}=-\frac{n}{\sigma^2},\qquad\frac{\partial^2\ell(\mu,\sigma^2\mid x_n)}{\partial(\sigma^2)^2}=\frac{n}{2\sigma^4}-\frac{1}{\sigma^6}\sum_{i=1}^n(x_i-\mu)^2,$$
and the cross derivative in $\mu$ and $\sigma^2$, i.e.
$$\frac{\partial^2\ell(\mu,\sigma^2\mid x_n)}{\partial\sigma^2\partial\mu}=-\frac{1}{\sigma^4}\sum_{i=1}^n(x_i-\mu).$$
Therefore, the Jacobian matrix computed at the MLE estimate reads:
$$J(\hat\mu_{ML},\hat\sigma^2_{ML}\mid x_n)=\begin{pmatrix}-\dfrac{n}{\hat\sigma^2_{ML}} & 0\\ 0 & \dfrac{n}{2\hat\sigma^4_{ML}}-\dfrac{1}{\hat\sigma^6_{ML}}\displaystyle\sum_{i=1}^n(x_i-\hat\mu_{ML})^2\end{pmatrix},$$
since the cross derivative computed at the MLE estimate is equal to zero. Hence, substituting $\sum_{i=1}^n(x_i-\hat\mu_{ML})^2=n\hat\sigma^2_{ML}$, the Hessian matrix evaluated at the estimates reads
$$H(\hat\mu_{ML},\hat\sigma^2_{ML})=\begin{pmatrix}-\dfrac{n}{\hat\sigma^2_{ML}} & 0\\ 0 & \dfrac{n}{2\hat\sigma^4_{ML}}-\dfrac{n\hat\sigma^2_{ML}}{\hat\sigma^6_{ML}}\end{pmatrix}.$$
The first diagonal element is always less than zero, and so is the second, i.e.
$$\frac{n}{2\hat\sigma^4_{ML}}-\frac{n\hat\sigma^2_{ML}}{\hat\sigma^6_{ML}}=-\frac{n}{2\hat\sigma^4_{ML}}<0.$$
Thus the Hessian is negative definite at the MLE, confirming a strict local maximum. Moreover, the theoretical Fisher information for $n$ IID observations,
$$I(\hat\mu_{ML},\hat\sigma^2_{ML})=-H(\hat\mu_{ML},\hat\sigma^2_{ML})=\begin{pmatrix}\dfrac{n}{\hat\sigma^2_{ML}} & 0\\ 0 & \dfrac{n}{2\hat\sigma^4_{ML}}\end{pmatrix},$$
is exactly $n$ times the Fisher information of a single observation, i.e.
$$I_i(\hat\mu_{ML},\hat\sigma^2_{ML})=\begin{pmatrix}\dfrac{1}{\hat\sigma^2_{ML}} & 0\\ 0 & \dfrac{1}{2\hat\sigma^4_{ML}}\end{pmatrix}.$$
Therefore, asymptotically, the MLE estimators are jointly Normally distributed, i.e.
$$\begin{pmatrix}\hat\mu_{ML}\\ \hat\sigma^2_{ML}\end{pmatrix}\xrightarrow{\;d\;}MVN_2\!\left(\begin{pmatrix}\mu\\ \sigma^2\end{pmatrix},\frac{1}{n}\begin{pmatrix}\sigma^2 & 0\\ 0 & 2\sigma^4\end{pmatrix}\right).$$

Example 11.1 Let’s simulate 5000 observations from an IID Normal distribution with mean $\mu=1$ and variance $\sigma^2=4$. Then, given the simulated sample, estimate the maximum likelihood parameters.

Maximum likelihood estimate
library(dplyr)
library(purrr)
################ inputs ################ 
set.seed(1)
n_sim <- 5000 # number of simulated observations
# true parameters
par <- c(mu = 1, sigma2 = 4)
# normal sample 
x <- rnorm(n_sim, mean = par[1], sd = sqrt(par[2]))
########################################
# negative log-likelihood for a normal sample, params = c(mean, sd)
loglik_norm <- function(params, x){
  -sum(dnorm(x, mean = params[1], sd = params[2], log = TRUE))
}
# Numerical optimization: minimize the negative log-likelihood (Nelder-Mead by default)
opt <- optim(c(0, 1), loglik_norm, x = x)
# MLE estimates of the mean and standard deviation, with the maximized log-likelihood
best_par <- tibble(mu = opt$par[1], sigma = opt$par[2], loglik = -opt$value)
$\hat\mu_{ML}$   $\mu$ (true)   Sample mean   $\sigma$ (true)   $\hat\sigma_{ML}$   Sample std. deviation
0.994            1              0.994         2                 2.053               2.053
Table 11.1: Mean and std. deviation computed on the simulated sample and MLE estimates.
Figure 11.1: Log-likelihood function for a normal sample.
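Continuing the example, a short check (reusing the simulated vector x and the optim output opt from the code above) compares the closed-form Gaussian MLEs derived in Section 11.1.4 with the numerical estimates.

# closed-form Gaussian MLEs from the first order conditions
mu_hat     <- mean(x)                 # sample mean
sigma2_hat <- mean((x - mu_hat)^2)    # MLE of the variance (divides by n, not n - 1)
# compare with the numerical optimum (optim parametrizes the std. deviation)
c(mu_closed = mu_hat, mu_optim = opt$par[1],
  sd_closed = sqrt(sigma2_hat), sd_optim = opt$par[2])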

11.2 Quasi-Maximum Likelihood

Quasi-Maximum Likelihood Estimation (QMLE), also known as pseudo-maximum likelihood, is an estimation strategy used when the full distribution may be misspecified but a parametric likelihood can still be posited (e.g., a correct conditional mean/variance specification with a wrong error distribution). In other words, we assume a parametric likelihood function for the data (perhaps incorrectly) and then maximize it as if it were true.

The QML estimator sacrifices some efficiency (if the model is misspecified there could exist estimators with lower variance) for robustness: it converges to the pseudo-true parameter $\theta^*$ (the best approximation within the assumed family), and valid inference follows from robust (sandwich) standard errors. This approach yields the quasi-MLE, an estimator that behaves like the true MLE under certain conditions but remains consistent even if the assumed distribution is wrong (see White).

11.2.1 QML Estimator

Let $f_{X_n}$ be the assumed parametric density function, depending on the vector of parameters $\theta$. Then, even if the density $f$ is misspecified, we define the pseudo-true parameter $\theta^*$ as the value that maximizes the expected log-likelihood under the true distribution (see Gourieroux, Monfort, and Trognon), i.e.
$$\theta^*=\arg\max_{\theta\in\Theta}\mathbb{E}\{\ell(\theta\mid X_n)\},$$
where the expectation is taken under the true data-generating process. In this case the log-likelihood is also called a pseudo-likelihood, because we do not require it to be based on the true density. Equivalently, $\theta^*$ is the value that minimizes the Kullback–Leibler (KL) divergence between the true density and the (possibly misspecified) parametric density. If the model is correctly specified, $\theta^*=\theta_0$ and the QMLE coincides with the MLE.

The QML estimator is the sample counterpart: it maximizes the average log-likelihood given the realized sample $x_n$, i.e.
$$\hat\theta_n^{QML}=\arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n\ell(\theta\mid x_i).$$
Then, to find $\hat\theta_n^{QML}$ one proceeds exactly as in ordinary MLE optimization, searching for the vector of parameters that maximizes the average log-likelihood.
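As an illustrative sketch with an assumed data-generating process, suppose the data are truly Exponential with rate 1 but a Normal quasi-likelihood is fitted: the QML estimates approach the pseudo-true parameters, which for the Normal family are the true mean and variance of the data (here both equal to 1), even though the model is misspecified.

set.seed(33)
x <- rexp(20000, rate = 1)             # true DGP: Exponential(1), mean = 1, variance = 1
# Normal (pseudo) negative log-likelihood, params = c(mu, sigma2)
negqlik <- function(params, x){
  if (params[2] <= 0) return(Inf)      # keep the variance positive during the search
  -sum(dnorm(x, mean = params[1], sd = sqrt(params[2]), log = TRUE))
}
qml <- optim(c(0, 2), negqlik, x = x)
qml$par                                # close to the pseudo-true values (1, 1)

Here the pseudo-true mean and variance coincide with the moments of the Exponential distribution, which is the sense in which the Normal fit is the best approximation within the assumed family.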

11.2.2 Properties

  1. Consistency of the QMLE: under the assumption that the model is identifiable at $\theta^*$, meaning that the expected log-likelihood has a unique maximum at $\theta^*$, and under standard regularity conditions (e.g. continuity of $f(x;\theta)$ in $\theta$), the QML estimator converges in probability, as $n\to\infty$, to the pseudo-true parameter, i.e.
$$\hat\theta_n^{QML}\xrightarrow{\;p\;}\theta^*.$$
If the model is correctly specified, then the pseudo-true parameter coincides with the actual true parameter, i.e. $\theta^*=\theta_0$, and the QMLE reduces to the ordinary MLE. However, when the model is misspecified, the QML estimator converges to the $\theta^*$ that produces the best approximation of the true distribution, in the sense that it minimizes their KL divergence.

  2. Asymptotic normality: as for the MLE, under regularity conditions the QMLE is asymptotically Normal. However, the asymptotic distribution must account for possible misspecification. More precisely, let’s define the expected curvature and outer-product matrices computed at the pseudo-true parameter as
$$A(\theta^*)=\mathbb{E}\{J(\theta^*\mid X_n)\}\qquad\text{and}\qquad B(\theta^*)=\mathbb{E}\{\nabla\ell(\theta^*\mid X_n)\,\nabla\ell(\theta^*\mid X_n)^\top\}.$$
Then, a result derived by Huber in the context of M-estimators and by White in the context of misspecified likelihoods establishes that, under misspecification, the information matrix equality $-A(\theta^*)=B(\theta^*)$ generally fails and the asymptotic variance is a sandwich of the two matrices, i.e.
$$\sqrt{n}\,(\hat\theta_n^{QML}-\theta^*)\xrightarrow{\;d\;}MVN\!\left(0,\,A^{-1}(\theta^*)\,B(\theta^*)\,A^{-1}(\theta^*)\right).$$
If the model is correctly specified, then $\theta^*=\theta_0$ and one obtains
$$A(\theta_0)=\mathbb{E}\{J(\theta_0\mid X_n)\}=-\mathbb{E}\{\nabla\ell(\theta_0\mid X_n)\,\nabla\ell(\theta_0\mid X_n)^\top\}=-B(\theta_0).$$
In this case the asymptotic variance simplifies, since $A^{-1}BA^{-1}=-A^{-1}$, and we recover the ML result, which is asymptotically efficient. More precisely, the asymptotic variance-covariance matrix of the QMLE equals that of the MLE,
$$\mathrm{Cv}\{\hat\theta_n^{QML}\}=\mathrm{Cv}\{\hat\theta_n^{ML}\}=-A^{-1}(\theta_0),$$
i.e. the inverse Fisher information. A noteworthy point is that the QMLE is generally not fully efficient if the model is misspecified: there might exist other estimators targeting the same quantity with smaller asymptotic variance, if one knew more about the true distribution. But the QMLE has the convenience of likelihood-based estimation and retains consistency and asymptotic normality without needing the full truth.

11.2.3 Sandwich Standard Errors

Given the asymptotic variance formula above, a crucial practical issue is how to compute standard errors for QMLE estimates. If one naively computes standard errors as if the assumed model were true (for example, using just the inverse Hessian), the inference can be invalid when the model is misspecified. The remedy is to use robust or sandwich standard errors, often called heteroskedasticity-consistent standard errors in econometrics (see White).

In practice, to compute the robust variance estimate, one proceeds as follows:

  1. First, obtain the QML estimates $\hat\theta_n^{QML}$ by maximizing the pseudo log-likelihood.

  2. Secondly, compute the average observed Jacobian (Hessian) matrix at the estimated parameter, the sample counterpart of $A(\theta^*)$, i.e.
$$\hat A(\hat\theta_n^{QML})=\frac{1}{n}\sum_{i=1}^n J_i(\hat\theta_n^{QML}\mid x_i).$$

  3. Compute the outer product of gradients matrix at the QMLE:
$$\hat B(\hat\theta_n^{QML})=\frac{1}{n}\sum_{i=1}^n\nabla\ell_i(\hat\theta_n^{QML}\mid x_i)\,\nabla\ell_i(\hat\theta_n^{QML}\mid x_i)^\top.$$
This uses the score of each observation and does not assume the model is correct. Hence, the estimated variance-covariance matrix of the QML parameters reads
$$\widehat{\mathrm{Cv}}\{\hat\theta_n^{QML}\}=\frac{1}{n}\,\hat A^{-1}(\hat\theta_n^{QML})\,\hat B(\hat\theta_n^{QML})\,\hat A^{-1}(\hat\theta_n^{QML}),$$
where the diagonal elements give the variance estimates of the single parameters and the off-diagonal elements their covariances, i.e.
$$\widehat{\mathrm{V}}\{\hat\theta_n^{QML}\}=\mathrm{diag}\big(\widehat{\mathrm{Cv}}\{\hat\theta_n^{QML}\}\big).$$
The $1/n$ factor applies when $\hat A$ and $\hat B$ are computed as averages; if they are computed as sums, the sandwich $\hat A^{-1}\hat B\hat A^{-1}$ is already on the right scale.

If the model is actually correctly specified, then $\hat B$ should be close to $-\hat A$ in large samples, and the robust variance reduces to the usual Fisher-information-based variance $-\hat A^{-1}/n$. Thus, the robust variance estimator is a conservative generalization: it equals the classical one in the ideal case, but remains valid in the non-ideal case.
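The following sketch puts the three steps together for a Normal quasi-likelihood under an assumed misspecified setting (Exponential data, Normal working model). The per-observation scores and Hessian terms are the analytic Normal expressions derived in Section 11.1.4, so no numerical differentiation is needed; the robust (sandwich) standard errors are then compared with the naive ones based on the quasi Fisher information alone.

set.seed(44)
x <- rexp(5000, rate = 1)                       # misspecified setting: true DGP is Exponential(1)
n <- length(x)
# QML estimates under the Normal working model (closed form: sample mean and variance)
mu_hat     <- mean(x)
sigma2_hat <- mean((x - mu_hat)^2)
# per-observation scores of the Normal log-likelihood at the QMLE, one row per observation
s_mu     <- (x - mu_hat) / sigma2_hat
s_sigma2 <- -1 / (2 * sigma2_hat) + (x - mu_hat)^2 / (2 * sigma2_hat^2)
scores   <- cbind(s_mu, s_sigma2)
# A_hat: average observed Hessian (Jacobian) of the Normal log-likelihood at the QMLE
A_hat <- matrix(c(-1 / sigma2_hat,
                  -mean(x - mu_hat) / sigma2_hat^2,
                  -mean(x - mu_hat) / sigma2_hat^2,
                  1 / (2 * sigma2_hat^2) - mean((x - mu_hat)^2) / sigma2_hat^3),
                nrow = 2, byrow = TRUE)
# B_hat: average outer product of the per-observation scores
B_hat <- crossprod(scores) / n
# sandwich variance-covariance matrix of the QML estimates
V_sandwich <- solve(A_hat) %*% B_hat %*% solve(A_hat) / n
# naive variance based on the (quasi) Fisher information only
V_naive <- solve(-A_hat) / n
cbind(robust_se = sqrt(diag(V_sandwich)), naive_se = sqrt(diag(V_naive)))

In this assumed setting the robust standard error for the variance parameter should be noticeably larger than the naive one, reflecting the heavy right tail of the Exponential data that the Normal working model ignores.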