11  Likelihood

Let’s consider a sequence of random variables $X_n=(X_1,\dots,X_n)$ with known parametric joint density, defined by a vector of parameters $\theta_0$ of dimension $p$. Then, $f$ can be seen as a function that takes as input a vector in $\mathbb{R}^{n+p}$ and gives as output a scalar in $\mathbb{R}^+$. In fact, $n$ arguments come from a possible realized sample $x_n=(x_1,\dots,x_n)$, while $p$ arguments come from the vector of parameters $\theta=(\theta_1,\dots,\theta_p)\in\Theta$, where $\Theta$ is the admissible parameter space, i.e.
$$f_{X_1,\dots,X_n;\theta_1,\dots,\theta_p}:\mathbb{R}^n\times\mathbb{R}^p\to\mathbb{R}^+.$$
In general, when the set of parameters $\theta$ is considered fixed and the input of the function is only a vector of random variables (or scalars) in $\mathbb{R}^n$, then $f$ is called the density function, i.e.
$$f_{X_n}(X_n\mid\theta):\mathbb{R}^n\to\mathbb{R}^+.$$
On the other hand, when we fix the sample to a particular value $X_n=x_n$ and we let the vector of parameters be the input of the function, we obtain the observed likelihood, i.e.
$$f_{X_n}(\theta\mid X_n=x_n):\mathbb{R}^p\to\mathbb{R}^+,$$
which represents the probability (density) of observing the realized sample $x_n=(x_1,\dots,x_n)$, given a vector of parameters $\theta$. Usually, the likelihood is denoted as
$$L(\theta\mid X_n=x_n)=L(\theta\mid x_1,\dots,x_n).\tag{11.1}$$
In practice, for a given value of $\theta$, the likelihood expresses how likely it is that the data are generated under the distribution implied by $f_{X_1,\dots,X_n}$. The observed log-likelihood function is computed by taking the logarithm of the likelihood (11.1), i.e.
$$\ell(\theta\mid X_n=x_n)=\log\big(L(\theta\mid X_n=x_n)\big).\tag{11.2}$$
When the log-likelihood is differentiable, we can define the observed gradient of the log-likelihood with respect to the parameter vector $\theta$, computed on a realized sample $x_n$, i.e.
$$\nabla\ell(\theta\mid x_n)=\begin{pmatrix}\dfrac{\partial}{\partial\theta_1}\ell(\theta\mid x_n)\\ \dfrac{\partial}{\partial\theta_2}\ell(\theta\mid x_n)\\ \vdots\\ \dfrac{\partial}{\partial\theta_p}\ell(\theta\mid x_n)\end{pmatrix},$$
and the observed Jacobian of the gradient, i.e.
$$J(\theta\mid x_n)=\begin{pmatrix}\dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_1^2} & \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_1\partial\theta_2} & \cdots & \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_1\partial\theta_p}\\ \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_2\partial\theta_1} & \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_2^2} & \cdots & \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_2\partial\theta_p}\\ \vdots & \vdots & \ddots & \vdots\\ \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_p\partial\theta_1} & \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_p\partial\theta_2} & \cdots & \dfrac{\partial^2\ell(\theta\mid x_n)}{\partial\theta_p^2}\end{pmatrix}.$$
If we consider the gradient as a function of the random variables, then the population Score is the expected value of the gradient, i.e.
$$S(\theta)=\mathbb{E}\{\nabla\ell(\theta\mid X_n)\}.\tag{11.3}$$
Notably, when the joint density function is correctly specified, the Score computed at the true population parameter $\theta_0$ is equal to zero, i.e.
$$S(\theta_0)=0.$$
Similarly, we define the population Hessian as the expected value of the Jacobian of the gradient, i.e.
$$H(\theta)=\mathbb{E}\{J(\theta\mid X_n)\}.\tag{11.4}$$
Notably, the relation
$$\mathbb{E}\{J(\theta\mid X_n)\}=-\mathbb{E}\{\nabla\ell(\theta\mid X_n)\,\nabla\ell(\theta\mid X_n)^\top\}$$
holds only if the model is correctly specified; under misspecification it does not hold anymore. The Fisher information is related to the population Hessian matrix as follows:
$$I(\theta)=-H(\theta).\tag{11.5}$$
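To make the two readings of $f$ concrete, here is a minimal R sketch (with an assumed standard Normal model and a hypothetical observation $x=1.3$): the same dnorm call acts as a density when the parameters are fixed and as a likelihood when the observation is fixed.

# density view: the parameters are fixed, the argument is the data point x
dens_at_x <- function(x) dnorm(x, mean = 0, sd = 1)
dens_at_x(1.3)                       # f(x | mu = 0, sigma = 1)
# likelihood view: the data point is fixed, the argument is the parameter mu
lik_at_mu <- function(mu) dnorm(1.3, mean = mu, sd = 1)
lik_at_mu(c(-1, 0, 1, 1.3))          # L(mu | x = 1.3) for several values of mu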

11.1 ML Estimator

In statistics, a standard method for estimating the unknown true vector of parameters $\theta_0$ is Maximum Likelihood (ML).

Under the assumption that the observed sample comes from a known parametric density function $f_{X_n}$, the maximum likelihood estimator is obtained by maximizing the likelihood function, so that, under the assumed distribution, the observed data are the most probable.

More precisely, considering a realized sample $X_n=x_n$ with $x_n=(x_1,\dots,x_n)$, the Maximum Likelihood Estimator (MLE), denoted as $\hat\theta^{ML}$, maximizes the likelihood (11.1), i.e.
$$\hat\theta_n^{ML}=\arg\max_{\theta\in\Theta}\,L(\theta\mid x_n).$$
Since the logarithm is a monotone function, instead of maximizing the likelihood directly, it is usually preferable to maximize the log-likelihood (11.2) or, equivalently, to minimize the negative log-likelihood, i.e.
$$\hat\theta_n^{ML}=\arg\max_{\theta\in\Theta}\,\ell(\theta\mid x_n)=\arg\min_{\theta\in\Theta}\,\big[-\ell(\theta\mid x_n)\big].$$
In general, the true population Score (11.3) is not directly observable; therefore, given a realized sample $x_n$, the maximum likelihood estimator solves the system of observed Score equations set equal to zero. More precisely, in the vector case the first order conditions (FOC) for the occurrence of a maximum (or a minimum) are
$$\nabla\ell(\hat\theta_n^{ML}\mid x_n)=0\iff\begin{cases}\dfrac{\partial}{\partial\theta_1}\ell(\hat\theta_n^{ML}\mid x_n)=0\\ \dfrac{\partial}{\partial\theta_2}\ell(\hat\theta_n^{ML}\mid x_n)=0\\ \quad\vdots\\ \dfrac{\partial}{\partial\theta_p}\ell(\hat\theta_n^{ML}\mid x_n)=0\end{cases}\tag{11.6}$$
In general, these equations may not have a closed-form solution, so the estimate is obtained by numerical methods.
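As a minimal sketch of the numerical route, consider (as an assumed example) an IID Exponential sample with rate $\lambda$, whose observed score is $n/\lambda-\sum_i x_i$: the score equation can be solved with a root finder and compared with the closed-form MLE $\hat\lambda=1/\bar{x}$.

set.seed(123)
x <- rexp(200, rate = 2)                 # simulated IID sample, true lambda = 2
# observed score of an IID Exponential sample: n / lambda - sum(x)
score_exp <- function(lambda, x) length(x) / lambda - sum(x)
# solve the score equation numerically on a bracketing interval
lambda_hat_num <- uniroot(score_exp, interval = c(1e-6, 100), x = x)$root
# closed-form MLE for comparison: 1 / sample mean
lambda_hat_cf <- 1 / mean(x)
c(numerical = lambda_hat_num, closed_form = lambda_hat_cf)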

The observed Jacobian is the matrix of second-order partial and cross-partial derivatives computed at the MLE estimate,
$$J(\hat\theta_n^{ML}\mid x_n)=\begin{pmatrix}\dfrac{\partial^2\ell(\hat\theta_n^{ML}\mid x_n)}{\partial\theta_1^2} & \cdots & \dfrac{\partial^2\ell(\hat\theta_n^{ML}\mid x_n)}{\partial\theta_1\partial\theta_p}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial^2\ell(\hat\theta_n^{ML}\mid x_n)}{\partial\theta_p\partial\theta_1} & \cdots & \dfrac{\partial^2\ell(\hat\theta_n^{ML}\mid x_n)}{\partial\theta_p^2}\end{pmatrix},$$
and represents the actual curvature of the log-likelihood at the observed sample. To ensure a strict local maximum, the observed Jacobian must be negative definite when computed at the MLE estimate, i.e. when $\theta=\hat\theta_n^{ML}$.

Global vs. local maximum

In general, if the Hessian is positive definite (all eigenvalues positive) at $\hat\theta^{ML}$, then it corresponds to a local minimum. If the Hessian is negative definite (all eigenvalues negative), then it corresponds to a local maximum. If the Hessian has both positive and negative eigenvalues, then it corresponds to a saddle point (neither a maximum nor a minimum).

If the entire log-likelihood function is globally concave in $\theta$ (i.e. $J(\theta)$ is negative semi-definite everywhere), then any local maximum is automatically a global maximum. But if $\ell$ is not globally concave, negative definiteness at $\hat\theta^{ML}$ only guarantees a local maximum. Many classical families of distributions have log-concave likelihoods: for example, exponential-family distributions (Normal with known variance, Poisson, Bernoulli, Exponential, etc.) have log-likelihoods that are globally concave in their natural parameters. In those cases, negative definiteness at a stationary point implies that this stationary point is the unique global maximizer. On the other hand, mixture models (e.g. Gaussian mixtures) have non-concave likelihoods and may admit multiple local maxima: the MLE can then be non-unique, and a numerical optimizer may converge to a local maximum. A quick numerical check of the curvature at the optimum is sketched below.
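The following sketch (assuming a simulated Normal sample, similar to Example 11.1 below) asks optim for the Hessian of the negative log-likelihood at the optimum and inspects its eigenvalues: if they are all positive, the observed Jacobian of the log-likelihood is negative definite, i.e. the optimizer found a strict local maximum.

set.seed(1)
x <- rnorm(500, mean = 1, sd = 2)
# negative log-likelihood of a Normal sample, params = c(mean, sd)
negloglik <- function(params, x) -sum(dnorm(x, mean = params[1], sd = params[2], log = TRUE))
# ask optim for the Hessian of the negative log-likelihood at the optimum
opt <- optim(c(0, 1), negloglik, x = x, hessian = TRUE)
eigen(opt$hessian)$values   # all positive => J(theta_hat | x_n) is negative definite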

Invariance property of the MLE

A property of the MLE is invariance under one-to-one (monotone) transformations: if $\hat\theta_n^{ML}$ is the MLE of $\theta$, then for any one-to-one transformation $\phi=g(\theta)$, the MLE of $\phi$ is $\hat\phi^{ML}=g(\hat\theta_n^{ML})$.
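A quick numerical illustration of the invariance property, as a sketch under the assumption of a Normal sample with known zero mean: the MLE of $\sigma$ obtained by transforming the MLE of $\sigma^2$ coincides with the one obtained by maximizing directly over $\sigma$.

set.seed(42)
x <- rnorm(1000, mean = 0, sd = 3)
# MLE of sigma^2 for a Normal with known mean 0 (closed form)
sigma2_hat <- mean(x^2)
# by invariance, the MLE of sigma = g(sigma^2) = sqrt(sigma^2)
sqrt(sigma2_hat)
# direct numerical maximization over sigma gives the same value
negloglik_sigma <- function(sigma, x) -sum(dnorm(x, mean = 0, sd = sigma, log = TRUE))
optimize(negloglik_sigma, interval = c(0.01, 10), x = x)$minimum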

11.1.1 Independent sample

In the special case in which the sample is composed of independent observations, the joint density factorizes into the product of the single densities, i.e.
$$f_{X_n}(x_1,\dots,x_n\mid\theta)=\prod_{i=1}^n f_{X_i}(x_i\mid\theta).$$
Therefore, the log-likelihood (11.2) also simplifies into the sum of the per-observation log-likelihoods, where $f_{X_i}$ can be different for each random variable $X_i$, i.e.
$$\ell(\theta\mid x_n)=\sum_{i=1}^n\ell_i(\theta\mid x_i),$$
where the log-likelihood of the $i$-th observation depends on the density of $X_i$ and reads
$$\ell_i(\theta\mid x_i)=\log\big[f_{X_i}(x_i\mid\theta)\big],$$
and the true population Score per observation reads
$$S_i(\theta)=\mathbb{E}\{\nabla\ell_i(\theta\mid X_i)\}.$$
Hence, the total gradient vector is obtained as the sum of the per-observation gradients, i.e.
$$\nabla\ell(\theta\mid x_n)=\sum_{i=1}^n\nabla\ell_i(\theta\mid x_i)=\begin{pmatrix}\sum_{i=1}^n\dfrac{\partial\ell_i(\theta\mid x_i)}{\partial\theta_1}\\ \sum_{i=1}^n\dfrac{\partial\ell_i(\theta\mid x_i)}{\partial\theta_2}\\ \vdots\\ \sum_{i=1}^n\dfrac{\partial\ell_i(\theta\mid x_i)}{\partial\theta_p}\end{pmatrix},$$
and the true population Score is the sum of the per-observation scores, i.e.
$$S(\theta)=\sum_{i=1}^n S_i(\theta).$$
Similarly, the Jacobian of the $i$-th observation reads
$$J_i(\theta\mid x_i)=\begin{pmatrix}\dfrac{\partial^2\ell_i(\theta\mid x_i)}{\partial\theta_1^2} & \cdots & \dfrac{\partial^2\ell_i(\theta\mid x_i)}{\partial\theta_1\partial\theta_p}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial^2\ell_i(\theta\mid x_i)}{\partial\theta_p\partial\theta_1} & \cdots & \dfrac{\partial^2\ell_i(\theta\mid x_i)}{\partial\theta_p^2}\end{pmatrix}.$$
For independent samples, the Hessian per observation is
$$H_i(\theta)=\mathbb{E}\{J_i(\theta\mid X_i)\},$$
and the Fisher information per observation is
$$I_i(\theta)=-H_i(\theta).$$
Therefore, the total population Hessian reads
$$H(\theta)=\sum_{i=1}^n H_i(\theta),$$
and the total Fisher information
$$I(\theta)=\sum_{i=1}^n I_i(\theta).$$
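As an assumed concrete example of an independent but not identically distributed sample, consider Normal observations sharing the mean $\mu$ but with known, observation-specific variances $\sigma_i^2$: the total log-likelihood is the sum of heterogeneous per-observation terms, and solving the score equation $\sum_i(x_i-\mu)/\sigma_i^2=0$ gives a precision-weighted mean.

set.seed(7)
n <- 300
sigma2_i <- runif(n, min = 0.5, max = 4)   # known, observation-specific variances
x <- rnorm(n, mean = 2, sd = sqrt(sigma2_i))
# total log-likelihood: sum of per-observation log-densities (a different f_{X_i} for each i)
loglik_mu <- function(mu) sum(dnorm(x, mean = mu, sd = sqrt(sigma2_i), log = TRUE))
# numerical maximizer of the summed log-likelihood
mu_hat_num <- optimize(function(mu) -loglik_mu(mu), interval = c(-10, 10))$minimum
# solving the score equation sum_i (x_i - mu) / sigma_i^2 = 0 gives the precision-weighted mean
mu_hat_wm <- sum(x / sigma2_i) / sum(1 / sigma2_i)
c(numerical = mu_hat_num, weighted_mean = mu_hat_wm)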

11.1.2 IID sample

In the special case in which the sample is composed of independent and identically distributed (IID) observations, the joint density factorizes as
$$f_{X_n}(x_1,\dots,x_n\mid\theta)=\prod_{i=1}^n f_{X_1}(x_i\mid\theta).$$
Therefore, the log-likelihood (11.2) simplifies to
$$\ell(\theta\mid x_n)=\sum_{i=1}^n\ell(\theta\mid x_i),$$
where
$$\ell(\theta\mid x_i)=\log\big[f_{X_1}(x_i\mid\theta)\big].$$
In this case, the gradient of the log-likelihood depends only on the different values assumed by $x_i$, i.e.
$$\nabla\ell(\theta\mid x_i)=\begin{pmatrix}\dfrac{\partial\ell(\theta\mid x_i)}{\partial\theta_1}\\ \dfrac{\partial\ell(\theta\mid x_i)}{\partial\theta_2}\\ \vdots\\ \dfrac{\partial\ell(\theta\mid x_i)}{\partial\theta_p}\end{pmatrix},$$
and the total gradient reads
$$\nabla\ell(\theta\mid x_n)=\sum_{i=1}^n\nabla\ell(\theta\mid x_i).$$
Therefore, taking the expected value of the gradient, one recovers the population Score (11.3), i.e.
$$S(\theta)=n\,\mathbb{E}\{\nabla\ell(\theta\mid X_1)\}.$$
Similarly, the matrix of second derivatives is
$$J(\theta\mid x_i)=\begin{pmatrix}\dfrac{\partial^2\ell(\theta\mid x_i)}{\partial\theta_1^2} & \cdots & \dfrac{\partial^2\ell(\theta\mid x_i)}{\partial\theta_1\partial\theta_p}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial^2\ell(\theta\mid x_i)}{\partial\theta_p\partial\theta_1} & \cdots & \dfrac{\partial^2\ell(\theta\mid x_i)}{\partial\theta_p^2}\end{pmatrix}.$$
Hence, the Hessian per observation is the same for each $X_i$, i.e.
$$H_i(\theta)=\mathbb{E}\{J(\theta\mid X_1)\},$$
and the Fisher information per observation is
$$I_i(\theta)=-\mathbb{E}\{J(\theta\mid X_1)\}.$$
Therefore, the total population Hessian is
$$H(\theta)=n\,\mathbb{E}\{J(\theta\mid X_1)\},$$
and the total Fisher information
$$I(\theta)=-n\,\mathbb{E}\{J(\theta\mid X_1)\}.$$

Exercise 11.1 Let’s consider an IID sample $x_n=(x_1,\dots,x_n)$, where each $x_i$ is drawn from a Normal distribution $X_i\sim N(\mu,\sigma^2)$ with known variance $\sigma^2$. The likelihood, given the realized sample and the known variance, reads
$$L(\mu\mid x_n,\sigma^2)=\prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right).\tag{11.7}$$
Compute the Fisher information according to (11.5).

Solution 11.1. Let’s consider the given joint density; then, by definition (11.2), the log-likelihood of the $x_i$ observation, given $\sigma^2$, reads
$$\ell(\mu\mid x_i,\sigma^2)=-\frac{1}{2}\log(2\pi)-\frac{1}{2}\left[\log(\sigma^2)+\left(\frac{x_i-\mu}{\sigma}\right)^2\right].$$
Therefore, the total log-likelihood is the sum of the per-observation log-likelihoods,
$$\ell(\mu\mid x_n,\sigma^2)=\sum_{i=1}^n\ell(\mu\mid x_i,\sigma^2)=-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2.$$
The first derivative of the log-likelihood with respect to the mean parameter $\mu$, given the observation $x_i$, reads
$$\frac{\partial\ell(\mu\mid x_i,\sigma^2)}{\partial\mu}=\frac{x_i-\mu}{\sigma^2},$$
and the second derivative
$$\frac{\partial^2\ell(\mu\mid x_i,\sigma^2)}{\partial\mu^2}=-\frac{1}{\sigma^2}.$$
Therefore, the observed score with respect to $\mu$ is the sum of the per-observation scores, i.e.
$$\frac{\partial\ell(\mu\mid x_n,\sigma^2)}{\partial\mu}=\frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu),$$
and similarly
$$\frac{\partial^2\ell(\mu\mid x_n,\sigma^2)}{\partial\mu^2}=-\frac{n}{\sigma^2}.$$
Hence, the Fisher information (11.5) reads
$$I(\mu)=-n\,\mathbb{E}\left\{\frac{\partial^2\ell(\mu\mid X_1,\sigma^2)}{\partial\mu^2}\right\}=\frac{n}{\sigma^2}.$$
Intuitively, since for a Normal random variable a greater $\sigma^2$ implies a more dispersed distribution around the center $\mu$, the information about the mean parameter contained in an observed sample of $n$ observations decreases as $\sigma^2$ increases. Moreover, it is interesting to note that as the number of observations $n$ increases, the impact of any fixed variance $0<\sigma^2<\infty$ becomes negligible.
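A small Monte Carlo sketch (with assumed values $n=50$, $\mu=1$, $\sigma^2=4$) illustrates the result: the variance of the MLE $\hat\mu=\bar{x}$ across replications is close to the inverse Fisher information $1/I(\mu)=\sigma^2/n$.

set.seed(11)
n <- 50; mu <- 1; sigma2 <- 4
# with known variance the MLE of mu is the sample mean; replicate it over many samples
mu_hat <- replicate(10000, mean(rnorm(n, mean = mu, sd = sqrt(sigma2))))
var(mu_hat)    # Monte Carlo variance of the MLE
sigma2 / n     # inverse Fisher information: 1 / I(mu) = sigma^2 / n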

11.1.3 Properties of the MLE

Let $\hat\theta_n^{ML}$ be the MLE estimate of some unknown true population parameter $\theta_0$. Then, under the main regularity conditions (i.e. identifiability, differentiability, non-singularity of the Fisher information), the MLE has two fundamental properties:

  1. Consistency: the MLE is consistent for the true population parameter $\theta_0$, in the sense that, as the number of observations $n$ increases, it converges in probability to the true population parameter, i.e. $\hat\theta_n^{ML}\xrightarrow{\;p\;}\theta_0$ as $n\to\infty$.
  2. Asymptotic Normality: the asymptotic joint distribution of the MLE is multivariate Normal of dimension $p$, where $p$ is the number of parameters, i.e. $\hat\theta_n^{ML}\xrightarrow{\;d\;}MVN_p\!\left(\theta_0,\tfrac{1}{n}I^{-1}(\theta_0)\right)$, where here $I(\theta_0)$ denotes the Fisher information of a single observation. Equivalently, the distribution of the rescaled estimation error converges to a multivariate Normal with mean zero and covariance matrix given by the inverse Fisher information, i.e. $\sqrt{n}\,(\hat\theta_n^{ML}-\theta_0)\xrightarrow{\;d\;}MVN_p\!\left(0,I^{-1}(\theta_0)\right)$. A small simulation check of this property is sketched below.
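As a simulation sketch of the asymptotic Normality property (under an assumed IID Exponential sample with rate $\lambda_0=2$, for which the per-observation Fisher information is $1/\lambda^2$), the Monte Carlo distribution of $\sqrt{n}\,(\hat\lambda_n^{ML}-\lambda_0)$ can be compared with its theoretical Normal limit $N(0,\lambda_0^2)$.

set.seed(22)
n <- 500; lambda0 <- 2
# MLE of an Exponential rate: lambda_hat = 1 / sample mean, replicated over many samples
lambda_hat <- replicate(5000, 1 / mean(rexp(n, rate = lambda0)))
z <- sqrt(n) * (lambda_hat - lambda0)
c(mc_sd = sd(z), theory_sd = lambda0)   # asymptotic variance I(lambda0)^{-1} = lambda0^2
# graphical normality check of the rescaled estimation error
qqnorm(z); qqline(z)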

11.1.4 MLE in the Gaussian case

Let’s consider a sample of $n$ independent and identically distributed random variables $x_n=\{x_1,x_2,\dots,x_n\}$. All the observations are drawn from the same probability distribution, a Normal with unknown true population parameters $\mu$ and $\sigma^2$. Specifically, the joint density of $n$ IID Normal random variables reads as in (11.7). Thus, the likelihood is a function of the unknown parameters given the observed data, i.e.
$$L(\mu,\sigma^2\mid x_n)=\prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right).$$
The log-likelihood function (11.2) is obtained by taking the natural logarithm of the likelihood, i.e.
$$\begin{aligned}\ell(\mu,\sigma^2\mid x_n)&=\ln L(\mu,\sigma^2\mid x_n)\\&=\sum_{i=1}^n\ln\left(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\right)\\&=\sum_{i=1}^n\left(-\frac{1}{2}\ln(2\pi)-\frac{1}{2}\ln(\sigma^2)-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\\&=-\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln(\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2.\end{aligned}$$

  1. Partial derivatives with respect to the parameters $\theta=(\mu,\sigma^2)$. The partial derivative of the log-likelihood with respect to the mean parameter for the $x_i$ observation reads
$$\frac{\partial\ell_i(\mu,\sigma^2\mid x_i)}{\partial\mu}=\frac{x_i-\mu}{\sigma^2},$$
and the total partial derivative reads
$$\frac{\partial\ell(\mu,\sigma^2\mid x_n)}{\partial\mu}=\frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu).$$
Similarly, the first derivative of the log-likelihood with respect to the variance parameter reads
$$\frac{\partial\ell_i(\mu,\sigma^2\mid x_i)}{\partial\sigma^2}=-\frac{1}{2\sigma^2}+\frac{(x_i-\mu)^2}{2\sigma^4},$$
and the total partial derivative reads
$$\frac{\partial\ell(\mu,\sigma^2\mid x_n)}{\partial\sigma^2}=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^n(x_i-\mu)^2.$$

  2. Score vector: the complete score vector reads
$$\nabla\ell(\mu,\sigma^2\mid x_n)=\begin{pmatrix}\dfrac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu)\\ -\dfrac{n}{2\sigma^2}+\dfrac{1}{2\sigma^4}\sum_{i=1}^n(x_i-\mu)^2\end{pmatrix}.$$
Notably, if the $X_i$ are truly Normal, then the expected value of the gradient, i.e. the Score, is zero when evaluated at the true parameters $\mu$ and $\sigma^2$:
$$\mathbb{E}\{\nabla\ell(\mu,\sigma^2\mid X_n)\}=\begin{pmatrix}\dfrac{1}{\sigma^2}\sum_{i=1}^n(\mathbb{E}\{X_i\}-\mu)\\ -\dfrac{n}{2\sigma^2}+\dfrac{1}{2\sigma^4}\sum_{i=1}^n\mathbb{E}\{(X_i-\mu)^2\}\end{pmatrix}=\begin{pmatrix}\dfrac{1}{\sigma^2}\sum_{i=1}^n(\mu-\mu)\\ -\dfrac{n}{2\sigma^2}+\dfrac{1}{2\sigma^4}\sum_{i=1}^n\sigma^2\end{pmatrix}=\begin{pmatrix}0\\0\end{pmatrix},$$
which is equivalent to saying that the model is correctly specified.

To find the maximum likelihood estimates of the unknown parameters, one has to search for the values of $\mu$ and $\sigma^2$ such that the observed Score equations are equal to zero. In general, this requires solving the system of first order conditions (11.6), i.e.
$$\nabla\ell(\mu,\sigma^2\mid x_n)=0\iff\begin{cases}\dfrac{\partial\ell(\mu,\sigma^2\mid x_n)}{\partial\mu}=0\\ \dfrac{\partial\ell(\mu,\sigma^2\mid x_n)}{\partial\sigma^2}=0\end{cases}$$
Hence, one obtains a system of two equations in two unknowns. From the first equation, which depends only on $\mu$, the solution is exactly the sample mean, i.e.
$$\frac{\partial\ell(\mu,\sigma^2\mid x_n)}{\partial\mu}=0\iff\hat\mu_{ML}=\frac{1}{n}\sum_{i=1}^n x_i.$$
Then, substituting $\hat\mu_{ML}$ in the second equation gives
$$\frac{\partial\ell(\mu,\sigma^2\mid x_n)}{\partial\sigma^2}=0\iff\hat\sigma^2_{ML}=\frac{1}{n}\sum_{i=1}^n(x_i-\hat\mu_{ML})^2.$$
To compute the Hessian matrix (11.4), one has to make the second and cross derivatives explicit, i.e.
$$\frac{\partial^2\ell(\mu,\sigma^2\mid x_n)}{\partial\mu^2}=-\frac{n}{\sigma^2},\qquad\frac{\partial^2\ell(\mu,\sigma^2\mid x_n)}{\partial(\sigma^2)^2}=\frac{n}{2\sigma^4}-\frac{1}{\sigma^6}\sum_{i=1}^n(x_i-\mu)^2,$$
and the cross derivative in $\mu$ and $\sigma^2$, i.e.
$$\frac{\partial^2\ell(\mu,\sigma^2\mid x_n)}{\partial\sigma^2\partial\mu}=-\frac{1}{\sigma^4}\sum_{i=1}^n(x_i-\mu).$$
Therefore, the Jacobian matrix computed at the MLE estimate reads:
$$J(\hat\mu_{ML},\hat\sigma^2_{ML}\mid x_n)=\begin{pmatrix}-\dfrac{n}{\hat\sigma^2_{ML}} & 0\\ 0 & \dfrac{n}{2\hat\sigma^4_{ML}}-\dfrac{1}{\hat\sigma^6_{ML}}\displaystyle\sum_{i=1}^n(x_i-\hat\mu_{ML})^2\end{pmatrix},$$
since the cross derivative computed at the MLE estimate is equal to zero. Hence, substituting $\sum_{i=1}^n(x_i-\hat\mu_{ML})^2=n\hat\sigma^2_{ML}$, the Hessian matrix evaluated at the estimates reads
$$H(\hat\mu_{ML},\hat\sigma^2_{ML})=\begin{pmatrix}-\dfrac{n}{\hat\sigma^2_{ML}} & 0\\ 0 & \dfrac{n}{2\hat\sigma^4_{ML}}-\dfrac{n\hat\sigma^2_{ML}}{\hat\sigma^6_{ML}}\end{pmatrix}.$$
The first diagonal element is always less than zero, and so is the second, i.e.
$$\frac{n}{2\hat\sigma^4_{ML}}-\frac{n\hat\sigma^2_{ML}}{\hat\sigma^6_{ML}}=-\frac{n}{2\hat\sigma^4_{ML}}<0.$$
Thus the Hessian is negative definite at the MLE, confirming a strict local maximum. Moreover, the theoretical Fisher information for $n$ IID observations,
$$I(\hat\mu_{ML},\hat\sigma^2_{ML})=-H(\hat\mu_{ML},\hat\sigma^2_{ML})=\begin{pmatrix}\dfrac{n}{\hat\sigma^2_{ML}} & 0\\ 0 & \dfrac{n}{2\hat\sigma^4_{ML}}\end{pmatrix},$$
is exactly $n$ times the Fisher information of a single observation, i.e.
$$I_i(\hat\mu_{ML},\hat\sigma^2_{ML})=\begin{pmatrix}\dfrac{1}{\hat\sigma^2_{ML}} & 0\\ 0 & \dfrac{1}{2\hat\sigma^4_{ML}}\end{pmatrix}.$$
Therefore, asymptotically, the MLE estimators are jointly Normally distributed, i.e.
$$\begin{pmatrix}\hat\mu_{ML}\\ \hat\sigma^2_{ML}\end{pmatrix}\xrightarrow{\;d\;}MVN_2\!\left(\begin{pmatrix}\mu\\ \sigma^2\end{pmatrix},\frac{1}{n}\begin{pmatrix}\sigma^2 & 0\\ 0 & 2\sigma^4\end{pmatrix}\right).$$

Example 11.1 Let’s simulate 5000 observations from an IID Normal distribution with mean $\mu=1$ and variance $\sigma^2=4$. Then, given the simulated sample, estimate the maximum likelihood parameters.

Maximum likelihood estimate
library(dplyr)
library(purrr)
################ inputs ################ 
set.seed(1)
n_sim <- 5000 # number of simulated observations
# true parameters
par <- c(mu = 1, sigma2 = 4)
# normal sample 
x <- rnorm(n_sim, mean = par[1], sd = sqrt(par[2]))
########################################
# negative log-likelihood for a normal sample, params = c(mean, sd)
loglik_norm <- function(params, x){
  -sum(dnorm(x, mean = params[1], sd = params[2], log = TRUE))
}
# Numerical optimization: minimize the negative log-likelihood (Nelder-Mead by default)
opt <- optim(c(0, 1), loglik_norm, x = x)
# MLE estimates of the mean and standard deviation, with the maximized log-likelihood
best_par <- tibble(mu = opt$par[1], sigma = opt$par[2], loglik = -opt$value)
$\hat\mu_{ML}$   $\mu$ (true)   Sample mean   $\sigma$ (true)   $\hat\sigma_{ML}$   Sample std. deviation
0.994            1              0.994         2                 2.053               2.053
Table 11.1: Mean and std. deviation computed on the simulated sample and MLE estimates.
Figure 11.1: Log-likelihood function for a normal sample.
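Continuing the example, a short check (reusing the simulated vector x and the optim output opt from the code above) compares the closed-form Gaussian MLEs derived in Section 11.1.4 with the numerical estimates.

# closed-form Gaussian MLEs from the first order conditions
mu_hat     <- mean(x)                 # sample mean
sigma2_hat <- mean((x - mu_hat)^2)    # MLE of the variance (divides by n, not n - 1)
# compare with the numerical optimum (optim parametrizes the std. deviation)
c(mu_closed = mu_hat, mu_optim = opt$par[1],
  sd_closed = sqrt(sigma2_hat), sd_optim = opt$par[2])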

11.2 Quasi-Maximum Likelihood

Quasi-Maximum Likelihood Estimation (QMLE), also known as pseudo-maximum likelihood, is an estimation strategy used when the full distribution may be misspecified but a parametric likelihood can still be posited (e.g., a correct conditional mean/variance specification with a wrong error distribution). In other words, we assume a parametric likelihood function for the data (perhaps incorrectly) and then maximize it as if it were true.

The QML estimator sacrifices some efficiency (if the model is misspecified there could exist estimators with lower variance) for robustness: it converges to the pseudo-true parameter $\theta^*$ (the best approximation within the assumed family), and valid inference follows from robust (sandwich) standard errors. This approach yields the quasi-MLE, an estimator that behaves like the true MLE under certain conditions but remains consistent even if the assumed distribution is wrong (see White).

11.2.1 QML Estimator

Let $f_{X_n}$ be the assumed parametric density function, depending on the vector of parameters $\theta$. Then, even if the density $f$ is misspecified, we define the pseudo-true parameter $\theta^*$ as the value that maximizes the expected log-likelihood under the true distribution (see Gourieroux, Monfort, and Trognon), i.e.
$$\theta^*=\arg\max_{\theta\in\Theta}\mathbb{E}\{\ell(\theta\mid X_n)\},$$
where the expectation is taken under the true data-generating process. In this case the log-likelihood is also called a pseudo-likelihood, because we do not require it to be based on the true density. Equivalently, $\theta^*$ is the value that minimizes the Kullback–Leibler (KL) divergence between the true density and the (possibly misspecified) parametric density. If the model is correctly specified, $\theta^*=\theta_0$ and the QMLE coincides with the MLE.

The QML estimator is the sample counterpart: it maximizes the average log-likelihood given the realized sample $x_n$, i.e.
$$\hat\theta_n^{QML}=\arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n\ell(\theta\mid x_i).$$
Then, to find $\hat\theta_n^{QML}$ one proceeds exactly as in ordinary MLE optimization, searching for the vector of parameters that maximizes the average log-likelihood.
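As an illustrative sketch with an assumed data-generating process, suppose the data are truly Exponential with rate 1 but a Normal quasi-likelihood is fitted: the QML estimates approach the pseudo-true parameters, which for the Normal family are the true mean and variance of the data (here both equal to 1), even though the model is misspecified.

set.seed(33)
x <- rexp(20000, rate = 1)             # true DGP: Exponential(1), mean = 1, variance = 1
# Normal (pseudo) negative log-likelihood, params = c(mu, sigma2)
negqlik <- function(params, x){
  if (params[2] <= 0) return(Inf)      # keep the variance positive during the search
  -sum(dnorm(x, mean = params[1], sd = sqrt(params[2]), log = TRUE))
}
qml <- optim(c(0, 2), negqlik, x = x)
qml$par                                # close to the pseudo-true values (1, 1)

Here the pseudo-true mean and variance coincide with the moments of the Exponential distribution, which is the sense in which the Normal fit is the best approximation within the assumed family.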

11.2.2 Properties

  1. Consistency of the QMLE: under the assumption that the model is identifiable at $\theta^*$, meaning that the expected log-likelihood has a unique maximum at $\theta^*$, and under standard regularity conditions (e.g. continuity of $f(x;\theta)$ in $\theta$), the QML estimator converges in probability, as $n\to\infty$, to the pseudo-true parameter, i.e.
$$\hat\theta_n^{QML}\xrightarrow{\;p\;}\theta^*.$$
If the model is correctly specified, then the pseudo-true parameter coincides with the actual true parameter, i.e. $\theta^*=\theta_0$, and the QMLE reduces to the ordinary MLE. However, when the model is misspecified, the QML estimator converges to the $\theta^*$ that produces the best approximation of the true distribution, in the sense that it minimizes their KL divergence.

  2. Asymptotic normality: as for the MLE, under regularity conditions the QMLE is asymptotically Normal. However, the asymptotic distribution must account for possible misspecification. More precisely, let’s define the expected curvature and outer-product matrices computed at the pseudo-true parameter as
$$A(\theta^*)=\mathbb{E}\{J(\theta^*\mid X_n)\}\qquad\text{and}\qquad B(\theta^*)=\mathbb{E}\{\nabla\ell(\theta^*\mid X_n)\,\nabla\ell(\theta^*\mid X_n)^\top\}.$$
Then, a result derived by Huber in the context of M-estimators and by White in the context of misspecified likelihoods establishes that, under misspecification, the information matrix equality $-A(\theta^*)=B(\theta^*)$ generally fails and the asymptotic variance is a sandwich of the two matrices, i.e.
$$\sqrt{n}\,(\hat\theta_n^{QML}-\theta^*)\xrightarrow{\;d\;}MVN\!\left(0,\,A^{-1}(\theta^*)\,B(\theta^*)\,A^{-1}(\theta^*)\right).$$
If the model is correctly specified, then $\theta^*=\theta_0$ and one obtains
$$A(\theta_0)=\mathbb{E}\{J(\theta_0\mid X_n)\}=-\mathbb{E}\{\nabla\ell(\theta_0\mid X_n)\,\nabla\ell(\theta_0\mid X_n)^\top\}=-B(\theta_0).$$
In this case the asymptotic variance simplifies, since $A^{-1}BA^{-1}=-A^{-1}$, and we recover the ML result, which is asymptotically efficient. More precisely, the asymptotic variance-covariance matrix of the QMLE equals that of the MLE,
$$\mathrm{Cv}\{\hat\theta_n^{QML}\}=\mathrm{Cv}\{\hat\theta_n^{ML}\}=-A^{-1}(\theta_0),$$
i.e. the inverse Fisher information. A noteworthy point is that the QMLE is generally not fully efficient if the model is misspecified: there might exist other estimators targeting the same quantity with smaller asymptotic variance, if one knew more about the true distribution. But the QMLE has the convenience of likelihood-based estimation and retains consistency and asymptotic normality without needing the full truth.

11.2.3 Sandwich Standard Errors

Given the asymptotic variance formula above, a crucial practical issue is how to compute standard errors for QMLE estimates. If one naively computes standard errors as if the assumed model were true (for example, using just the inverse Hessian), the inference can be invalid when the model is misspecified. The remedy is to use robust or sandwich standard errors, often called heteroskedasticity-consistent standard errors in econometrics (see White).

In practice, to compute the robust variance estimate, one proceeds as follows:

  1. First, obtain the QML estimates $\hat\theta_n^{QML}$ by maximizing the pseudo log-likelihood.

  2. Secondly, compute the average observed Jacobian (Hessian) matrix at the estimated parameter, the sample counterpart of $A(\theta^*)$, i.e.
$$\hat A(\hat\theta_n^{QML})=\frac{1}{n}\sum_{i=1}^n J_i(\hat\theta_n^{QML}\mid x_i).$$

  3. Compute the outer product of gradients matrix at the QMLE:
$$\hat B(\hat\theta_n^{QML})=\frac{1}{n}\sum_{i=1}^n\nabla\ell_i(\hat\theta_n^{QML}\mid x_i)\,\nabla\ell_i(\hat\theta_n^{QML}\mid x_i)^\top.$$
This uses the score of each observation and does not assume the model is correct. Hence, the estimated variance-covariance matrix of the QML parameters reads
$$\widehat{\mathrm{Cv}}\{\hat\theta_n^{QML}\}=\frac{1}{n}\,\hat A^{-1}(\hat\theta_n^{QML})\,\hat B(\hat\theta_n^{QML})\,\hat A^{-1}(\hat\theta_n^{QML}),$$
where the diagonal elements give the variance estimates of the single parameters and the off-diagonal elements their covariances, i.e.
$$\widehat{\mathrm{V}}\{\hat\theta_n^{QML}\}=\mathrm{diag}\big(\widehat{\mathrm{Cv}}\{\hat\theta_n^{QML}\}\big).$$
The $1/n$ factor applies when $\hat A$ and $\hat B$ are computed as averages; if they are computed as sums, the sandwich $\hat A^{-1}\hat B\hat A^{-1}$ is already on the right scale.

If the model is actually correctly specified, then $\hat B$ should be close to $-\hat A$ in large samples, and the robust variance reduces to the usual Fisher-information-based variance $-\hat A^{-1}/n$. Thus, the robust variance estimator is a conservative generalization: it equals the classical one in the ideal case, but remains valid in the non-ideal case.
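The following sketch puts the three steps together for a Normal quasi-likelihood under an assumed misspecified setting (Exponential data, Normal working model). The per-observation scores and Hessian terms are the analytic Normal expressions derived in Section 11.1.4, so no numerical differentiation is needed; the robust (sandwich) standard errors are then compared with the naive ones based on the quasi Fisher information alone.

set.seed(44)
x <- rexp(5000, rate = 1)                       # misspecified setting: true DGP is Exponential(1)
n <- length(x)
# QML estimates under the Normal working model (closed form: sample mean and variance)
mu_hat     <- mean(x)
sigma2_hat <- mean((x - mu_hat)^2)
# per-observation scores of the Normal log-likelihood at the QMLE, one row per observation
s_mu     <- (x - mu_hat) / sigma2_hat
s_sigma2 <- -1 / (2 * sigma2_hat) + (x - mu_hat)^2 / (2 * sigma2_hat^2)
scores   <- cbind(s_mu, s_sigma2)
# A_hat: average observed Hessian (Jacobian) of the Normal log-likelihood at the QMLE
A_hat <- matrix(c(-1 / sigma2_hat,
                  -mean(x - mu_hat) / sigma2_hat^2,
                  -mean(x - mu_hat) / sigma2_hat^2,
                  1 / (2 * sigma2_hat^2) - mean((x - mu_hat)^2) / sigma2_hat^3),
                nrow = 2, byrow = TRUE)
# B_hat: average outer product of the per-observation scores
B_hat <- crossprod(scores) / n
# sandwich variance-covariance matrix of the QML estimates
V_sandwich <- solve(A_hat) %*% B_hat %*% solve(A_hat) / n
# naive variance based on the (quasi) Fisher information only
V_naive <- solve(-A_hat) / n
cbind(robust_se = sqrt(diag(V_sandwich)), naive_se = sqrt(diag(V_naive)))

In this assumed setting the robust standard error for the variance parameter should be noticeably larger than the naive one, reflecting the heavy right tail of the Exponential data that the Normal working model ignores.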