Let’s consider a sequence of random variables $X_1, \dots, X_n$ with known parametric joint density $f(x_1, \dots, x_n; \theta)$, defined by a vector of parameters $\theta$ with dimension $p$. Then, $f$ can be seen as a function that takes as input a vector in $\mathbb{R}^{n} \times \Theta$ and gives as output a scalar in $\mathbb{R}^{+}$, i.e.
$$f: \mathbb{R}^{n} \times \Theta \to \mathbb{R}^{+}.$$
In fact, the first $n$ arguments come from a possible realized sample $\mathbf{x} = (x_1, \dots, x_n)$, while the remaining $p$ arguments come from the vector of parameters $\theta$, and $\Theta \subseteq \mathbb{R}^{p}$ is the admissible parameters’ space, i.e. $\theta \in \Theta$.

In general, when the set of parameters $\theta$ is considered fixed and the input of the function is only a vector of random variables (or a scalar in $\mathbb{R}$), then $f$ is called a density function, i.e. $x \mapsto f(x; \theta)$. On the other hand, when we fix the sample to a particular realized value $\mathbf{x}$ and we let the vector of parameters be the input of the function, we obtain the observed likelihood, i.e.
$$\mathcal{L}(\theta; \mathbf{x}) = f(x_1, \dots, x_n; \theta), \tag{11.1}$$
that represents the probability of observing the realized sample $\mathbf{x}$, given a vector of parameters $\theta$. Usually, the likelihood is denoted simply as $\mathcal{L}(\theta)$. In practice, for a given value of $\theta$, the likelihood expresses how likely it is that the data are generated under the distributive law implied by $\theta$.

The observed log-likelihood function is computed by taking the logarithm of the likelihood (Equation 11.1), i.e.
$$\ell(\theta; \mathbf{x}) = \log \mathcal{L}(\theta; \mathbf{x}). \tag{11.2}$$
When the log-likelihood is differentiable, we can define the observed gradient (Equation 31.9) of the log-likelihood with respect to the parameters’ vector, computed on a realized sample $\mathbf{x}$, i.e.
$$\nabla \ell(\theta; \mathbf{x}) = \frac{\partial \ell(\theta; \mathbf{x})}{\partial \theta},$$
and the observed Jacobian (Equation 31.10) of the gradient, i.e.
$$\mathcal{J}(\theta; \mathbf{x}) = \frac{\partial^{2} \ell(\theta; \mathbf{x})}{\partial \theta \, \partial \theta^{\top}}.$$

If we consider the gradient as a function of the random variables, then the population Score is the expected value of the gradient, i.e.
$$S(\theta) = \mathbb{E}\big[\nabla \ell(\theta; X_1, \dots, X_n)\big]. \tag{11.3}$$
Notably, when the joint density function is correctly specified, the Score computed at the true population parameter $\theta_0$ is equal to zero, i.e. $S(\theta_0) = 0$. Similarly, we define the population Hessian as the matrix given by the expected value of the Jacobian of the gradient, i.e.
$$H(\theta) = \mathbb{E}\big[\mathcal{J}(\theta; X_1, \dots, X_n)\big]. \tag{11.4}$$
Notably, the information matrix equality
$$\mathbb{E}\big[\nabla \ell(\theta_0)\,\nabla \ell(\theta_0)^{\top}\big] = -H(\theta_0)$$
holds only if the model is correctly specified; under misspecification it does not hold anymore. The Fisher information is related to the population Hessian matrix as follows:
$$\mathcal{I}(\theta) = -H(\theta).$$
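As a minimal illustration of the two views of $f$ (an assumed sketch, not part of the original text), the R snippet below evaluates the same Normal density once as a function of $x$ with the parameters held fixed, and once as a likelihood in the mean parameter with a small fixed sample; the sample values and parameter values are arbitrary.

```r
# Sketch (assumed example): the same Normal density viewed in two ways,
# as a density in x with theta fixed and as a likelihood in the parameter
# with the data fixed.
x_obs <- c(0.5, 1.2, -0.3)                 # a fixed realized sample

# Density view: theta = (mu, sigma) fixed, the input is x
density_view <- function(x) dnorm(x, mean = 1, sd = 2)

# Likelihood view: sample fixed, the input is mu (sigma assumed known here)
likelihood_view <- function(mu) prod(dnorm(x_obs, mean = mu, sd = 2))

density_view(0.5)        # f(x; theta) evaluated at x = 0.5
likelihood_view(1)       # L(theta; x) evaluated at mu = 1
```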
11.1 ML Estimator
In statistics, a method for estimating the unknown true vector of parameters $\theta_0$ is Maximum Likelihood (ML).
Under the assumption that the observed sample comes from a known parametric density function $f(\cdot\,; \theta)$, the maximum likelihood estimator is obtained by maximizing the likelihood function, so that, under the assumed distributive law, the observed data are the most probable.
More precisely, considering a realized sample $\mathbf{x} = (x_1, \dots, x_n)$ where $x_i \in \mathbb{R}$, the Maximum Likelihood Estimator (MLE), denoted as $\hat{\theta}_{ML}$, maximizes the likelihood (Equation 11.1), i.e.
$$\hat{\theta}_{ML} = \arg\max_{\theta \in \Theta} \mathcal{L}(\theta; \mathbf{x}).$$
Since the logarithm is a monotone function, instead of maximizing the likelihood directly, it is usually preferable to maximize the log-likelihood (Equation 11.2), or equivalently to minimize the negative log-likelihood, i.e.
$$\hat{\theta}_{ML} = \arg\max_{\theta \in \Theta} \ell(\theta; \mathbf{x}) = \arg\min_{\theta \in \Theta} \big[-\ell(\theta; \mathbf{x})\big].$$
In general, the true population Score (Equation 11.3) is not directly observable; therefore, given a realized sample $\mathbf{x}$, the maximum likelihood estimator solves the system of observed score equations set equal to zero. More precisely, in the vector case the first order conditions (FOC) for the occurrence of a maximum (or a minimum) are
$$\nabla \ell(\theta; \mathbf{x})\Big|_{\theta = \hat{\theta}_{ML}} = 0. \tag{11.6}$$
In general, these equations may not have a closed-form solution, so the estimate is obtained by numerical methods.
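As a sketch of this last point (an assumed example, not from the original text), the location parameter of a Cauchy sample has no closed-form MLE, so the observed score equation can be solved numerically; the true location value and the bracketing interval below are illustrative choices.

```r
# Sketch (assumed example): solving the first order condition numerically
# for the location parameter of a Cauchy sample.
set.seed(42)
x <- rcauchy(200, location = 3)

# Observed score: derivative of the Cauchy log-likelihood in the location theta
score <- function(theta) sum(2 * (x - theta) / (1 + (x - theta)^2))

# Solve score(theta) = 0; a bracket around the sample median is assumed to
# contain the relevant root
theta_hat <- uniroot(score, interval = c(median(x) - 3, median(x) + 3))$root
theta_hat
```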
The observed Jacobian is the matrix of second-order partial and cross-partial derivatives computed at the MLE estimate and represents the actual curvature of the log-likelihood at the observed sample. To ensure a strict local maximum, the observed Jacobian must be negative definite when computed at the MLE estimate, i.e. $z^{\top} \mathcal{J}(\hat{\theta}_{ML}; \mathbf{x})\, z < 0$ for every $z \neq 0$.
Global vs. local maximum
In general, if the Hessian is positive definite (all eigenvalues positive) at $\hat{\theta}$, then $\hat{\theta}$ corresponds to a local minimum. If the Hessian is negative definite (all eigenvalues negative), then it corresponds to a local maximum. If the Hessian has both positive and negative eigenvalues, then it corresponds to a saddle point (neither a maximum nor a minimum).
If the entire log-likelihood function is globally concave in $\theta$ (i.e. the Hessian is negative semi-definite everywhere), then any local maximum is automatically a global maximum. But if the log-likelihood is not globally concave, negative definiteness at $\hat{\theta}$ only guarantees a local maximum. Many classical families of distributions have log-concave likelihoods: for example, exponential-family distributions (Normal with known variance, Poisson, Bernoulli, Exponential, etc.) have log-likelihoods that are globally concave in their natural parameters. In those cases, negative definiteness at a stationary point implies that this stationary point is the unique global maximizer. On the other hand, mixture models (e.g. Gaussian mixtures) have non-concave likelihoods and may admit multiple local maxima. The MLE can then be non-unique, and a numerical optimizer may converge to a local maximum only.
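A minimal sketch of this check (an assumed example, not from the original text): fit a Normal by numerical optimization and inspect the eigenvalues of the observed Hessian returned by `optim()`; the simulated sample and starting values are arbitrary.

```r
# Sketch (assumed example): verify that a numerical optimum is a local maximum
# of the log-likelihood by checking the eigenvalues of the observed Hessian.
set.seed(7)
x <- rnorm(500, mean = 2, sd = 1.5)

# Negative log-likelihood of a Normal sample; guard against non-positive sd
negloglik <- function(par) {
  if (par[2] <= 0) return(Inf)
  -sum(dnorm(x, mean = par[1], sd = par[2], log = TRUE))
}

opt <- optim(c(0, 1), negloglik, hessian = TRUE)

# optim() minimizes the negative log-likelihood, so the log-likelihood Hessian
# is -opt$hessian; all its eigenvalues should be negative at a strict local maximum
eigen(-opt$hessian)$values
```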
Invariance property MLE
A property of the MLE is invariance under one-to-one (monotone) transformations: if $\hat{\theta}$ is the MLE of $\theta$, then for any one-to-one transformation $g(\cdot)$, the MLE of $g(\theta)$ is $g(\hat{\theta})$. For example, if $\hat{\sigma}^{2}$ is the MLE of the variance of a Normal distribution, then $\hat{\sigma} = \sqrt{\hat{\sigma}^{2}}$ is the MLE of its standard deviation.
11.1.1 Independent sample
In the special case in which the sample is composed of $n$ independent observations, the joint density factorizes into the product of the single densities, i.e.
$$f(x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} f_i(x_i; \theta).$$
Therefore, also the log-likelihood (Equation 11.2) simplifies into the sum of the log-likelihoods, where the density $f_i$ can be different for each random variable $X_i$, i.e.
$$\ell(\theta; \mathbf{x}) = \sum_{i=1}^{n} \ell_i(\theta; x_i),$$
where the log-likelihood given the $i$-th observation depends on the density of $X_i$ and reads
$$\ell_i(\theta; x_i) = \log f_i(x_i; \theta),$$
and the true population Score per observation reads
$$S_i(\theta) = \mathbb{E}\big[\nabla \ell_i(\theta; X_i)\big].$$
Hence, the total gradient vector is obtained as the sum of the gradients per observation, i.e.
$$\nabla \ell(\theta; \mathbf{x}) = \sum_{i=1}^{n} \nabla \ell_i(\theta; x_i),$$
and the true population Score is the sum of the scores per observation, i.e.
$$S(\theta) = \sum_{i=1}^{n} S_i(\theta).$$
Similarly, the Jacobian of the $i$-th observation reads
$$\mathcal{J}_i(\theta; x_i) = \frac{\partial^{2} \ell_i(\theta; x_i)}{\partial \theta \, \partial \theta^{\top}}.$$
For independent samples, the Hessian per observation is
$$H_i(\theta) = \mathbb{E}\big[\mathcal{J}_i(\theta; X_i)\big],$$
and the Fisher information per observation is
$$\mathcal{I}_i(\theta) = -H_i(\theta).$$
Therefore, the total Hessian in population reads
$$H(\theta) = \sum_{i=1}^{n} H_i(\theta),$$
and the total Fisher information is
$$\mathcal{I}(\theta) = \sum_{i=1}^{n} \mathcal{I}_i(\theta).$$
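As a sketch of the independent-but-not-identically-distributed case (an assumed example, not from the original text), consider Normal observations sharing an unknown mean but with known, observation-specific variances; the total log-likelihood is the sum of the observation-specific contributions, and the numerical maximizer can be compared with the closed-form precision-weighted mean.

```r
# Sketch (assumed example): independent but not identically distributed sample,
# each observation has its own known variance and a common unknown mean.
set.seed(11)
n       <- 200
sigma_i <- runif(n, 0.5, 3)                  # known, observation-specific std. deviations
x       <- rnorm(n, mean = 2, sd = sigma_i)  # common unknown mean, different variances

# Total log-likelihood: sum over i of log f_i(x_i; mu)
loglik <- function(mu) sum(dnorm(x, mean = mu, sd = sigma_i, log = TRUE))

# Numerical MLE of the common mean
mu_hat <- optimize(loglik, interval = range(x), maximum = TRUE)$maximum

# For this model the MLE has a closed form: the precision-weighted mean
c(numerical = mu_hat, weighted_mean = sum(x / sigma_i^2) / sum(1 / sigma_i^2))
```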
11.1.2 IID sample
In the special case in which the sample is composed of $n$ independent and identically distributed (IID) observations, the joint density factorizes as
$$f(x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta).$$
Therefore, also the log-likelihood (Equation 11.2) simplifies,
$$\ell(\theta; \mathbf{x}) = \sum_{i=1}^{n} \ell(\theta; x_i),$$
where
$$\ell(\theta; x_i) = \log f(x_i; \theta).$$
In this case, the gradient of the log-likelihood per observation depends only on the different values assumed by $x_i$, i.e.
$$\nabla \ell(\theta; x_i) = \frac{\partial \log f(x_i; \theta)}{\partial \theta},$$
and the total gradient reads
$$\nabla \ell(\theta; \mathbf{x}) = \sum_{i=1}^{n} \nabla \ell(\theta; x_i).$$
Therefore, taking the expected value of the gradient one recovers the population Score (Equation 11.3), i.e.
$$S(\theta) = n\,\mathbb{E}\big[\nabla \ell(\theta; X_i)\big].$$
Similarly, the matrix with the second derivatives is
$$\mathcal{J}(\theta; x_i) = \frac{\partial^{2} \log f(x_i; \theta)}{\partial \theta \, \partial \theta^{\top}}.$$
Hence, the Hessian per observation is the same for each $i$, i.e.
$$H_i(\theta) = \mathbb{E}\big[\mathcal{J}(\theta; X_i)\big] \equiv H_1(\theta),$$
and the Fisher information per observation is
$$\mathcal{I}_1(\theta) = -H_1(\theta).$$
Therefore, the total Hessian in population is
$$H(\theta) = n\,H_1(\theta),$$
and the total Fisher information is
$$\mathcal{I}(\theta) = n\,\mathcal{I}_1(\theta). \tag{11.5}$$
Exercise 11.1 Let’s consider an IID sample $X_1, \dots, X_n$, where each $X_i$ is drawn from a Normal distribution with unknown mean $\mu$ and known variance $\sigma^2$. The likelihood, given a realized sample $\mathbf{x}$ and the known variance $\sigma^2$, reads
$$\mathcal{L}(\mu; \mathbf{x}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{-\frac{(x_i - \mu)^2}{2\sigma^2}\right\}. \tag{11.7}$$
Compute the Fisher information according to Equation 11.5.
Solution 11.1. Let’s consider the given joint density; then by definition (Equation 11.2) the log-likelihood function of the $i$-th observation, given $\sigma^2$, reads
$$\ell(\mu; x_i) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2}.$$
Therefore, the total log-likelihood is the sum of the log-likelihoods,
$$\ell(\mu; \mathbf{x}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
The first derivative of the log-likelihood (Equation 11.3) with respect to the mean parameter, given the $i$-th observation, reads
$$\frac{\partial \ell(\mu; x_i)}{\partial \mu} = \frac{x_i - \mu}{\sigma^2},$$
and the second derivative is
$$\frac{\partial^{2} \ell(\mu; x_i)}{\partial \mu^{2}} = -\frac{1}{\sigma^2}.$$
Therefore, the Score with respect to $\mu$ is the sum of the scores, i.e.
$$\frac{\partial \ell(\mu; \mathbf{x})}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu),$$
and similarly
$$\frac{\partial^{2} \ell(\mu; \mathbf{x})}{\partial \mu^{2}} = -\frac{n}{\sigma^2}.$$
Hence, the Fisher information (Equation 11.5) reads
$$\mathcal{I}(\mu) = -\mathbb{E}\!\left[\frac{\partial^{2} \ell(\mu; \mathbf{X})}{\partial \mu^{2}}\right] = \frac{n}{\sigma^2}.$$
Intuitively, since for a Normal random variable a greater $\sigma^2$ implies a more dispersed distribution with respect to the center $\mu$, the information about the mean parameter contained in an observed sample with $n$ observations decreases when $\sigma^2$ increases. Moreover, it is interesting to note that as the number of observations increases ($n \to \infty$), the impact of any finite value of the variance parameter becomes negligible.
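A quick numerical check of this result (an assumed sketch, not from the original text): by the information matrix equality, the variance of the score with respect to $\mu$ should match $n/\sigma^2$; the sample size, true mean, and variance below are arbitrary.

```r
# Sketch (assumed example): Monte Carlo check of the Fisher information
# I(mu) = n / sigma^2 for an IID Normal sample with known variance.
set.seed(123)
n      <- 50      # sample size
mu     <- 1       # true mean
sigma2 <- 4       # known variance
n_rep  <- 10000   # Monte Carlo replications

# Score with respect to mu for one sample: sum_i (x_i - mu) / sigma^2
score_mu <- replicate(n_rep, {
  x <- rnorm(n, mean = mu, sd = sqrt(sigma2))
  sum(x - mu) / sigma2
})

# The Fisher information equals the variance of the score (information
# matrix equality), which should be close to n / sigma^2 = 12.5
c(empirical = var(score_mu), theoretical = n / sigma2)
```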
11.1.3 Properties MLE
Let $\hat{\theta}_{ML}$ be the MLE of some unknown true population parameter $\theta_0$. Then, under the main regularity conditions (i.e. identifiability, differentiability, non-singularity of the Fisher information), the MLE estimator has two fundamental properties:
Consistency: the MLE is consistent for the true population parameter in the sense that, as the number of observations increases, it converges in probability to the true population parameter, i.e. $\hat{\theta}_{ML} \xrightarrow{p} \theta_0$ as $n \to \infty$.
Asymptotic Normality: the asymptotic joint distribution of the MLE is multivariate Normal with dimension $p$, where $p$ is the number of parameters, i.e. $\hat{\theta}_{ML} \approx N\big(\theta_0, \mathcal{I}(\theta_0)^{-1}\big)$; equivalently, the distribution of the rescaled estimation error converges to a multivariate Normal with mean zero and covariance matrix given by the inverse of the Fisher information per observation, i.e.
$$\sqrt{n}\,\big(\hat{\theta}_{ML} - \theta_0\big) \xrightarrow{d} N\big(0, \, \mathcal{I}_1(\theta_0)^{-1}\big).$$
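A minimal simulation sketch of this property (an assumed example, not from the original text), using the Exponential rate parameter, for which the MLE is $1/\bar{x}$ and the Fisher information per observation is $1/\lambda^{2}$; the true rate, sample size, and number of replications are arbitrary.

```r
# Sketch (assumed example): Monte Carlo illustration of the asymptotic
# normality of the MLE for an Exponential rate parameter.
set.seed(99)
lambda <- 2        # true rate
n      <- 500      # sample size
n_rep  <- 5000     # Monte Carlo replications

lambda_hat <- replicate(n_rep, 1 / mean(rexp(n, rate = lambda)))

# Rescaled estimation error: approximately N(0, lambda^2), since
# I_1(lambda) = 1 / lambda^2 and the asymptotic variance is its inverse
z <- sqrt(n) * (lambda_hat - lambda)
c(mean = mean(z), variance = var(z), theoretical_variance = lambda^2)
```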
11.1.4 MLE in the Gaussian case
Let’s consider a sample of $n$ independent and identically distributed random variables $X_1, \dots, X_n$. All the observations are extracted from the same probability distribution, that is Normal with unknown true population parameters $\mu$ and $\sigma^2$, i.e. $X_i \sim N(\mu, \sigma^2)$. In particular, the joint density of $n$ IID Normal random variables reads as in Equation 11.7. Thus, the likelihood is a function of the unknown parameters given the observed data, i.e.
$$\mathcal{L}(\mu, \sigma^2; \mathbf{x}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{-\frac{(x_i - \mu)^2}{2\sigma^2}\right\}.$$
The log-likelihood function (Equation 11.2) is obtained by taking the natural logarithm of the likelihood, i.e.
$$\ell(\mu, \sigma^2; \mathbf{x}) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
Partial derivatives with respect to the parameters $\mu$ and $\sigma^2$. The partial derivative of the log-likelihood with respect to the mean parameter for the $i$-th observation reads
$$\frac{\partial \ell(\mu, \sigma^2; x_i)}{\partial \mu} = \frac{x_i - \mu}{\sigma^2},$$
and the total partial derivative reads
$$\frac{\partial \ell(\mu, \sigma^2; \mathbf{x})}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu).$$
Similarly, the first derivative of the log-likelihood with respect to the variance parameter reads
$$\frac{\partial \ell(\mu, \sigma^2; x_i)}{\partial \sigma^2} = -\frac{1}{2\sigma^2} + \frac{(x_i - \mu)^2}{2\sigma^4},$$
and the total partial derivative reads
$$\frac{\partial \ell(\mu, \sigma^2; \mathbf{x})}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2.$$
Score vector: the complete score vector reads
$$\nabla \ell(\mu, \sigma^2; \mathbf{x}) = \begin{pmatrix} \dfrac{1}{\sigma^2}\displaystyle\sum_{i=1}^{n}(x_i - \mu) \\[2ex] -\dfrac{n}{2\sigma^2} + \dfrac{1}{2\sigma^4}\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2 \end{pmatrix}.$$
Notably, if the $X_i$ are truly Normal, then the expected value of the gradient (the Score) is zero at the value of the true parameters $\mu_0$ and $\sigma^2_0$, which is equivalent to saying that the model is correctly specified.
To find the maximum likelihood estimates of the unknown parameters, one has to search for the values of $\mu$ and $\sigma^2$ such that the observed score equations (Equation 11.3) are equal to zero. In general, this requires solving a system of first order conditions (Equation 11.6), i.e.
$$\begin{cases} \dfrac{1}{\sigma^2}\displaystyle\sum_{i=1}^{n}(x_i - \mu) = 0 \\[2ex] -\dfrac{n}{2\sigma^2} + \dfrac{1}{2\sigma^4}\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2 = 0 \end{cases}$$
Hence, one obtains a system of two equations in two unknowns. From the first equation, which depends only on $\mu$, the solution is exactly the sample mean, i.e.
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}.$$
Then, substituting $\hat{\mu}$ in the second equation gives
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2.$$
To compute the Hessian matrix (Equation 11.4), one has to write explicitly the second and cross derivatives, i.e.
$$\frac{\partial^{2} \ell}{\partial \mu^{2}} = -\frac{n}{\sigma^2}, \qquad \frac{\partial^{2} \ell}{\partial (\sigma^2)^{2}} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{i=1}^{n}(x_i - \mu)^2,$$
and the cross derivative in $\mu$ and $\sigma^2$, i.e.
$$\frac{\partial^{2} \ell}{\partial \mu \, \partial \sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^{n}(x_i - \mu).$$
Therefore, the Jacobian matrix computed at the MLE estimate reads
$$\mathcal{J}(\hat{\mu}, \hat{\sigma}^2; \mathbf{x}) = \begin{pmatrix} -\dfrac{n}{\hat{\sigma}^2} & 0 \\[2ex] 0 & -\dfrac{n}{2\hat{\sigma}^4} \end{pmatrix},$$
since the cross derivative computed at the MLE estimate is equal to zero. Hence, taking the expectation, the Hessian matrix reads
$$H(\mu, \sigma^2) = \begin{pmatrix} -\dfrac{n}{\sigma^2} & 0 \\[2ex] 0 & -\dfrac{n}{2\sigma^4} \end{pmatrix},$$
since $\mathbb{E}[X_i - \mu] = 0$ and $\mathbb{E}[(X_i - \mu)^2] = \sigma^2$. Moreover, the first diagonal element is always less than zero, and so is the second diagonal element, i.e.
$$-\frac{n}{\sigma^2} < 0, \qquad -\frac{n}{2\sigma^4} < 0.$$
Thus the Hessian is negative definite at the MLE, confirming a strict local maximum. Moreover, the theoretical Fisher information for $n$ IID observations is exactly $n$ times the Fisher information of a single observation, i.e.
$$\mathcal{I}(\mu, \sigma^2) = -H(\mu, \sigma^2) = n \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\[2ex] 0 & \dfrac{1}{2\sigma^4} \end{pmatrix}.$$
Therefore, asymptotically, the MLE estimators are normally distributed, i.e.
$$\begin{pmatrix} \hat{\mu} \\ \hat{\sigma}^2 \end{pmatrix} \approx N\!\left( \begin{pmatrix} \mu \\ \sigma^2 \end{pmatrix}, \; \begin{pmatrix} \dfrac{\sigma^2}{n} & 0 \\[2ex] 0 & \dfrac{2\sigma^4}{n} \end{pmatrix} \right).$$
Example: Maximum likelihood estimate for a Normal
Example 11.1 Let’s simulate 5000 observations from an IID Normal distribution with mean $\mu = 1$ and variance $\sigma^2 = 4$. Then, given the simulated sample, estimate the maximum likelihood parameters.
Maximum likelihood estimate
```r
library(dplyr)
library(purrr)

################ inputs ################
set.seed(1)
n_sim <- 5000                       # number of simulations

# true parameters
par <- c(mu = 1, sigma2 = 4)

# normal sample
x <- rnorm(n_sim, mean = par[1], sd = sqrt(par[2]))
########################################

# negative log-likelihood for a normal sample (optim() minimizes by default)
loglik_norm <- function(params, x){
  -sum(dnorm(x, mean = params[1], sd = params[2], log = TRUE))
}

# Numerical optimization
opt <- optim(c(0, 1), loglik_norm, x = x)

# MLE estimates (mean, std. deviation) and maximized log-likelihood
best_par <- tibble(mu = opt$par[1], sigma = opt$par[2], loglik = -opt$value)
```
|        | Mean  | Std. deviation |
|--------|-------|----------------|
| Sample | 0.994 | 2.053          |
| MLE    | 0.994 | 2.053          |
Table 11.1: Mean and std. deviation computed on the simulated sample and MLE estimates.
Figure 11.1: Log-likelihood function for a normal sample.
11.2 Quasi-Maximum Likelihood
Quasi-Maximum Likelihood Estimation (QMLE), also known as pseudo-maximum likelihood, is an estimation strategy used when the full distribution may be misspecified but a parametric likelihood can still be posited (e.g., correct conditional mean/variance form, wrong errors). In other words, we assume a parametric likelihood function for the data (perhaps incorrectly) and then maximize it as if it were true.
The QML estimator sacrifices some efficiency (if the model is misspecified, there could exist estimators with a lower variance) for robustness: it converges to the pseudo-true parameter (the best approximation within the assumed family), and valid inference follows from robust (sandwich) standard errors. This approach yields the quasi-MLE, an estimator that behaves like the true MLE under certain conditions but remains consistent even if the assumed distribution is wrong (see White (1982)).
11.2.1 QML Estimator
Let $f(\cdot\,; \theta)$ be the assumed parametric density function depending on the vector of parameters $\theta$. Then, even if the density is misspecified, we define the pseudo-true parameter $\theta^{*}$ as the value that maximizes the expected log-likelihood under the true distribution (see Gourieroux, Monfort, and Trognon (1984)), i.e.
$$\theta^{*} = \arg\max_{\theta \in \Theta} \mathbb{E}_{0}\big[\log f(X; \theta)\big],$$
where the expectation $\mathbb{E}_{0}$ is taken under the true data-generating process. In this case, the log-likelihood used is also called pseudo-likelihood, because we do not require it to be computed under the true density. Equivalently, $\theta^{*}$ is the value that minimizes the Kullback–Leibler (KL) divergence between the true density and the (possibly misspecified) parametric density. If the model is correctly specified, $\theta^{*} = \theta_0$ and QMLE coincides with MLE.
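A minimal numerical sketch of the pseudo-true parameter (an assumed example, not from the original text): fit a (misspecified) Exponential model to Gamma-distributed data. The expected Exponential log-likelihood $\mathbb{E}_0[\log\lambda - \lambda X]$ is maximized at the rate matching the true mean, $\lambda^{*} = 1/\mathbb{E}_0[X]$; the Gamma parameters below are illustrative.

```r
# Sketch (assumed example): pseudo-true parameter when an Exponential model
# is fitted to data that are truly Gamma distributed.
set.seed(123)
shape <- 2; scale <- 3                      # true Gamma data-generating process
x <- rgamma(1e5, shape = shape, scale = scale)

# QMLE under the (misspecified) Exponential model
negloglik <- function(lambda) -sum(dexp(x, rate = lambda, log = TRUE))
lambda_qmle <- optimize(negloglik, interval = c(1e-4, 10))$minimum

# Pseudo-true parameter implied by the true distribution: 1 / E[X] = 1 / (shape * scale)
c(qmle = lambda_qmle, pseudo_true = 1 / (shape * scale))
```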
The QML estimator $\hat{\theta}_{QML}$ is the sample counterpart and maximizes the average log-likelihood given the realized sample $\mathbf{x} = (x_1, \dots, x_n)$, i.e.
$$\hat{\theta}_{QML} = \arg\max_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n} \log f(x_i; \theta).$$
Then, to find $\hat{\theta}_{QML}$ one proceeds exactly as in ordinary MLE optimization, finding the vector of parameters that maximizes the average log-likelihood.
11.2.2 Properties
Consistency of QMLE: under the assumption that the model is identifiable at $\theta^{*}$, meaning that the expected log-likelihood has a unique maximum at $\theta^{*}$, and under standard regularity conditions (e.g. continuity of $\log f(x; \theta)$ in $\theta$), the QMLE converges in probability, as $n \to \infty$, to the pseudo-true parameter, i.e.
$$\hat{\theta}_{QML} \xrightarrow{p} \theta^{*}.$$
If the model is correctly specified, then the pseudo-true parameter coincides with the actual true parameter, i.e. $\theta^{*} = \theta_0$, and QMLE reduces to ordinary MLE. However, when the model is misspecified, the QMLE converges to the $\theta^{*}$ that produces the best approximation of the true distribution, in the sense that it minimizes their KL divergence.
Asymptotic normality: as for the MLE, under regularity conditions the QMLE is asymptotically normal. However, the asymptotic distribution must account for possible misspecification. More precisely, let’s define the expected Hessian computed at the pseudo-true parameter as
$$A(\theta^{*}) = \mathbb{E}_{0}\!\left[\frac{\partial^{2} \log f(X; \theta)}{\partial \theta \, \partial \theta^{\top}}\bigg|_{\theta = \theta^{*}}\right]$$
and the expected outer product of the scores as
$$B(\theta^{*}) = \mathbb{E}_{0}\!\left[\frac{\partial \log f(X; \theta)}{\partial \theta}\frac{\partial \log f(X; \theta)}{\partial \theta^{\top}}\bigg|_{\theta = \theta^{*}}\right].$$
Then, a result derived by Huber in the context of M-estimators and by White in the context of misspecified likelihoods establishes that, under misspecification,
$$\sqrt{n}\,\big(\hat{\theta}_{QML} - \theta^{*}\big) \xrightarrow{d} N\big(0, \, A(\theta^{*})^{-1} B(\theta^{*}) A(\theta^{*})^{-1}\big),$$
and the asymptotic variance is a sandwich of the two matrices. If the model is correctly specified, then $B(\theta^{*}) = -A(\theta^{*}) = \mathcal{I}_1(\theta_0)$ and one obtains
$$\sqrt{n}\,\big(\hat{\theta}_{QML} - \theta_0\big) \xrightarrow{d} N\big(0, \, \mathcal{I}_1(\theta_0)^{-1}\big).$$
In this case, the asymptotic variance simplifies and we recover the ML result that the estimator is asymptotically efficient; more precisely, the variance-covariance matrix of the QMLE reduces to $\frac{1}{n}\mathcal{I}_1(\theta_0)^{-1}$. A noteworthy point is that QMLE is generally not fully efficient if the model is misspecified: there might exist other estimators targeting the same quantity that have smaller asymptotic variance if one knew more about the true distribution. But QMLE has the convenience of likelihood-based estimation and retains consistency and asymptotic normality without needing the full truth.
11.2.3 Sandwich Standard Errors
Given the asymptotic variance formula above, a crucial practical issue is how to compute standard errors for QMLE estimates. If one naively computes standard errors as if the assumed model were true (for example, using just the inverse Hessian), the inference can be invalid when the model is misspecified. The remedy is to use robust or sandwich standard errors, often called heteroskedasticity-consistent standard errors in econometrics (see White (1980)).
In practice, to compute the robust variance estimate, one proceeds as follows:
First, obtain the QMLE estimate $\hat{\theta}_{QML}$ by maximizing the pseudo log-likelihood.
Second, compute the average observed Hessian (information) matrix at the estimated parameter, i.e.
$$\hat{A} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial^{2} \log f(x_i; \theta)}{\partial \theta \, \partial \theta^{\top}}\bigg|_{\theta = \hat{\theta}_{QML}}.$$
Third, compute the outer product of gradients (OPG) matrix at the QMLE:
$$\hat{B} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial \log f(x_i; \theta)}{\partial \theta}\frac{\partial \log f(x_i; \theta)}{\partial \theta^{\top}}\bigg|_{\theta = \hat{\theta}_{QML}}.$$
This uses the scores for each observation and does not assume the model is correct. Hence, the estimated variance-covariance matrix of the QML parameters reads
$$\widehat{\mathrm{Var}}\big(\hat{\theta}_{QML}\big) = \frac{1}{n}\,\hat{A}^{-1}\hat{B}\hat{A}^{-1},$$
where the diagonal elements give the variance estimates for each parameter, while the off-diagonal elements give their covariances. Note that the $1/n$ rescaling applies when $\hat{A}$ and $\hat{B}$ are computed as averages; no rescaling is needed if they are computed as sums.
If the model is actually correctly specified, then $\hat{B}$ should be close to $-\hat{A}$ in large samples, and the sandwich variance reduces to the usual Fisher-information-based variance $\frac{1}{n}(-\hat{A})^{-1}$. Thus, the robust variance estimator is a conservative generalization: it equals the classical one in the ideal case, but remains valid in the non-ideal case.
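A minimal sketch of these steps (an assumed example, not from the original text): Gaussian QMLE of the mean and variance applied to skewed (chi-square) data, with the sandwich variance computed from the analytic per-observation scores and Hessian of the Gaussian log-likelihood; the data-generating process and sample size are arbitrary.

```r
# Sketch (assumed example): robust (sandwich) standard errors for the Gaussian
# QMLE of (mu, sigma2) when the data are actually drawn from a skewed distribution.
set.seed(2024)
n <- 2000
x <- rchisq(n, df = 3)                       # true DGP: chi-square, not Normal

# Gaussian QMLE has closed form: sample mean and (biased) sample variance
mu_hat     <- mean(x)
sigma2_hat <- mean((x - mu_hat)^2)

# Per-observation scores of the Gaussian log-likelihood at the QMLE
s_mu     <- (x - mu_hat) / sigma2_hat
s_sigma2 <- -1 / (2 * sigma2_hat) + (x - mu_hat)^2 / (2 * sigma2_hat^2)
scores   <- cbind(s_mu, s_sigma2)

# A_hat: average Hessian of the Gaussian log-likelihood at the QMLE
A_hat <- matrix(c(-1 / sigma2_hat, 0,
                  0, -1 / (2 * sigma2_hat^2)), nrow = 2)
# B_hat: average outer product of the per-observation scores
B_hat <- crossprod(scores) / n

# Sandwich variance (rescaled by n) versus the naive inverse-information variance
V_sandwich <- solve(A_hat) %*% B_hat %*% solve(A_hat) / n
V_naive    <- solve(-A_hat) / n
cbind(robust_se = sqrt(diag(V_sandwich)), naive_se = sqrt(diag(V_naive)))
```

Because the chi-square data are right-skewed, the robust standard error for the variance parameter is noticeably larger than the naive one, while the two essentially agree for the mean.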
Gourieroux, Christian, Alain Monfort, and Alain Trognon. 1984. “Pseudo Maximum Likelihood Methods: Theory.” Econometrica 52 (3): 681–700. https://www.jstor.org/stable/1913471.
White, Halbert. 1980. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.” Econometrica 48 (4): 817–38. https://doi.org/10.2307/1912934.