References: Angela Montanari (2017), Chapter 3; Gardini, Costa, and Cavaliere (2000).
Assumptions
Let’s start from the classic assumptions for a linear model. The working hypotheses are:
- The linear model approximates the conditional expectation, i.e. $\mathbb{E}(Y_i \mid x_i) = x_i^{\top}\beta$ for $i = 1, \dots, n$.
- The conditional variance of the response variable is constant, i.e. $\mathbb{V}(Y_i \mid x_i) = \sigma^2$ with $i = 1, \dots, n$.
- The conditional covariance of the response variable is zero, i.e. $\operatorname{Cov}(Y_i, Y_j \mid x_i, x_j) = 0$ with $i \neq j$ and $i, j = 1, \dots, n$.
Equivalently, the formulation in terms of the stochastic component $\varepsilon$ reads:
- $\mathbb{E}(\varepsilon_i) = 0$ for $i = 1, \dots, n$.
- The residuals and the regressors are uncorrelated, i.e. $\operatorname{Cov}(x_i, \varepsilon_i) = 0$.
- The conditional variance of the residuals is constant, i.e. $\mathbb{V}(\varepsilon_i \mid x_i) = \sigma^2$ with $i = 1, \dots, n$.
- The conditional covariance of the residuals is zero, i.e. $\operatorname{Cov}(\varepsilon_i, \varepsilon_j \mid x_i, x_j) = 0$ with $i \neq j$ and $i, j = 1, \dots, n$.
Hence, in this setup the error terms are assumed to be independent and identically distributed with equal variance $\sigma^2$. Thus, the general expression of the covariance matrix in Equation 14.16 reduces to $\mathbb{V}(\varepsilon) = \sigma^2 I_n$.
Estimator of $\beta$
Proposition 15.1 (OLS estimator of $\beta$)
The ordinary least squares (OLS) estimator is the function that minimizes the sum of the squared residuals and returns an estimate $\hat{\beta}$ of the true parameter $\beta$, i.e.
$$\varepsilon^{\top}\varepsilon = \sum_{i=1}^{n} \varepsilon_i^2 = (y - X\beta)^{\top}(y - X\beta) \, .$$
Formally, the OLS estimator is the solution of the following minimization problem, i.e.
$$\hat{\beta} = \arg\min_{\beta} \; (y - X\beta)^{\top}(y - X\beta) \, . \tag{15.2}$$
Notably, if $X^{\top}X$ is non-singular one obtains an analytic expression, i.e.
$$\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y \, . \tag{15.3}$$
Equivalently, it is possible to express Equation 15.3 in terms of the covariance matrix of the regressors and the covariances between the regressors and $y$, i.e.
$$\hat{\beta} = S_{XX}^{-1}\, S_{Xy} \, , \tag{15.4}$$
where $S_{XX}$ is the covariance matrix of the regressors and $S_{Xy}$ the vector of covariances between the regressors and $y$ (for centred variables, $S_{XX} = \tfrac{1}{n}X^{\top}X$ and $S_{Xy} = \tfrac{1}{n}X^{\top}y$).
Note that the solution is available if and only if $X^{\top}X$ is non-singular. Hence, the columns of $X$ must not be linearly dependent. In fact, if one of the $x$-variables can be written as a linear combination of the others, then the determinant of the matrix $X^{\top}X$ is zero and the inversion is not possible. Moreover, for $X^{\top}X$ to be of full rank, the number of observations has to be greater than or equal to the number of regressors, i.e. $n \geq p$.
Proof. Developing the product of the residuals in Equation 15.2:
$$\varepsilon^{\top}\varepsilon = (y - X\beta)^{\top}(y - X\beta) = y^{\top}y - 2\beta^{\top}X^{\top}y + \beta^{\top}X^{\top}X\beta \, .$$
To find the minimum, let’s compute the derivative of $\varepsilon^{\top}\varepsilon$ with respect to $\beta$, set it equal to zero and solve for $\beta$, i.e.
$$\frac{\partial \, \varepsilon^{\top}\varepsilon}{\partial \beta} = -2X^{\top}y + 2X^{\top}X\beta = 0 \quad \Longrightarrow \quad \hat{\beta} = (X^{\top}X)^{-1}X^{\top}y \, .$$
To establish whether the above solution also corresponds to a global minimum, one must check the sign of the second derivative, i.e.
$$\frac{\partial^2 \, \varepsilon^{\top}\varepsilon}{\partial \beta \, \partial \beta^{\top}} = 2X^{\top}X \, ,$$
that, being positive definite in this case, denotes a global minimum. An alternative derivation of this estimator, as in Equation 15.4, is obtained by substituting $X^{\top}X = nS_{XX}$ and $X^{\top}y = nS_{Xy}$ in Equation 15.3.
If a column of ones is included in the data matrix $X$, then the intercept parameter is obtained directly from Equation 15.3 or Equation 15.4. However, if it is not included, it is computed as:
$$\hat{\beta}_0 = \bar{y} - \bar{x}^{\top}\hat{\beta} \, ,$$
where $\bar{y}$ is the sample mean of $y$ and $\bar{x}$ the vector of sample means of the regressors.
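As a numerical companion, here is a minimal NumPy sketch of Equation 15.3 on simulated data; the design matrix, the true coefficients and the seed are illustrative assumptions of the example, not part of the notes. `np.linalg.solve` is used instead of an explicit inverse for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: n observations, a column of ones plus p - 1 regressors.
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# OLS estimate: beta_hat = (X'X)^{-1} X'y (Equation 15.3),
# computed by solving the normal equations X'X beta = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true
```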
Projection matrices
Substituting the OLS solution (Equation 15.3) in Equation 14.12 we obtain the matrix $H$, which projects the vector $y$ onto the subspace of $\mathbb{R}^n$ generated by the columns of the matrix of the regressors $X$, i.e.
$$\hat{y} = X\hat{\beta} = X(X^{\top}X)^{-1}X^{\top}y = Hy \, , \qquad H = X(X^{\top}X)^{-1}X^{\top} \, .$$
The projection matrix $H$ satisfies the following three properties:
- $H$ is an $n \times n$ symmetric matrix.
- $H$ is idempotent, i.e. $HH = H$.
- $HX = X$.
Substituting the OLS solution (Equation 15.3) in the residuals (Equation 14.15) we obtain another projection matrix $M$, which projects the vector $y$ onto the subspace orthogonal to the subspace generated by the matrix of the regressors $X$, i.e.
$$\hat{\varepsilon} = y - X\hat{\beta} = (I_n - H)\,y = My \, , \qquad M = I_n - X(X^{\top}X)^{-1}X^{\top} \, .$$
The projection matrix $M$ satisfies the following three properties:
- $M$ is an $n \times n$ symmetric matrix.
- $M$ is idempotent, i.e. $MM = M$.
- $MX = 0$.
By definition $H$ and $M$ are orthogonal, i.e. $HM = MH = 0$. Hence, the fitted values $\hat{y} = Hy$ are the projection of the empirical values $y$ onto the subspace generated by $X$. Symmetrically, the fitted residuals $\hat{\varepsilon} = My$ are the projection of the empirical values $y$ onto the subspace orthogonal to the one generated by $X$.
Proof. Let’s consider the property 2. of $H$, i.e.
$$HH = X(X^{\top}X)^{-1}X^{\top}X(X^{\top}X)^{-1}X^{\top} = X(X^{\top}X)^{-1}X^{\top} = H \, .$$
Let’s consider the property 3. of $H$, i.e.
$$HX = X(X^{\top}X)^{-1}X^{\top}X = X \, .$$
Let’s consider the property 2. of $M$, i.e.
$$MM = (I_n - H)(I_n - H) = I_n - 2H + HH = I_n - H = M \, .$$
Let’s consider the property 3. of $M$, i.e.
$$MX = (I_n - H)X = X - HX = X - X = 0 \, .$$
Finally, let’s prove the orthogonality between $H$ and $M$, i.e.
$$HM = H(I_n - H) = H - HH = H - H = 0 \, .$$
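These properties can also be checked numerically. The sketch below, on assumed simulated data, verifies symmetry, idempotency, $HX = X$, $MX = 0$ and the orthogonality $HM = 0$ up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X'X)^{-1} X'
M = np.eye(n) - H                       # residual maker M = I - H

assert np.allclose(H, H.T) and np.allclose(M, M.T)      # symmetry
assert np.allclose(H @ H, H) and np.allclose(M @ M, M)  # idempotency
assert np.allclose(H @ X, X) and np.allclose(M @ X, 0)  # HX = X, MX = 0
assert np.allclose(H @ M, 0)                            # orthogonality

y_hat, eps_hat = H @ y, M @ y           # fitted values and residuals
assert np.allclose(y_hat + eps_hat, y)  # y is recovered from the two projections
```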
Properties of the OLS estimator
Theorem 15.1 (Gauss-Markov)
Under the Gauss-Markov hypotheses the ordinary least squares (OLS) estimate is BLUE (Best Linear Unbiased Estimator), where “best” stands for the estimator with minimum variance in the class of linear unbiased estimators of the unknown true population parameter $\beta$. More precisely, the Gauss-Markov hypotheses are:
- The model is linear in the parameters, i.e. $y = X\beta + \varepsilon$.
- The errors have zero mean, i.e. $\mathbb{E}(\varepsilon) = 0$.
- $\mathbb{V}(\varepsilon) = \mathbb{E}(\varepsilon\varepsilon^{\top}) = \sigma^2 I_n$, i.e. homoskedasticity and absence of correlation among the errors.
- $X$ is non-stochastic and independent from the errors $\varepsilon_i$ for all $i$’s.
Proposition 15.2 (Properties of $\hat{\beta}$)
1. Unbiased: $\hat{\beta}$ is correct and its conditional expectation is equal to the true parameter in population, i.e. $\mathbb{E}(\hat{\beta} \mid X) = \beta$.
2. Linear: it can be written as a linear combination of $X$ and $y$, i.e. $\hat{\beta} = Ay$, where the weights $A = (X^{\top}X)^{-1}X^{\top}$ do not depend on $y$.
3. Efficient: under the Gauss-Markov hypotheses (Theorem 15.1), $\hat{\beta}$ is the estimator with minimum variance in the class of unbiased linear estimators of $\beta$, and its variance reads:
$$\mathbb{V}(\hat{\beta} \mid X) = \sigma^2 (X^{\top}X)^{-1} \, .$$
Proof.
The OLS estimator is correct: its expected value, computed from Equation 15.3 by substituting Equation 14.12, is equal to the true parameter in population, i.e.
$$\mathbb{E}(\hat{\beta} \mid X) = \mathbb{E}\big((X^{\top}X)^{-1}X^{\top}(X\beta + \varepsilon) \mid X\big) = \beta + (X^{\top}X)^{-1}X^{\top}\,\mathbb{E}(\varepsilon \mid X) = \beta \, .$$
In general, applying the properties of the variance operator, the variance of $\hat{\beta}$ is computed as:
$$\mathbb{V}(\hat{\beta} \mid X) = \mathbb{V}\big((X^{\top}X)^{-1}X^{\top}y \mid X\big) \, .$$
Then, since $X$ is non-stochastic, one can bring it outside the variance, thus obtaining:
$$\mathbb{V}(\hat{\beta} \mid X) = (X^{\top}X)^{-1}X^{\top}\,\mathbb{V}(y \mid X)\,X(X^{\top}X)^{-1} \, . \tag{15.11}$$
Under the Gauss-Markov hypotheses (Theorem 15.1) the conditional variance is $\mathbb{V}(y \mid X) = \sigma^2 I_n$ and therefore Equation 15.11 reduces to:
$$\mathbb{V}(\hat{\beta} \mid X) = \sigma^2 (X^{\top}X)^{-1} \, .$$
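A small Monte Carlo check may help here: holding an assumed design $X$ fixed and re-drawing the errors, the empirical covariance of the OLS estimates should approach $\sigma^2 (X^{\top}X)^{-1}$. The data-generating values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 200, 3, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # fixed design
beta = np.array([1.0, 2.0, -0.5])

# Monte Carlo: re-draw the errors many times with X held fixed.
reps = 20_000
estimates = np.empty((reps, p))
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)
for r in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates[r] = XtX_inv_Xt @ y

empirical_cov = np.cov(estimates, rowvar=False)
theoretical_cov = sigma**2 * np.linalg.inv(X.T @ X)
print(np.round(empirical_cov - theoretical_cov, 4))  # entries close to zero
```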
Variance decomposition
In a linear model, the deviance (or total variance) of the dependent variable can be decomposed into the sum of the regression variance and the dispersion variance. This decomposition helps us understand how much of the total variability in the data is explained by the model and how much is due to unexplained variability (residuals).
Total Deviance ($SS_T$): represents the total variability of the dependent variable $y$. It is calculated as the sum of the squared differences of $y_i$ from its mean $\bar{y}$, i.e. $SS_T = \sum_{i=1}^{n}(y_i - \bar{y})^2$.
Regression Deviance ($SS_R$): represents the portion of variability that is explained by the regression model. It is computed as the sum of the squared differences between the fitted values $\hat{y}_i$ and $\bar{y}$, i.e. $SS_R = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$.
Dispersion Deviance ($SS_E$): represents the portion of variability that is not explained by the model. It is computed as the sum of the squared differences between the observed values $y_i$ and the fitted values $\hat{y}_i$ (Equation 14.14), i.e. $SS_E = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \hat{\varepsilon}^{\top}\hat{\varepsilon}$.
Hence, the total deviance of $y$ can be decomposed as follows:
$$SS_T = SS_R + SS_E \, . \tag{15.12}$$
Proof. Let’s prove the expression for the regression deviance $SS_R$, i.e.
$$SS_T = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}\big((y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\big)^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \, .$$
The cross-product term is zero, since the residuals are orthogonal to the fitted values ($\hat{\varepsilon}^{\top}\hat{y} = y^{\top}MHy = 0$) and, when the model contains an intercept, they sum to zero. Hence $SS_T = SS_E + SS_R$.
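The decomposition in Equation 15.12 can be verified numerically on any fit that includes an intercept; the sketch below uses assumed simulated data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.8, -0.3]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

ss_t = np.sum((y - y.mean()) ** 2)      # total deviance
ss_r = np.sum((y_hat - y.mean()) ** 2)  # regression deviance
ss_e = np.sum((y - y_hat) ** 2)         # dispersion deviance

assert np.isclose(ss_t, ss_r + ss_e)    # Equation 15.12
print(ss_r / ss_t)                      # share of the total deviance explained by the model
```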
Estimator of $\sigma^2$
The OLS estimator $\hat{\beta}$ does not depend on the variance of the residuals $\sigma^2$, and it is not possible to obtain both estimators in one step. As far as we know, $\sigma^2$ is the variance of the residuals, of which we know the realized values $\hat{\varepsilon}$ on the sample. Hence, let’s define an unbiased estimator of the population variance $\sigma^2$ as:
$$s^2 = \frac{\hat{\varepsilon}^{\top}\hat{\varepsilon}}{n - p} = \frac{SS_E}{n - p} \, .$$
In general the regression variance overestimates the true variance $\sigma^2$, i.e.
$$\mathbb{E}\left(\frac{SS_R}{p - 1}\right) \geq \sigma^2 \, .$$
Only in the special case where $\beta_1 = \dots = \beta_{p-1} = 0$ in population, then $\mathbb{E}\big(SS_R/(p-1)\big) = \sigma^2$ and also the regression variance produces a correct estimate of $\sigma^2$.
Proof. By definition, the residuals can be computed by pre-multiplying the matrix $M$ to $y$, i.e.
$$\hat{\varepsilon} = My \, .$$
Substituting $y = X\beta + \varepsilon$, since $MX = 0$, one obtains
$$\hat{\varepsilon} = M(X\beta + \varepsilon) = MX\beta + M\varepsilon = M\varepsilon \, .$$
Then, $M$ being symmetric and idempotent:
$$\hat{\varepsilon}^{\top}\hat{\varepsilon} = \varepsilon^{\top}M^{\top}M\varepsilon = \varepsilon^{\top}M\varepsilon \, .$$
Thus, since $\varepsilon^{\top}M\varepsilon$ is a scalar, the expected value of the deviance of dispersion reads
$$\mathbb{E}(\hat{\varepsilon}^{\top}\hat{\varepsilon}) = \mathbb{E}\big(\operatorname{tr}(\varepsilon^{\top}M\varepsilon)\big) = \mathbb{E}\big(\operatorname{tr}(M\varepsilon\varepsilon^{\top})\big) = \operatorname{tr}\big(M\,\mathbb{E}(\varepsilon\varepsilon^{\top})\big) = \sigma^2 \operatorname{tr}(M) \, ,$$
where the trace of the matrix $M$ is:
$$\operatorname{tr}(M) = \operatorname{tr}(I_n) - \operatorname{tr}\big(X(X^{\top}X)^{-1}X^{\top}\big) = n - \operatorname{tr}\big((X^{\top}X)^{-1}X^{\top}X\big) = n - p \, .$$
Hence the expectation of the deviance of dispersion is equal to
$$\mathbb{E}(\hat{\varepsilon}^{\top}\hat{\varepsilon}) = \sigma^2 (n - p) \quad \Longrightarrow \quad \mathbb{E}(s^2) = \mathbb{E}\left(\frac{\hat{\varepsilon}^{\top}\hat{\varepsilon}}{n - p}\right) = \sigma^2 \, .$$
The decomposition of the deviance of $y$ holds true also with respect to the corresponding degrees of freedom, i.e. $(n - 1) = (p - 1) + (n - p)$.
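As an illustrative check (the simulated data and seed are assumptions of the example), the following Monte Carlo sketch compares the unbiased estimator $s^2 = SS_E/(n-p)$ with the naive version that divides by $n$, which underestimates $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 30, 4, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([0.5, 1.0, -1.0, 0.2])

reps = 50_000
s2, s2_naive = np.empty(reps), np.empty(reps)
for r in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    eps_hat = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    ss_e = eps_hat @ eps_hat
    s2[r] = ss_e / (n - p)    # unbiased: divides by the residual degrees of freedom
    s2_naive[r] = ss_e / n    # divides by n: biased downwards

print(s2.mean(), s2_naive.mean(), sigma**2)  # roughly 4.0, 3.5, 4.0
```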
The $R^2$ statistic, also known as the coefficient of determination, is a measure used to assess the goodness of fit of a regression model. In a multivariate context, it evaluates how well the independent variables explain the variability of the dependent variable.
Definition 15.1 ($R^2$)
The $R^2$ represents the proportion of the variation in the dependent variable that is explained or predicted by the independent variables. Formally, it is defined as the ratio of the deviance explained by the model ($SS_R$) to the total deviance ($SS_T$). It can also be expressed as one minus the ratio of the residual deviance ($SS_E$) to the total deviance, i.e.
$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T} \, . \tag{15.13}$$
Using the variance decomposition (Equation 15.12), it is possible to write a multivariate version of the $R^2$ as:
$$R^2 = \frac{\hat{\beta}^{\top} S_{XX} \hat{\beta}}{s_y^2} = 1 - \frac{s_{\hat{\varepsilon}}^2}{s_y^2} \, ,$$
where $S_{XX}$ is the covariance matrix of the regressors, $s_y^2$ the variance of $y$ and $s_{\hat{\varepsilon}}^2$ the variance of the residuals.
The numerator represents the variance explained by the regression model, while the denominator represents the total variance of the dependent variable. The term $s_{\hat{\varepsilon}}^2$ in the second expression represents the variance of the residuals, i.e. the variance not explained by the model. A value of the $R^2$ close to 1 denotes that a large proportion of the variability of the dependent variable has been explained by the regression model, while a value close to 0 indicates that the model explains very little of the variability.
The elements on the diagonal of the matrix $\mathbb{V}(\hat{\beta}) = \sigma^2 (X^{\top}X)^{-1}$ determine the variances, while the other elements the covariances. In general the variance of the coefficient of the $j$-th regressor is denoted as $\mathbb{V}(\hat{\beta}_j) = \sigma^2 v_{jj}$, where $v_{jj}$ is the $j$-th element on the diagonal of $(X^{\top}X)^{-1}$. An alternative expression for the variance is:
$$\mathbb{V}(\hat{\beta}_j) = \frac{\sigma^2}{n \, s_{x_j}^2 \, (1 - R_j^2)} \, ,$$
where $s_{x_j}^2$ is the variance of the $j$-th regressor and $R_j^2$ is the multivariate coefficient of determination of the regression of $x_j$ on the other regressors. The term $\frac{1}{1 - R_j^2}$ is also called Variance Inflation Factor and denoted as $VIF_j$.
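A brute-force sketch of the Variance Inflation Factor, in which each regressor is regressed on the remaining ones; the helper `vif` and the simulated collinear data are assumptions of the example, not part of the notes.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factors for the columns of X (intercept excluded).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R^2 of the regression of
    the j-th regressor on all the remaining regressors (plus an intercept).
    """
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef = np.linalg.solve(others.T @ others, others.T @ xj)
        resid = xj - others @ coef
        r2_j = 1 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
        out[j] = 1 / (1 - r2_j)
    return out

# Two nearly collinear regressors and one independent regressor.
rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))  # large VIF for x1 and x2, close to 1 for x3
```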
The $R^2$ statistic has some limitations. Firstly, it can be close to 1 even if the relationship between the variables is not linear. Additionally, $R^2$ increases (or at least does not decrease) whenever a new regressor is added to the model, making it unsuitable for comparing models with different numbers of regressors.
Definition 15.2 A more robust indicator that does not always increase with the addition of a new regressor is the adjusted $R^2$, which is computed as:
$$\bar{R}^2 = 1 - \frac{SS_E / (n - p)}{SS_T / (n - 1)} = 1 - (1 - R^2)\,\frac{n - 1}{n - p} \, .$$
The $\bar{R}^2$ can be negative, and its value will always be less than or equal to that of $R^2$. Unlike $R^2$, the adjusted version increases only when the new explanatory variable improves the model more than would be expected simply by adding another variable.
Proof. To arrive at the formulation of the adjusted $R^2$, let’s consider that under the null hypothesis $H_0: \beta_1 = \dots = \beta_{p-1} = 0$ the variance of regression (Table 15.1) is a correct estimate of the variance of the residuals $\sigma^2$. Hence, under $H_0$:
$$\mathbb{E}\left(\frac{SS_R}{p - 1}\right) = \sigma^2 \, .$$
This implies that the expectation of the $R^2$ is not zero (as it should be under $H_0$) but:
$$\mathbb{E}(R^2) = \frac{p - 1}{n - 1} \, .$$
Let’s rescale the $R^2$ such that when $H_0$ holds true it is equal to zero, i.e.
$$R^2 - \frac{p - 1}{n - 1} \, .$$
However, this specification implies that when $R^2 = 1$ (perfect linear relation between $X$ and $y$) the value of the indicator is not 1, i.e. $1 - \frac{p-1}{n-1} = \frac{n-p}{n-1}$. Hence, let’s correct again the indicator such that it takes values in $[0, 1]$, i.e.
$$\bar{R}^2 = \left(R^2 - \frac{p - 1}{n - 1}\right)\frac{n - 1}{n - p} \, .$$
Remembering that $R^2$ can be rewritten as in Equation 15.13, one obtains:
$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p} = 1 - \frac{SS_E / (n - p)}{SS_T / (n - 1)} \, .$$
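The following sketch (with an assumed helper `r2_stats` and simulated data) contrasts $R^2$ and $\bar{R}^2$ when an irrelevant regressor is added to the model.

```python
import numpy as np

def r2_stats(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit; X must include the intercept column."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    ss_e = resid @ resid
    ss_t = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_e / ss_t
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p)
    return r2, r2_adj

rng = np.random.default_rng(5)
n = 60
x1 = rng.normal(size=n)
noise = rng.normal(size=n)            # a regressor unrelated to y
y = 1 + 2 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_large = np.column_stack([np.ones(n), x1, noise])
print(r2_stats(X_small, y))  # true model
print(r2_stats(X_large, y))  # R^2 of the larger model is never smaller,
                             # while the adjusted R^2 may be
```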
Diagnostic
Let’s consider a linear model where the residuals are IID normally distributed random variables, i.e. $\varepsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$. Hence, the working hypotheses of the Gauss-Markov theorem hold true.
$t$-test for $\beta_j$
A $t$-test evaluates the significance of the parameter of a regressor, given the effect of the other regressors, by testing the null hypothesis of linear independence between $x_j$ and $y$, i.e.
$$H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0 \, .$$
Under the normality assumption on the distribution of the residuals, the vector of parameters $\hat{\beta}$ is distributed as a multivariate normal random vector, thus also the marginal distribution of $\hat{\beta}_j$ is normal. Therefore, given the expectation and variance of $\hat{\beta}_j$, one can standardize it to obtain
$$\frac{\hat{\beta}_j - \beta_j}{\sqrt{\sigma^2 v_{jj}}} \sim \mathcal{N}(0, 1) \, .$$
Substituting the unknown $\sigma^2$ with its correct estimate $s^2$, one obtains the statistic, i.e.
$$\frac{\hat{\beta}_j - \beta_j}{\sqrt{s^2 v_{jj}}} \sim t_{n-p} \, , \tag{15.14}$$
that is Student-$t$ distributed (Equation 35.2) with $n - p$ degrees of freedom. Under the null hypothesis one obtains the $t$-test statistic, i.e.
$$t_j = \frac{\hat{\beta}_j}{\sqrt{s^2 v_{jj}}} \, .$$
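A minimal sketch of the $t$-test computed from the formulas above, using SciPy only for the Student-$t$ tail probabilities; the simulated data, in which the last coefficient is zero in population, are an assumption of the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.7, 0.0]) + rng.normal(size=n)  # last coefficient is zero

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                        # unbiased estimate of sigma^2
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))  # standard errors of beta_hat

t_stat = beta_hat / se                               # t_j under H0: beta_j = 0
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - p)   # two-sided p-values
print(np.column_stack([beta_hat, se, t_stat, p_value]))
```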
Confidence intervals for $\beta_j$
Under the assumption of normality, from Equation 15.14, one can build a confidence interval for $\beta_j$, i.e.
$$\hat{\beta}_j \pm t_{1-\alpha/2, \, n-p} \, \sqrt{s^2 v_{jj}} \, ,$$
where $1 - \alpha$ is the confidence level, $t_{1-\alpha/2, \, n-p}$ is the quantile at level $1 - \alpha/2$ of a Student-$t$ distribution with $n - p$ degrees of freedom and $v_{jj}$ is the $j$-th element on the diagonal of $(X^{\top}X)^{-1}$.
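The same quantities give the confidence intervals; the helper `conf_int` below is an illustrative assumption, not a library function.

```python
import numpy as np
from scipy import stats

def conf_int(X, y, alpha=0.05):
    """(1 - alpha) confidence intervals for the OLS coefficients."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    t_quant = stats.t.ppf(1 - alpha / 2, df=n - p)   # quantile t_{1-alpha/2, n-p}
    return np.column_stack([beta_hat - t_quant * se, beta_hat + t_quant * se])

rng = np.random.default_rng(7)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
print(conf_int(X, y))  # each row: lower and upper bound for beta_j
```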
F-test for the regression
The $F$-test evaluates the significance of the entire regression model by testing the null hypothesis of linear independence between $X$ and $y$, i.e.
$$H_0: \beta_1 = \beta_2 = \dots = \beta_{p-1} = 0 \, ,$$
where the only coefficient different from zero is the intercept. In this case, the test statistic reads
$$F = \frac{SS_R / (p - 1)}{SS_E / (n - p)} \, ,$$
that is distributed as an $F$-Fisher (Equation 35.3) with $p - 1$ and $n - p$ degrees of freedom. $SS_R/(p-1)$ is the regression variance and $SS_E/(n-p)$ is the dispersion variance. By fixing a significance level $\alpha$, the null hypothesis is rejected if $F > F_{1-\alpha, \, p-1, \, n-p}$. Remembering the relation between the deviances and the $R^2$, i.e. $SS_R = R^2 \, SS_T$ and $SS_E = (1 - R^2) \, SS_T$, it is possible to express the $F$-test in terms of the multivariate $R^2$ as:
$$F = \frac{R^2 / (p - 1)}{(1 - R^2) / (n - p)} \, .$$
If the null hypothesis is rejected then:
- The variability of explained by the model is significantly greater than the residual variability.
- At least one of the regressors has a coefficient that is significantly different from zero in the population.
On the contrary, if $H_0$ is not rejected, then the model is not adequate and there is no evidence of a linear relation between $X$ and $y$.
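Finally, a sketch of the $F$-test and of its equivalent formulation through $R^2$, on assumed simulated data, using SciPy for the $F$ tail probability.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, p = 120, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.6, -0.4, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
ss_r = np.sum((y_hat - y.mean()) ** 2)
ss_e = np.sum((y - y_hat) ** 2)

F = (ss_r / (p - 1)) / (ss_e / (n - p))        # F statistic
p_value = stats.f.sf(F, dfn=p - 1, dfd=n - p)  # P(F_{p-1, n-p} > F)

# Equivalent formulation through R^2.
r2 = ss_r / (ss_r + ss_e)
F_from_r2 = (r2 / (p - 1)) / ((1 - r2) / (n - p))
print(F, F_from_r2, p_value)                   # the two F values coincide
```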
Montanari, Angela. 2017. “Appunti Sui Modelli Lineari.”
Gardini, A., M. Costa, and G. Cavaliere. 2000. Econometria, Volume Primo. FrancoAngeli.