15  Classic least squares

Last modified

June 4, 2026

References: Angela Montanari (2017), Chapter 3. Gardini A. (2007).

15.1 Assumptions

Let’s start from the classic assumptions for a linear model. The working hypotheses are:

  1. The linear model approximates the conditional expectation, i.e. \(\mathbb{E}\{Y_i \mid \mathbf{x}_i\} = \mathbf{x}_i^{\top} \mathbf{b}\).
  2. The conditional variance of the response variable \(y\) is constant, i.e. \(\mathbb{V}\{Y_i \mid \mathbf{x}_i\} = \sigma_{\text{u}}^2\) with \(0 < \sigma_{\text{u}}^2 < \infty\).
  3. The response variables \(Y\) are uncorrelated, i.e. \(\mathbb{C}v\{Y_i, Y_j \mid \mathbf{x}_i\} = 0\) with \(i \neq j\) and \(i,j \in\{1, \dots, n\}\).

Equivalently the formulation in terms of the stochastic component reads

  1. The residuals have mean zero, i.e. \(\mathbb{E}\{u_i \mid \mathbf{x}_i\} = 0\) for all \(i\) with \(i = 1, \dots, n\).
  2. The conditional variance of the residuals is constant, i.e. \(\mathbb{V}\{u_i \mid \mathbf{x}_i\} = \sigma_{\text{u}}^2\) with \(0 < \sigma_{\text{u}}^2 < \infty\).
  3. The residuals and the regressors are uncorrelated, i.e. \(\mathbb{C}v\{u_i, u_j \mid \mathbf{x}_i\} = 0\) for all \(i \neq j\) and \(i,j = 1, \dots, n\).

Hence, in this setup the error terms are assumed to be independent and identically distributed with mean zero and equal variances \(\sigma_i^2 = \sigma^2\) for all \(i\). Thus, the general expression of the covariance matrix in Equation 14.15 reduces to \(\boldsymbol{\Sigma} = \sigma_{\text{u}}^2 \textbf{I}_{n}\).

Generate the matrix of data
# True variance-covariance matrix
sigma <- diag(1, 4)
sigma[2,2] <- 1.3
sigma[3,3] <- 0.7
sigma[4,4] <- 1.7
sigma[1,2] <- sigma[2,1] <- 0.3
sigma[1,3] <- sigma[3,1] <- 0.4
sigma[1,4] <- sigma[4,1] <- -0.8
# Simulate data
set.seed(1)
W <- mvtnorm::rmvnorm(10, sigma = sigma)
y <- W[,1]
X <- cbind(1, W[,-1])
n <- nrow(X)
k <- ncol(X)

15.2 Estimator of \(b\)

Proposition 15.1 (Ordinary Least Squares (OLS) estimator) The ordinary least squares estimator (OLS) is obtained by minimizing the sum of the squared residuals expressed by the function \[ \text{Q}^{\tiny\text{OLS}}(\mathbf{b}) = \hat{\mathbf{u}}(\mathbf{b})^{\top} \hat{\mathbf{u}}(\mathbf{b}) \text{.} \tag{15.1}\] that returns an estimate of the parameter \(\mathbf{b}\). In this context, the fitted residuals are seen as a function of the unknown \(\mathbf{b}\) (Equation 14.14).

Formally, the OLS estimator is the solution of the following minimization problem, i.e. \[ \mathbf{b}^{\tiny\text{OLS}} = \underset{\mathbf{b} \in \Theta_{\mathbf{b}}}{\text{argmin}} \left\{\text{Q}^{\tiny\text{OLS}}(\mathbf{b})\right\} \text{,} \tag{15.2}\] where \(\Theta_{\mathbf{b}} \subset \mathbb{R}^k\) is the parameter space. Notably, if \(\mathbf{X}\) is non-singular, one obtains an analytic expression, i.e. \[ \underset{k \times 1}{\mathbf{b}^{\tiny\text{OLS}}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \text{.} \tag{15.3}\] Equivalently, it is possible to express Equation 15.3 in terms of the covariance matrix of the \(\mathbf{X}\) and \(\mathbf{y}\), i.e. \[ \mathbf{b}^{\tiny\text{OLS}} = \frac{\mathbb{C}v\{\mathbf{X}, \mathbf{Y}\}}{\mathbb{V}\{\mathbf{X}\}} \text{.} \tag{15.4}\]

Proof. Developing the product of the residuals in Equation 15.2: \[ \begin{aligned} \text{Q}^{\tiny\text{OLS}}(\mathbf{b}) {} & = \hat{\mathbf{u}}(\mathbf{b})^{\top} \hat{\mathbf{u}}(\mathbf{b}) = \\ & = (\mathbf{y} - \mathbf{X} \mathbf{b})^{\top} (\mathbf{y} - \mathbf{X} \mathbf{b}) = \\ & = \mathbf{y}^{\top} \mathbf{y} - \mathbf{b}^{\top} \mathbf{X}^{\top} \mathbf{y} - \mathbf{y}^{\top} \mathbf{X} \mathbf{b} + \mathbf{b}^{\top} \mathbf{X}^{\top} \mathbf{X} \mathbf{b} = \\ & = \mathbf{y}^{\top} \mathbf{y} - 2 \mathbf{b}^{\top} \mathbf{X}^{\top} \mathbf{y} + \mathbf{b}^{\top} \mathbf{X}^{\top} \mathbf{X} \mathbf{b} \end{aligned} \tag{15.5}\] To find the minimum, let’s compute the derivative of \(\text{Q}^{\tiny\text{OLS}}\) with respect to \(\mathbf{b}\), set it equal to zero and solve for \(\mathbf{b} = \mathbf{b}^{\tiny\text{OLS}}\), i.e. \[ \begin{aligned} & {} \partial_{\mathbf{b}}\text{Q}^{\tiny\text{OLS}}(\mathbf{b}) = - 2 \mathbf{X}^{\top} \mathbf{y} + 2 \mathbf{X}^{\top} \mathbf{X} \mathbf{b} = \mathbf{0} \\ & \implies \mathbf{X}^{\top} \mathbf{y} = \mathbf{X}^{\top} \mathbf{X} \mathbf{b} \\ & \implies \mathbf{b}^{\tiny\text{OLS}} = (\mathbf{X}^{\top} \mathbf{X})^{-1}\mathbf{X}^{\top} \mathbf{y} \end{aligned} \] To establish if the above solution corresponds also to a global minimum, one must check the sign of the second derivative, i.e. \[ \partial_{\mathbf{b}}^2 Q^{\tiny\text{OLS}}(\mathbf{b}) = 2 \mathbf{X}^{\top} \mathbf{X} > 0 \text{,} \] that, in this case, is always positive definite and denotes a global minimum. An alternative derivation of this estimator, as in Equation 15.4, is obtained by estimating the variance-covariance matrix \(\mathbb{C}v\{\mathbf{X}, \mathbf{Y}\}\) as in Equation 13.4 and the variance matrix \(\mathbb{V}\{\mathbf{X}\}\) as in Equation 13.5.

OLS estimator of beta
b <- solve(t(X) %*% X) %*% t(X) %*% y
Singularity of \(\mathbf{X}\)

Note that the solution is available if and only if \(\mathbf{X}\) is non-singular. Hence, the columns should not be linearly dependent. In fact, if one of the \(k\) variables can be written as a linear combination of the others, then the determinant of the matrix \(\mathbf{X}^{\top} \mathbf{X}\) is zero and the inversion is not possible. Moreover, to have \(\text{rank}(\mathbf{X}^{\top}\mathbf{X}) = k\), the number of observations must be greater than or equal to the number of regressors, i.e. \(n \ge k\).

Intercept estimate

If a column of ones was included in the data matrix \(\mathbf{X}\), then the intercept parameter is obtained from Equation 15.3 or Equation 15.4. However, if it was not included, it is computed as: \[ \alpha^{\tiny\text{OLS}} = \mathbb{E}\{\mathbf{Y}- \mathbf{X}\mathbf{b}^{\tiny\text{OLS}}\} \text{.} \]

OLS estimator of the intercept
# Vector of ones
J_1n <- matrix(1, nrow = 1, ncol = n)
# Vector of means
x_bar <- t((J_1n %*% X)/n)
# OLS estimator
a <- mean(y - X %*% b)

15.2.1 Projection matrices

Substituting the OLS solution (Equation 15.3) in Equation 14.11, we obtain the matrix \(\mathbf{P}_{\mathbf{x}}\), which projects the vector \(\mathbf{y}\) on the subspace of \(\mathbb{R}^{n}\) generated by the matrix of the regressors \(\mathbf{X}\), i.e. \[ \mathbf{P}_{\mathbf{x}} = \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \text{.} \tag{15.6}\]

Projection matrix \(\mathbf{P}_{\mathbf{x}}\)
P_x <- X %*% solve(t(X) %*% X) %*% t(X)

Proposition 15.2 (Properties Matrix P_x) The projection matrix \(\mathbf{P}_{\mathbf{x}}\) satisfies the following three properties, i.e.

  1. \(\mathbf{P}_{\mathbf{x}}\) is an \(n \times n\) symmetric matrix.
  2. \(\mathbf{P}_{\mathbf{x}} \ \mathbf{P}_{\mathbf{x}} = \mathbf{P}_{\mathbf{x}}\) is idempotent.
  3. \(\mathbf{P}_{\mathbf{x}} \ \mathbf{X} = \mathbf{X}\).
Properties of matrix \(\mathbf{M}_{\mathbf{x}}\)
# Property 2.
# sum((P_x %*% P_x - P_x)^2) # close to zero
# Property 3.
# sum((P_x %*% X - X)^2) # close to zero

Proof. Let’s consider the property 2. of \(\mathbf{P}_{\mathbf{x}}\), i.e. \[ \begin{aligned} \mathbf{P}_{\mathbf{x}} \ \mathbf{P}_{\mathbf{x}} {} & = \left( \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right) \left(\mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}\right) = \\ & = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} = \\ & = \mathbf{P}_{\mathbf{x}} \end{aligned} \] Let’s consider the property 3. of \(\mathbf{P}_{\mathbf{x}}\), i.e. \[ \mathbf{P}_{\mathbf{x}} \ \mathbf{X} = \left[\mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right] \mathbf{X} = \mathbf{X} \text{.} \]

Substituting the OLS solution (Equation 15.3) in the residuals (Equation 14.14), we obtain another projection matrix \(\mathbf{M}_{\mathbf{x}}\), which projects the vector \(\mathbf{y}\) on the orthogonal subspace with respect to the subspace generated by the matrix of the regressors \(\mathbf{X}\), i.e. \[ \mathbf{M}_{\mathbf{x}} = \textbf{I}_n - \mathbf{P}_{\mathbf{x}} \text{.} \tag{15.7}\] where \(\textbf{I}_n\) is the identity matrix (Equation 31.3).

Projection matrix \(\mathbf{M}_{\mathbf{x}}\)
M_x <- diag(1, n, n) - P_x

Proposition 15.3 (Properties Matrix M_x) The projection matrix \(\mathbf{M}_{\mathbf{x}}\) satisfies the following 3 properties, i.e.

  1. \(\mathbf{M}_{\mathbf{x}}\) is \(n \times n\) and symmetric.
  2. \(\mathbf{M}_{\mathbf{x}} \ \mathbf{M}_{\mathbf{x}} = \mathbf{M}_{\mathbf{x}}\) is idempotent.
  3. \(\mathbf{M}_{\mathbf{x}} \ \mathbf{X} = \mathbf{0}\).
Properties of matrix \(\mathbf{M}_{\mathbf{x}}\)
# Property 2.
# sum((M_x %*% M_x - M_x)^2) # close to zero
# Property 3.
# sum((M_x %*% X)^2) # close to zero

Proof. Let’s consider the property 2. of \(\mathbf{M}_{\mathbf{x}}\), i.e. \[ \begin{aligned} \mathbf{M}_{\mathbf{x}} \mathbf{M}_{\mathbf{x}} {} & = \left(\textbf{I}_n - \mathbf{P}_{\mathbf{x}} \right)\left(\textbf{I}_n - \mathbf{P}_{\mathbf{x}} \right) = \\ & = \textbf{I}_n - \mathbf{P}_{\mathbf{x}} = \\ & = \mathbf{M}_{\mathbf{x}} \end{aligned} \] Let’s consider the property 3. of \(\mathbf{M}_{\mathbf{x}}\), i.e. \[ \begin{aligned} \mathbf{M}_{\mathbf{x}} \mathbf{X} {} & = (\textbf{I}_n - \mathbf{P}_{\mathbf{x}}) \mathbf{X} = \\ & = \left(\textbf{I}_n - \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right) \mathbf{X} = \\ & = \mathbf{X} - \mathbf{X} = \mathbf{0} \end{aligned} \]

Remark 15.1. By definition, \(\mathbf{M}_{\mathbf{x}}\) and \(\mathbf{P}_{\mathbf{x}}\) are orthogonal, i.e. \(\mathbf{P}_{\mathbf{x}} \ \mathbf{M}_{\mathbf{x}} = \mathbf{0}\). Hence, the fitted values defined as \(\hat{\mathbf{y}} =\mathbf{P}_{\mathbf{x}}\mathbf{y}\) are the projection of the empirical values on the subspace generated by \(\mathbf{X}\). Symmetrically, the fitted residuals \(\hat{\mathbf{u}} =\mathbf{M}_{\mathbf{x}}\mathbf{y}\) are the projection of the empirical values on the subspace orthogonal to the subspace generated by \(\mathbf{X}\).

Proof. Let’s prove the orthogonality between \(\mathbf{M}_{\mathbf{x}}\) and \(\mathbf{P}_{\mathbf{x}}\), i.e. \[ \mathbf{P}_{\mathbf{x}} \mathbf{M}_{\mathbf{x}} = \mathbf{P}_{\mathbf{x}} \left(\textbf{I}_n - \mathbf{P}_{\mathbf{x}} \right) = \mathbf{P}_{\mathbf{x}} - \mathbf{P}_{\mathbf{x}} = \mathbf{0} \text{.} \]

Orthogonality of matrix \(\mathbf{P}_{\mathbf{x}}\) and \(\mathbf{M}_{\mathbf{x}}\)
# sum(M_x %*% P_x) # close to zero

15.2.2 Properties OLS

Theorem 15.1 (Gauss-Markov theorem) Under the Gauss-Markov hypotheses, the Ordinary Least Squares (OLS) estimate is \({\color{blue}{\textbf{BLUE}}}\) (Best Linear Unbiased Estimator), where “best” stands for the estimator with minimum variance in the class of linear unbiased estimators of the unknown true population parameter \(\mathbf{b}\). More precisely, the Gauss-Markov hypotheses are:

  1. \(\mathbf{y} = \mathbf{X} \mathbf{b} + \mathbf{u}\).
  2. \(\mathbb{E}\{\mathbf{u}\} = 0\).
  3. \(\mathbb{E}\{\mathbf{u} \mathbf{u}^{\top}\} = \sigma_{\text{u}}^{2} \textbf{I}_{n}\), i.e. homoskedasticity.
  4. \(\mathbf{X}\) is non-stochastic and independent from the errors for all \(n\)’s.

Proposition 15.4 (Properties OLS estimator)  

  1. Unbiased: \(\mathbf{b}^{\tiny\text{OLS}}\) is correct and its conditional expectation is equal to the true parameter in the population, i.e. \[ \mathbb{E}\{\mathbf{b}^{\tiny\text{OLS}} \mid \mathbf{X}\} = \mathbf{b} \text{.} \tag{15.8}\]
  2. Linear in the sense that it can be written as a linear combination of \(\mathbf{y}\) and \(\mathbf{X}\), i.e. \(\mathbf{b}^{\tiny\text{OLS}} = \mathbf{A}_{\text{x}} \mathbf{y}\), where \(\mathbf{A}_{\text{x}}\) does not depend on \(\mathbf{y}\), i.e. \[ \mathbf{b}^{\tiny\text{OLS}} = \mathbf{A}_{\text{x}} \mathbf{y}, \quad \mathbf{A}_{\text{x}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \text{.} \tag{15.9}\]
  3. Under the Gauss-Markov hypotheses (Theorem 15.1), \(\mathbf{b}^{\tiny\text{OLS}}\) is the estimator that has the minimum variance in the class of the unbiased linear estimators of \(\mathbf{b}\) and its variance reads: \[ \mathbb{V}\{\mathbf{b}^{\tiny\text{OLS}} \mid \mathbf{X}\} = \sigma_{\text{u}}^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \text{.} \tag{15.10}\] Denoting with \(c_{jj}\) the \(j\)-th element on the diagonal of \((\mathbf{X}^{\top} \mathbf{X})^{-1}\), the variance of the \(j\)-th regressor reads \[ \mathbb{V}\{b_j^{\tiny\text{OLS}} \mid \mathbf{X}\} = \sigma_{\text{u}}^2 \cdot (\mathbf{X}^{\top} \mathbf{X})^{-1}_{[j,j]} \text{.} \tag{15.11}\] where \((\mathbf{X}^{\top} \mathbf{X})^{-1}_{[j,j]}\) denotes the element on the diagonal in the position \([j,j]\).
Matrix \((\mathbf{X}^{\top} \mathbf{X})^{-1}\)
tXX <- solve(t(X) %*% X)

Proof.

  1. The OLS estimator is correct: its expected value is computed from Equation 15.3 and, substituting Equation 14.11, is equal to the true parameter in the population, i.e. \[ \begin{aligned} \mathbb{E}\{\mathbf{b}^{\tiny\text{OLS}} \mid \mathbf{X}\} {} & = \mathbb{E}\{(\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \mid \mathbf{X} \} = \\ & = \mathbb{E}\{(\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} (\mathbf{X} \mathbf{b} + \mathbf{u}) \mid \mathbf{X} \} = \\ & = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{X} \mathbf{b} + (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbb{E}\{\mathbf{u} \mid \mathbf{X}\} = \\ & = \mathbf{b} \end{aligned} \]

  2. In general, applying the properties of the variance operator, the variance of \(\mathbf{b}^{\tiny\text{OLS}}\) is computed as: \[ \begin{aligned} \mathbb{V}\{\mathbf{b}^{\tiny\text{OLS}} \mid \mathbf{X}\} {} & = \mathbb{V}\{(\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \mid \mathbf{X}\} = \\ & = \mathbb{V}\{(\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} (\mathbf{X}\mathbf{b} + \mathbf{u}) \mid \mathbf{X}\} = \\ & = \mathbb{V}\{(\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{X}\mathbf{b} + (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}\mathbf{u} \mid \mathbf{X}\} = \\ & = \mathbb{V}\{\mathbf{b} + (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}\mathbf{u} \mid \mathbf{X} \} = \\ & = \mathbb{V}\{(\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}\mathbf{u} \mid \mathbf{X} \} \end{aligned} \] Then, since \(\mathbf{X}\) is non-stochastic one can bring it outside the variance thus obtaining: \[ \begin{aligned} \mathbb{V}\{\mathbf{b}^{\tiny\text{OLS}} \mid \mathbf{X}\} {} & = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbb{V}\{\mathbf{u} \mid \mathbf{X} \} \, \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} = \\ & = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbb{E}\{\mathbf{u} \mathbf{u}^{\top} \mid \mathbf{X} \} \, \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \end{aligned} \tag{15.12}\] Under the Gauss-Markov hypotheses (Theorem 15.1), the conditional variance is \(\mathbb{V}\{\mathbf{u}\mid \mathbf{X} \} = \sigma^2_{\text{u}} \cdot \textbf{I}_n\) and therefore Equation 15.12 reduces to: \[ \begin{aligned} \mathbb{V}\{\mathbf{b}^{\tiny\text{OLS}} \mid \mathbf{X}\} {} & = \sigma_{\text{u}}^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} = \\ & = \sigma_{\text{u}}^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \end{aligned} \]

15.3 Variance decomposition

In a linear model, the deviance (or total variance) of the dependent variable \(\mathbf{y}\) can be decomposed into the sum of the regression variance and the dispersion variance. This decomposition helps us understand how much of the total variability in the data is explained by the model and how much is due to unexplained variability (residuals).

  • Total Deviance (\(\mathbb{D}ev\{\mathbf{y}\}\)): represents the total variability of the dependent variable \(\mathbf{y}\). It is calculated as the sum of the squared difference of \(y_i\) from its mean \(\bar{y}\).

  • Regression Deviance (\(\mathbb{D}ev\mathbb{R}eg\{\mathbf{y}\}\)): represents the portion of variability that is explained by the regression model. It is computed as the sum of the squared differences between the fitted values \(\hat{y}_i\) and \(\bar{y}\).

  • Dispersion Deviance (\(\mathbb{D}ev\mathbb{D}isp\{\mathbf{y}\}\)): represents the portion of variability that is not explained by the model. It is computed as the sum of the squared differences between the observed values \(y_i\) and the fitted values \(\hat{y}_i\) (Equation 14.13).

Hence, the total deviance of \(\mathbf{y}\) can be decomposed as follows: \[ \begin{aligned} \mathbb{D}ev\{\mathbf{y}\} \quad\quad {} & = \quad\ \mathbb{D}ev\mathbb{R}eg\{\mathbf{y}\} \quad\quad + \quad \mathbb{D}ev\mathbb{D}isp\{\mathbf{y}\} \\ \sum_{i = 1}^{n}(y_i - \bar{y})^2 \quad & = \quad\ \sum_{i = 1}^{n}(\hat{y}_i - \bar{y})^2 \quad\;\, + \quad \sum_{i = 1}^{n}(\hat{y}_i - y_i)^2 \\ \mathbf{y}^{\top} \mathbf{y} - n \bar{y}^2 \quad\ & = \; \mathbf{b}^{\top} \mathbf{X}^{\top} \mathbf{X}\mathbf{b} - n \bar{y}^2 \quad\, + \quad \mathbf{u}^{\top} \mathbf{u} \end{aligned} \tag{15.13}\] where \(\bar{y}\) is the sample mean (Equation 10.1).

Variance decomposition
y_bar <- mean(y)
# Fitted values
y_hat <- a + X %*% b
# Residuals
u <- y - y_hat
# Deviance of y
dev_y <- sum(y^2) - n*y_bar^2 # 6.642875
# Deviance of regression
dev_reg <- sum((y_hat - y_bar)^2) # 6.162379
# Deviance of dispersion
dev_disp <- sum(u^2) # 0.4804956
# equal to zero
# dev_y - (dev_reg + dev_disp)

Proof. Let’s prove the expression for the regression deviance \(\mathbb{D}ev\mathbb{R}eg\{\mathbf{y}\}\), i.e. \[ \begin{aligned} \mathbb{D}ev\mathbb{R}eg\{\mathbf{y}\} {} & = \mathbb{D}ev\{\mathbf{y}\} - \mathbb{D}ev\mathbb{D}isp\{\mathbf{y}\} = \\ & = \mathbf{y}^{\top} \mathbf{y} - n \bar{y}^2 - \mathbf{u}^{\top} \mathbf{u} = \\ & = \mathbf{y}^{\top} \mathbf{y} - n \bar{y}^2 - \left(\mathbf{y} - \mathbf{X}\mathbf{b}\right)^{\top} (\mathbf{y} - \mathbf{X}\mathbf{b}) = \\ & = \mathbf{y}^{\top} \mathbf{y} - n \bar{y}^2 - \mathbf{y}^{\top}\mathbf{y} + \mathbf{y}^{\top}\mathbf{X}\mathbf{b} + \mathbf{b}^{\top} \mathbf{X}^{\top}\mathbf{y} - \mathbf{b}^{\top} \mathbf{X}^{\top} \mathbf{X}\mathbf{b} = \\ & = - n \bar{y}^2 + 2 \mathbf{b}^{\top} \mathbf{X}^{\top}\mathbf{y} - \mathbf{b}^{\top} \mathbf{X}^{\top} \mathbf{X}\mathbf{b} = \\ & = \mathbf{b}^{\top} \mathbf{X}^{\top} \mathbf{X}\mathbf{b} - n \bar{y}^2 \end{aligned} \]

The decomposition of the deviance of \(\mathbf{y}\) also holds with respect to the corresponding degrees of freedom.

Deviance Degrees of freedom Variance
\(\mathbb{D}ev\{\mathbf{y}\} = \sum_{i = 1}^{n}(y_i - \bar{y})^2\) \(n-1\) \(\hat{s}^2_{\text{y}} = \frac{\mathbb{D}ev\{\mathbf{y}\}}{n-1}\)
\(\mathbb{D}ev\mathbb{R}eg\{\mathbf{y}\} = \sum_{i = 1}^{n}(\hat{y}_i - \bar{y})^2\) \(k-1\) \(\hat{s}^2_{\text{r}} = \frac{\mathbb{D}ev\mathbb{R}eg\{\mathbf{y}\}}{k-1}\)
\(\mathbb{D}ev\mathbb{D}isp\{\mathbf{y}\} = \sum_{i = 1}^{n}(\hat{y}_i - y_i)^2\) \(n-k\) \(\hat{s}^2_{\text{u}} = \frac{\mathbb{D}ev\mathbb{D}isp\{\mathbf{y}\}}{n-k}\)
Table 15.1: Deviance and variance decomposition in a multivariate linear model

15.3.1 Estimator of \(\sigma_{\text{u}}^2\)

The OLS estimator does not depend on the variance of the residuals \(\sigma_{\text{u}}^2\) and it is not possible to obtain both estimators in one step.

Proposition 15.5 (Unbiased estimator of sigma_u^2) Let’s define an unbiased estimator of the population variance \(\sigma_{\text{u}}^2\) as: \[ \hat{s}_{\text{u}}^2 = \frac{\hat{\mathbf{u}}(\mathbf{b}^{\tiny \text{OLS}})^{\top} \hat{\mathbf{u}}(\mathbf{b}^{\tiny \text{OLS}})}{n-k} \text{.} \] In the notation used here, \(k\) is the number of columns of \(\mathbf{X}\), including the intercept when it is present. Therefore, the residual degrees of freedom are \(n-k\). If \(k\) is used to count only the non-constant regressors, the same denominator is written as \(n-k-1\). In general, the regression variance overestimates the true variance \(\sigma_{\text{u}}^2\), i.e. \[ \mathbb{E}\{\hat{s}_{\text{r}}^2\} = \sigma_{\text{u}}^2 + g(\mathbf{b}, \mathbf{X}), \quad g(\mathbf{b}, \mathbf{X}) \ge 0 \text{.} \] Only in the special case where \(b_2 = \dots = b_k = 0\) in population, then \(g(\mathbf{b}, \mathbf{X}) = 0\) and also the regression variance produces a correct estimate of \(\sigma_{\text{u}}^2\).

Estimator of \(\sigma^2_{\text{u}}\)
s2_u <- dev_disp / (n - k) # 0.0800826

Proof. By definition, the residuals can be computed by pre-multiplying \(\mathbf{y}\) by the matrix \(\mathbf{M}_{\mathbf{x}}\) (Equation 15.7), i.e. \[ \begin{aligned} \hat{\mathbf{u}}(\mathbf{b}^{\tiny \text{OLS}}) & {} = \mathbf{y} - \hat{\mathbf{y}}(\mathbf{b}^{\tiny \text{OLS}}) = \\ & = \mathbf{y} - \mathbf{X} \mathbf{b}^{\tiny \text{OLS}} = \\ & = \mathbf{y} - \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1}\mathbf{X}^{\top} \mathbf{y} = \\ & = (\textbf{I}_n - \mathbf{P}_{\mathbf{x}}) \mathbf{y} = \\ & = \mathbf{M}_{\mathbf{x}} \mathbf{y} \end{aligned} \] Substituting the true relation in the population, i.e. \(\mathbf{y} = \mathbf{X} \mathbf{b} + \mathbf{u}\), one obtains \[ \begin{aligned} \hat{\mathbf{u}}(\mathbf{b}^{\tiny \text{OLS}}) & {} = \mathbf{M}_{\mathbf{x}} (\mathbf{X} \mathbf{b} + \mathbf{u}) = \\ & = \mathbf{M}_{\mathbf{x}}\mathbf{X} \mathbf{b} + \mathbf{M}_{\mathbf{x}}\mathbf{u} = \\ & = \mathbf{M}_{\mathbf{x}}\mathbf{u} \end{aligned} \] since \(\mathbf{M}_{\mathbf{x}}\mathbf{X} = 0\). Being the matrix \(\mathbf{M}_{\mathbf{x}}\) symmetric and idempotent (Proposition 15.3): \[ \begin{aligned} \hat{\mathbf{u}}(\mathbf{b}^{\tiny \text{OLS}})^{\top} \hat{\mathbf{u}}(\mathbf{b}^{\tiny \text{OLS}}) & {} = (\mathbf{M}_{\mathbf{x}}\mathbf{u})^{\top}(\mathbf{M}_{\mathbf{x}}\mathbf{u}) = \\ & = \mathbf{u}^{\top} \mathbf{M}_{\mathbf{x}}^{\top} \mathbf{M}_{\mathbf{x}} \mathbf{u} = \\ & = \mathbf{u}^{\top} \mathbf{M}_{\mathbf{x}} \mathbf{u} \end{aligned} \] Thus, since \(\mathbf{u}^{\top} \mathbf{M}_{\mathbf{x}} \mathbf{u}\) is a scalar, the expected value of the deviance of dispersion reads \[ \begin{aligned} \mathbb{E}\{\hat{\mathbf{u}}(\mathbf{b}^{\tiny \text{OLS}})^{\top} \hat{\mathbf{u}}(\mathbf{b}^{\tiny \text{OLS}})\} & {} = \mathbb{E}\{\mathbf{u}^{\top} \mathbf{M}_{\mathbf{x}} \mathbf{u} \} = \\ & = \mathbb{E}\{\text{trace}(\mathbf{u}^{\top} \mathbf{M}_{\mathbf{x}} \mathbf{u}) \} = \\ & = \mathbb{E}\{\text{trace}(\mathbf{M}_{\mathbf{x}} \mathbf{u} \mathbf{u}^{\top} ) \} = \\ & = \text{trace}(\mathbf{M}_{\mathbf{x}} \mathbb{E}\{\mathbf{u} \mathbf{u}^{\top}\}) = \\ & = \text{trace}(\mathbf{M}_{\mathbf{x}} \sigma_{\text{u}}^2\textbf{I}_n) = \\ & = \sigma_{\text{u}}^2 \cdot \text{trace}(\mathbf{M}_{\mathbf{x}} \mathbf{I}_n) = \\ & = \sigma_{\text{u}}^2 \cdot \text{trace}(\mathbf{M}_{\mathbf{x}}) \end{aligned} \] The trace (Equation 31.8) of the matrix \(\mathbf{M}_{\mathbf{x}}\) reads \[ \begin{aligned} \text{trace}(\mathbf{M}_{\mathbf{x}}) & {} = \text{trace}(\textbf{I}_n - \mathbf{P}_{\mathbf{x}}) = \\ & = \text{trace}(\textbf{I}_n) - \text{trace}(\mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}) = \\ & = \text{trace}(\textbf{I}_n) - \text{trace}( \mathbf{X}^{\top} \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1}) = \\ & = \text{trace}(\textbf{I}_n) - \text{trace}(\textbf{I}_{k}) = \\ & = n - k \end{aligned} \] where \(k\) counts all columns of \(\mathbf{X}\), including the intercept when present. Hence, \[ \mathbb{E}\{\hat{\mathbf{u}}(\mathbf{b}^{\tiny \text{OLS}})^{\top} \hat{\mathbf{u}}(\mathbf{b}^{\tiny \text{OLS}})\} = \sigma_{\text{u}}^2 \cdot (n-k) \text{.} \] Equivalently, the expectation of the deviance of dispersion is equal to \[ \mathbb{E}\{\mathbb{D}ev\mathbb{D}isp\{\mathbf{y}\}\} = \sigma_{\text{u}}^2 \cdot (n-k) \text{.} \]

15.3.2 \(\text{R}^2\)

The \(\text{R}^2\) statistic, also known as the coefficient of determination, is a measure used to assess the goodness of fit of a regression model. In a multivariate context, it evaluates how well the independent variables explain the variability of the dependent variable.

Definition 15.1 (Multivariate R^2) The \(R^2\) represents the proportion of the variation in the dependent variable that is explained or predicted by the independent variables. Formally, it is defined as the ratio of the deviance explained by the model (\(\mathbb{D}ev\mathbb{R}eg\{\mathbf{y}\}\)) to the total deviance (\(\mathbb{D}ev\{\mathbf{y}\}\)). It can also be expressed as one minus the ratio of the residual deviance (\(\mathbb{D}ev\mathbb{D}isp\{\mathbf{y}\}\)) to the total deviance, i.e. \[ \text{R}^2 = \frac{\mathbb{D}ev\mathbb{R}eg\{\mathbf{y}\}}{\mathbb{D}ev\{\mathbf{y}\}} = 1- \frac{\mathbb{D}ev\mathbb{D}isp\{\mathbf{y}\}}{\mathbb{D}ev\{\mathbf{y}\}} \text{.} \tag{15.14}\] Using the variance decomposition (Equation 15.13), it is possible to write a multivariate version of the \(\text{R}^2\) as: \[ \text{R}^2 = \frac{(\mathbf{b}^{\tiny \text{OLS}})^{\top} \mathbf{X}^{\top} \mathbf{X}\mathbf{b}^{\tiny \text{OLS}} - n \bar{y}^2}{\mathbf{y}^{\top} \mathbf{y} - n \bar{y}^2} = 1 - \frac{\hat{\mathbf{u}}(\mathbf{b}^{\tiny \text{OLS}})^{\top} \hat{\mathbf{u}}(\mathbf{b}^{\tiny \text{OLS}})}{\mathbf{y}^{\top} \mathbf{y} - n \bar{y}^2} \text{.} \tag{15.15}\]

\(\text{R}^2\)
R2 <- 1 - dev_disp  / dev_y # 0.9276675

The numerator represents the variance explained by the regression model, while the denominator the total variance in the dependent variable. The term numerator in the second expression represents the variance of the residuals, or the variance not explained by the model. A value of the \(\text{R}^2\) close to 1 denotes that a large proportion of the variability of the dependent variable has been explained by the regression model, while a value close to 0 indicates that the model explains very little of the variability.

Variance Inflation Factor (VIF)

An alternative expression for the variance of the \(j\)-th regressor (Equation 15.11) reads \[ \mathbb{V}\{b_j^{\tiny \text{OLS}}\} = \frac{\sigma_{\text{u}}^2}{\mathbb{D}ev\{X_j\}} \underset{\text{VIF}_j}{\underbrace{ \frac{1}{1-R^2_{j0}}}} \text{,} \] where \(\mathbb{D}ev\{X_j\}\) is the deviance of the regressor \(j\) and \(R^2_{j0}\) is the multivariate coefficient of determination on the regression of \(X_j\) on the other regressors.

Proposition 15.6 (Adjusted R^2) A more robust indicator that does not always increase with the addition of a new regressor is the adjusted \(\text{R}^2\), which is computed as: \[ \bar{\text{R}}^2 = 1 - \frac{n-1}{n-k} \frac{\mathbb{D}ev\mathbb{D}isp\{\mathbf{y}\}}{\mathbb{D}ev\{\mathbf{y}\}} = 1 - \frac{\hat{s}^2_{\text{u}} }{\hat{s}^2_{\text{y}}} \text{.} \] The \(\bar{\text{R}}^2\) can be negative, and its value will always be less than or equal to that of \(\text{R}^2\). Unlike \(\text{R}^2\), the adjusted version increases only when the new explanatory variable improves the model more than would be expected simply by adding another variable.

Adjusted \(\text{R}^2\)
# R2_bar <- 1 - (n - 1) / (n - k) * (dev_disp / dev_y) # 0.8915013

Proof. To arrive at the formulation of the adjusted \(\text{R}^2\) let’s consider that under the null hypothesis \(\mathcal{H}_0: b_2 = \dots = b_k = 0\), where only the intercept is different from zero, the variance of regression \(\hat{s}_{r}^2\) (Table 15.1) is a correct estimate of the variance of the residuals \(\sigma^2\). Hence, under \(\mathcal{H}_0\): \[ \frac{n-1}{k-1} \mathbb{E}\left\{\frac{\mathbb{D}ev\mathbb{R}eg\{\mathbf{y}\}}{\mathbb{D}ev\{\mathbf{y}\}} \right\} \overset{\sim}{=} 1 \text{.} \] This implies that the expectation of the \(R^2\) is not zero (as it should be under \(\mathcal{H}_0\)) but: \[ \mathbb{E}\{\text{R}^2\} \overset{\sim}{=} \frac{k-1}{n-1} \text{.} \] Let’s rescale the \(\text{R}^2\) such that when \(\mathcal{H}_0\) holds true it is equal to zero, i.e. \[ \text{R}_{c}^2 = \text{R}^2 - \frac{k-1}{n-1} \text{.} \] However, the specification of \(\text{R}_{\text{c}}^2\) implies that when \(\text{R}^2 = 1\) (perfect linear relation between \(\mathbf{X}\) and \(\mathbf{y}\)) the value of \(\text{R}_{\text{c}}^2 < 1\), i.e. \(\text{R}_{\text{c}}^2 = \frac{n-k}{n-1} < 1\). Hence, let’s correct again the indicator such that it takes values in \([0,1]\), i.e. \[ \begin{aligned} \bar{\text{R}}^2 & {} = \left(\text{R}^2 - \frac{k-1}{n-1}\right) \frac{n-1}{n-k} = \\ & = \left(\frac{\text{R}^2(n-1) - (k-1)}{n-1}\right) \frac{n-1}{n-k} = \\ & = \frac{n-1}{n-k}\text{R}^2 - \frac{k-1}{n-k} \end{aligned} \] Remembering that \(\text{R}^2\) can be rewritten as in Equation 15.14, one obtains: \[ \begin{aligned} \bar{\text{R}}^2 & {} = \frac{n-1}{n-k}\left(1-\frac{\mathbb{D}ev\mathbb{D}isp\{\mathbf{y}\}}{\mathbb{D}ev\{\mathbf{y}\}}\right) - \frac{k-1}{n-k} = \\ & = \frac{(n-1)\mathbb{D}ev\{\mathbf{y}\} - (n-1) \mathbb{D}ev\mathbb{D}isp\{\mathbf{y}\}}{\mathbb{D}ev\{\mathbf{y}\}(n-k)} - \frac{k-1}{n-k} = \\ & = \frac{n-1}{n-k} - \frac{n-1}{n-k} \frac{\mathbb{D}ev\mathbb{D}isp\{\mathbf{y}\}}{\mathbb{D}ev\{\mathbf{y}\}} - \frac{k-1}{n-k} = \\ & = 1 - \frac{n-1}{n-k} \frac{\mathbb{D}ev\mathbb{D}isp\{\mathbf{y}\}}{\mathbb{D}ev\{\mathbf{y}\}} = \\ & = 1 - \frac{\hat{s}^2_{\text{u}} }{\hat{s}^2_{\text{y}}} \end{aligned} \]

Limitations of \(\text{R}^2\)

The \(\text{R}^2\) statistic has some limitations. Firstly, it can be close to 1 even if the relationship between the variables is not linear. Additionally, \(\text{R}^2\) increases whenever a new regressor is added to the model, making it unsuitable for comparing models with different numbers of regressors.

15.4 Diagnostic

Let’s consider a linear model where the residuals \(\mathbf{u}\) are IID normally distributed random variables. Hence, the working hypothesis of the Gauss Markov theorem holds true.

15.4.1 \(\text{t}\)-test for \(b_j\)

A t-test evaluates if a parameter in a regression is statistically different from zero, given the effect of the other \(k-1\) regressors. The test is built under the null hypothesis of linear independence in the population between \(\mathbf{y}\) and \(\mathbf{X}_j\), i.e. \[ \mathcal{H}_0: b_j = 0 \quad \mathcal{H}_1: b_j \neq 0 \text{.} \] If the residuals are normally distributed, then the vector of parameters \(\hat{\mathbf{b}}\) is distributed as a multivariate normal, thus also the marginal distribution of each \(\hat{b}_j\) will be normal.

Using the expectation (Equation 15.8) and variance (Equation 15.10) of \(b^{\tiny \text{OLS}}_j\), we can standardize the estimated parameter \(b^{\tiny \text{OLS}}_j\) to obtain a Student-t distributed statistic (Equation 32.2), i.e. \[ \text{t}_j = \frac{b^{\tiny \text{OLS}}_j - \mathbb{E}\{b^{\tiny \text{OLS}}_j\}}{\sqrt{\mathbb{V}\{b^{\tiny \text{OLS}}_j\}}} = \frac{b^{\tiny \text{OLS}}_j - b_j}{\sqrt{\hat{s}_\text{u}^2 \cdot c_{jj}}} \sim t_{n-k} \text{,} \tag{15.16}\] where the unknown \(\sigma_\text{u}^2\) is replaced with its correct estimator \(\hat{s}_\text{u}^2\). Under \(\mathcal{H}_0: b_j = 0\), one obtains \[ \text{t}_j \underset{\mathcal{H}_0}{=} \frac{b^{\tiny \text{OLS}}_j}{\sqrt{\hat{s}_\text{u}^2 \cdot c_{jj}}} \sim t_{n-k} \text{.} \tag{15.17}\] Given a significance level \(\alpha\), the test is rejected if the test statistic falls in the rejection area, i.e. \[ \mathcal{H}_0 \text{ is rejected} \iff \text{t}_j < q_{\alpha /2} \quad\text{OR}\quad \text{t}_j > q_{1-\alpha /2} \text{,} \] where \(q\) is the quantile function of a Student-t with \(\nu = n-k\) degrees of freedom.

t-tests
# variance of the b_ols
v_b_ols <- diag(tXX) * s2_u
# t-statistic
t_stat <- b / sqrt(v_b_ols)
# p-values
p.val <- (1-pt(abs(t_stat), n-k))*2
# [t1]  2.480090  --> p.value 0.0477  (4.77 %)
# [t2]  2.867638  --> p.value 0.0285  (2.85 %)
# [t3] -7.696738  --> p.value 0.00025 (0.025 %)

15.4.2 Confidence intervals for \(b\)

Under the assumption of normality, from Equation 15.16, one can build a confidence interval for \(b_j\), i.e. \[ b_j \in \left[ b_j^{\tiny \text{OLS}} + q_{\alpha/2} \sqrt{\mathbb{V}\{b_j^{\tiny \text{OLS}}\}}, b_j^{\tiny \text{OLS}} + q_{1-\alpha/2} \sqrt{\mathbb{V}\{b_j^{\tiny \text{OLS}}\}} \right] \text{.} \] where \(\alpha\) is the significance level, \(q_{\alpha}\) is the quantile at level \(\alpha\) of a Student-t distribution with \(n-k\) degrees of freedom, and \(\sqrt{\mathbb{V}\{b_j^{\tiny \text{OLS}}\}}\) reads as in Equation 15.11.

Confidence intervals
conf.int <- cbind(b + qt(0.05, n-k) * sqrt(v_b_ols),
                  b + qt(0.95, n-k) * sqrt(v_b_ols))
# With 90% probability the true b is inside the bounds
# [b1]  0.05049223  < b <  0.4159749
# [b2]  0.11779346  < b <  0.6129894
# [b3] -0.62083599  < b < -0.3705442

15.4.3 F-test for the regression

The \(\text{F}_{\text{test}}\) evaluates the significance of the entire regression model by testing the null hypothesis of linear independence between \(\mathbf{y}\) and the non-constant regressors in \(\mathbf{X}\), i.e. \[ \mathcal{H}_0: b_2 = \dots = b_{k} = 0 \quad \mathcal{H}_1: \text{at least one } b_j \neq 0 \text{,} \] where the only coefficient different from zero is the intercept. In this case, the test statistic reads \[ \text{F}_{\text{test}} = \frac{\hat{s}^2_{\text{r}}}{\hat{s}^2_{\text{u}}} = \frac{\mathbb{D}ev\mathbb{R}eg(\mathbf{y})\cdot (n-k)}{(k-1) \cdot \mathbb{D}ev\mathbb{D}isp(\mathbf{y})} \sim \text{F}_{k-1, n-k} \text{,} \tag{15.18}\] that is distributed as a Fisher-Snedecor random variable (Equation 32.3) with \(\nu_1 = k-1\) and \(\nu_2 = n-k\) degrees of freedom. \(\hat{s}^2_{r}\) is the regression variance and \(\hat{s}^2_{\text{u}}\) is the dispersion variance. By fixing a significance level \(\alpha\), the null hypothesis \(\mathcal{H}_0\) is rejected if \(\text{F}_{\text{test}} > q_{1-\alpha}\), where \(q_{1-\alpha}\) is the corresponding \(\text{F}_{k-1,n-k}\) quantile. Remembering the relation between the deviance and the \(\text{R}^2\), i.e. \(\mathbb{D}ev\mathbb{R}eg(\mathbf{y}) = \text{R}^2 \mathbb{D}ev(\mathbf{y})\) and \(\mathbb{D}ev\mathbb{D}isp(\mathbf{y}) = (1-\text{R}^2) \mathbb{D}ev(\mathbf{y})\), it is possible to express the \(\text{F}\)-test in terms of the multivariate \(\text{R}^2\) as: \[ \text{F}_{\text{test}} = \frac{\text{R}^2}{1 - \text{R}^2} \frac{n-k}{k-1} \sim \text{F}_{k-1, n-k} \text{.} \]

\(\text{F}_{\text{test}}\)
F_test <- R2/(1-R2) * (n-k)/(k-1)    # 19.23757
F_test <- dev_reg / dev_disp * ((n-k)/(k-1)) # 19.23757
# Critical value
# qf(0.9, k-1, n-k)
# P-value
# 1 - pf(F_test, k-1, n-k) # 0.0008050532 (0.08 %)
Interpretation \(\text{F}\)-test

If the null hypothesis \(\mathcal{H}_0\) is rejected then:

  • The variability of \(Y\) explained by the model is significantly greater than the residual variability.
  • At least one of the \(k\) regressors has a coefficient \(b_k\) that is significantly different from zero in the population.

On the contrary, if \(\mathcal{H}_0\) is not rejected, then the model is not adequate and there is no evidence of a linear relation between \(\mathbf{y}\) and \(\mathbf{X}\).

15.5 Multi-equations OLS

Proposition 15.7 (Multi-equations OLS estimator) Let’s consider a multivariate linear model, i.e. with \(p > 1\) in (Equation 14.3), then the model in matrix notation reads: \[ \underset{\small n \times p}{\mathbf{Y}} = \mathbf{J}_{n,1} \, \underset{\small 1 \times p}{\mathbf{a}^{\top}} + \underset{\small n \times k}{\mathbf{X}} \, \underset{\small k \times p}{\mathbf{b}^{\top}} + \underset{\small n \times p}{\mathbf{U}} \text{,} \] then the OLS estimate of \(\mathbf{b}\) is obtained from Equation 15.4, i.e. \[ \mathbf{b}^{\tiny\text{OLS}} = \mathbb{C}v(\mathbf{Y}, \mathbf{X}) \, \mathbb{C}v( \mathbf{X})^{-1} \text{,} \] and similarly for the intercept \(\mathbf{a}\) \[ \mathbf{a}^{\tiny\text{OLS}} = \mathbb{E}\{\mathbf{Y}\} - \mathbf{b}^{\tiny\text{OLS}} \mathbb{E}\{\mathbf{X}\} \text{.} \] The variance covariance matrix of the residuals is computed as: \[ \boldsymbol{\Sigma} = \mathbb{C}v(\mathbf{u}) = \mathbb{C}v(\mathbf{Y}) - \mathbf{b}^{\tiny\text{OLS}} \, \mathbb{C}v(\mathbf{Y}, \mathbf{X}) \text{.} \]

Example 15.1 Let’s simulate \(n\) observations for the regressors \(\mathbf{X}\) from a multivariate normal distribution with parameters \[ \boldsymbol{\mu}_X = \begin{pmatrix} 0.5 \\ 0.5 \\ 0.5 \end{pmatrix} \quad \boldsymbol{\Sigma}_X = \begin{pmatrix} 0.5 & 0.2 & 0.1\\ 0.2 & 1.2 & 0.1\\ 0.1 & 0.1 & 0.3\\ \end{pmatrix} \text{.} \] Then, to construct two dependent variables we simulate a matrix \(p \times k = 6\) for the parameters \(\mathbf{b}\) from a standard normal, i.e. for \(j = 1, \dots, 6\), \(b_j \sim \mathcal{N}(0, 1)\) and the intercept parameters \(\mathbf{a}\) from a uniform distribution in [0,1], i.e. \[ \underset{\small p \times k}{\mathbf{b}} = \begin{pmatrix} b_{1,1} & b_{1,2} & b_{1,3} \\ b_{2,1} & b_{2,2} & b_{2,3} \\ \end{pmatrix} \quad \underset{\small p \times 1}{\mathbf{a}} = \begin{pmatrix} a_{1} \\ a_{2} \\ \end{pmatrix} \text{.} \] Thus, for \(i = 1, \dots, n\), one obtains a multi-equation model of the form: \[ \begin{cases} Y_{i,1} = \beta_{0, 1} + \beta_{1, 1} X_{i, 1} + \beta_{1, 2} X_{i, 2} + \beta_{1, k} X_{i, 3} + u_{i, 1} \\ Y_{i,2} = \beta_{0, 2} + \beta_{2, 1} X_{i, 1} + \beta_{2, 2} X_{i, 2} + \beta_{2, k} X_{i, 3} + u_{i, 2} \end{cases} \] where \(u_{i,1}\) and \(u_{i,2}\) are simulated from multivariate normal random variables with true covariance matrix equal to: \[ \mathbb{C}v\{\mathbf{u}\} = \begin{pmatrix} 0.55 & 0.3 \\ 0.3 & 0.70 \end{pmatrix} \text{.} \]

Setup
library(dplyr)
######################## Setup ########################
set.seed(1) # random seed
n <- 500    # number of observations
p <- 2      # number of dependent variables
k <- 3      # number of regressors
# True regressor's mean
true_e_x <- matrix(rep(0.5, k), ncol = 1)
# True regressor's covariance matrix
true_cv_x <-  matrix(c(v_z1 = 0.5, cv_12 = 0.2, cv_13 = 0.1,
                       cv_21 = 0.2, v_z2 = 1.2, cv_23 = 0.1,
                       cv_31 = 0.1, cv_32 = 0.1, v_z3 = 0.3),
                     nrow = k, byrow = FALSE)
# True covariance of the residuals
true_cv_e <- matrix(c(0.55, 0.3, 0.3, 0.70), nrow = p)
##########################################################
# Generate a synthetic data set
## Regressors
X <- mvtnorm::rmvnorm(n, true_e_x, true_cv_x)
## Slope (Beta)
true_beta <- rnorm(p*k)
true_beta <- matrix(true_beta, ncol = k, byrow = TRUE)
## Intercept (Alpha)
true_alpha <- runif(p, min = 0, max = 1)
true_alpha <- matrix(true_alpha, ncol = 1)
## Matrix of 1 for matrix multiplication
ones <- matrix(rep(1, n), ncol = 1)
## Fitted response variable
Y <- ones %*% t(true_alpha) + X %*% t(true_beta)
## Simulated error
eps <- mvtnorm::rmvnorm(n, sigma = true_cv_e)
## Perturbed response variable
Y_tilde <- Y + eps
Parameters fit
# True Beta
df_beta_true <- dplyr::as_tibble(true_beta)
colnames(df_beta_true) <- paste0("$\\mathbf{b}_", 1:ncol(df_beta_true), "$")
# Perturbed Beta (fitted)
fit_beta <- cov(Y_tilde, X) %*% solve(cov(X))
df_beta_pert <- dplyr::as_tibble(fit_beta)
colnames(df_beta_pert) <- paste0("$\\mathbf{b}_", 1:ncol(df_beta_pert), "$")
# True Alpha
df_alpha_true <- dplyr::as_tibble(t(true_alpha))
colnames(df_alpha_true) <- paste0("$\\mathbf{a}_", 1:ncol(df_alpha_true), "$")
# Perturbed Alpha (fitted)
## Perturbed mean
e_y <- matrix(apply(Y_tilde, 2, mean), ncol = 1)
e_x <- matrix(apply(X, 2, mean), ncol = 1)
## Estimated Alpha (on perturbed data)
fit_alpha <- e_y - cov(Y_tilde, X) %*% solve(cov(X)) %*% e_x
df_alpha_pert <- dplyr::as_tibble(t(fit_alpha))
colnames(df_alpha_pert) <- paste0("$\\mathbf{a}_", 1:ncol(df_alpha_pert), "$")
Parameter \(\mathbf{b}_1\) \(\mathbf{b}_2\) \(\mathbf{b}_3\)
True 0.8500 -0.9253 0.8936
True -0.9410 0.5390 -0.1820
Fitted 0.8457 -0.8699 0.9396
Fitted -0.9532 0.5518 -0.1804
Table 15.2: Fitted parameters
Parameter \(\mathbf{a}_1\) \(\mathbf{a}_2\)
True 0.8137 0.8068
Fitted 0.7942 0.7423
Table 15.3: Fitted parameters