14 Introduction

Statistical modeling applies statistical methods to real-world data to give empirical content to relationships between variables. It aims to quantify phenomena, develop models and test hypotheses, which makes it crucial for economic research, policy analysis, and decision-making. The aim of statistical modeling is to study the (unknown) mechanism that generates the data, i.e., the Data Generating Process (DGP). A statistical model is a function that approximates the DGP.

14.1 The matrix of data

Let’s consider $n$ realizations defining a sample for $i = 1, 2, \dots, n$ . Suppose we have $p$ dependent variables and $k$ explanatory variables (also known as predictors). The data matrix of the exogenous variables (regressors) $X_{n, k}$ is defined as in Equation 13.1, while the matrix composed of the endogenous (dependent) variables $Y$ reads $\underset{n \times p}{Y} = (\begin{matrix} y_{1, 1} & y_{1, 2} & \dots & y_{1, j} & \dots & y_{1, p} \\ y_{2, 1} & y_{2, 2} & \dots & y_{2, j} & \dots & y_{2, p} \\ ⋮ & ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ y_{n, 1} & y_{n, 2} & \dots & y_{n, j} & \dots & y_{n, p} \end{matrix}) .$ Hence, the complete matrix of data reads $\begin{matrix} (14.1) & \underset{n \times (k + p)}{W} = (\begin{matrix} Y & X \end{matrix}) = (\begin{matrix} y_{1, 1} & \dots & y_{1, p} & x_{1, 1} & \dots & x_{1, k} \\ ⋮ & ⋱ & ⋮ & ⋮ & ⋱ & ⋮ \\ y_{n, 1} & \dots & y_{n, p} & x_{n, 1} & \dots & x_{n, k} \end{matrix}) . \end{matrix}$ In general, when $p = 1$ the model has only one equation to satisfy for $i = 1, \dots, n$ , for example $\begin{matrix} (14.2) & y_{i} = b_{0} + b_{1} x_{i, 1} + b_{2} x_{i, 2} + \dots + b_{k} x_{i, k} + u_{i} . \end{matrix}$ Otherwise, when $p > 1$ there is more than one dependent variable and the model is composed of $p$ equations for $i = 1, \dots, n$ , i.e. the same linear model reads:
$\begin{matrix} (14.3) & {\begin{cases} y_{i, 1} = b_{1, 0} + b_{1, 1} x_{i, 1} + b_{1, 2} x_{i, 2} + \dots + b_{1, k} x_{i, k} + u_{i, 1} \\ y_{i, 2} = b_{2, 0} + b_{2, 1} x_{i, 1} + b_{2, 2} x_{i, 2} + \dots + b_{2, k} x_{i, k} + u_{i, 2} \\ ⋮ \\ y_{i, p} = b_{p, 0} + b_{p, 1} x_{i, 1} + b_{p, 2} x_{i, 2} + \dots + b_{p, k} x_{i, k} + u_{i, p} \end{cases}} \end{matrix}$ Thus the matrix of the residual components reads $\underset{n \times p}{U} = (\begin{matrix} u_{1, 1} & u_{1, 2} & \dots & u_{1, j} & \dots & u_{1, p} \\ u_{2, 1} & u_{2, 2} & \dots & u_{2, j} & \dots & u_{2, p} \\ ⋮ & ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ u_{n, 1} & u_{n, 2} & \dots & u_{n, j} & \dots & u_{n, p} \end{matrix}) .$
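As a rough illustration of how the matrices above fit together, the following sketch simulates a small multi-equation linear model and assembles the data matrix $W = (Y \; X)$ . The sizes, the random seed, the Gaussian draws and the names `b0`, `B` are assumptions made only for this example; storing the slopes of equation $j$ in column $j$ of `B` is a convenience, not a convention taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, only for reproducibility

n, k, p = 5, 3, 2                # illustrative sizes (assumptions)

X = rng.normal(size=(n, k))      # n x k matrix of regressors
b0 = rng.normal(size=p)          # one intercept per equation
B = rng.normal(size=(k, p))      # column j holds the slopes of equation j
U = rng.normal(size=(n, p))      # n x p matrix of residuals

# All p equations of the system at once: Y = b0 + X B + U
Y = b0 + X @ B + U               # n x p matrix of dependent variables

# Complete data matrix W = (Y  X), with n rows and k + p columns
W = np.hstack([Y, X])

print(W.shape)
```

Each column of `Y` corresponds to one equation of the system, so `Y[:, 0]` can be reproduced from the first intercept, the first column of `B` and the first column of `U`.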

14.2 Joint, conditional and marginals

Let’s consider the random vector $(Y, X)$ associated with the data matrix in Equation 14.1 and let’s write the joint distribution of $Y$ and $X$ , i.e. $\begin{matrix} (14.4) & \underset{\text{joint probability}}{P (Y_{1} \leq y_{1}, \dots, Y_{p} \leq y_{p}, X_{1} \leq x_{1}, \dots, X_{k} \leq x_{k})} = \underset{\text{distribution function}}{F_{Y, X} (y_{1}, \dots, y_{p}, x_{1}, \dots, x_{k})} . \end{matrix}$ In the continuous case, there exists a joint density $f_{Y, X}$ such that: $\begin{matrix} (14.5) & F_{Y, X} (y, x) = \int_{- \infty}^{x} \int_{- \infty}^{y} f_{Y, X} (s, t) d s d t . \end{matrix}$ Moreover, from the joint distribution (Equation 14.4) it is possible to recover the marginal distributions, i.e.
$\begin{matrix} (14.6) & \begin{aligned} f_{Y} (y) & = \partial_{y} F_{Y, X} (y, \infty) = \int_{- \infty}^{\infty} f_{Y, X} (y, x) d x, \\ f_{X} (x) & = \partial_{x} F_{Y, X} (\infty, x) = \int_{- \infty}^{\infty} f_{Y, X} (y, x) d y . \end{aligned} \end{matrix}$

Then, given the marginals (Equation 14.6), it is possible to compute the unconditional moments, for example:

First moment: $E {Y} = \int_{- \infty}^{\infty} y f_{Y} (y) d y$ .
Second moment: $E {Y^{2}} = \int_{- \infty}^{\infty} y^{2} f_{Y} (y) d y$ .

Applying Bayes’ theorem (Theorem 6.2), from the joint distribution (Equation 14.4) it is possible to recover the conditional distribution, i.e.
$\begin{matrix} (14.7) & f_{Y ∣ X} (y ∣ x) = \frac{f_{Y, X} (y, x)}{f_{X} (x)} ⟺ \underset{\text{joint}}{f_{Y, X} (y, x)} = \underset{\text{conditional}}{f_{Y ∣ X} (y ∣ x)} \cdot \underset{\text{marginal}}{f_{X} (x)} . \end{matrix}$ Given the conditional distribution, the conditional moments read $\begin{aligned} E {Y ∣ X = x} & = \int_{- \infty}^{\infty} y f_{Y ∣ X} (y ∣ x) d y, \\ E {Y^{2} ∣ X = x} & = \int_{- \infty}^{\infty} y^{2} f_{Y ∣ X} (y ∣ x) d y . \end{aligned}$
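The relations between joint, marginal and conditional distributions can be checked numerically on a discrete analogue, where the integrals become sums. The sketch below uses a small hypothetical joint probability table for a scalar $Y$ and a scalar $X$ ; the support points and the probabilities are invented for illustration only.

```python
import numpy as np

# Hypothetical joint pmf of (Y, X): rows index values of Y, columns values of X
y_vals = np.array([0.0, 1.0, 2.0])
x_vals = np.array([0.0, 1.0])
f_yx = np.array([[0.20, 0.10],
                 [0.20, 0.10],
                 [0.10, 0.30]])

# Marginals: sum the joint over the other variable
f_y = f_yx.sum(axis=1)
f_x = f_yx.sum(axis=0)

# Unconditional first and second moments of Y
E_y = np.sum(y_vals * f_y)
E_y2 = np.sum(y_vals**2 * f_y)

# Conditional pmf of Y given X = x: joint divided by the marginal of X
f_y_given_x = f_yx / f_x            # column j is f(y | x_j)

# Conditional first moment of Y, one value for each x
E_y_given_x = y_vals @ f_y_given_x

print(E_y, E_y_given_x)
```

Multiplying each conditional column by the corresponding marginal probability of $X$ gives back the joint table, which is the factorization stated above; averaging the conditional means with the marginal of $X$ gives back the unconditional mean.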

Example: Inference on a Multivariate Gaussian model

Example 14.1 Let’s consider a multivariate Gaussian setup, i.e. $(\begin{matrix} Y \\ X \end{matrix}) \sim MVN ((\begin{matrix} μ_{Y} \\ μ_{X} \end{matrix}), (\begin{matrix} Σ_{Y Y} & Σ_{Y X} \\ Σ_{X Y} & Σ_{X X} \end{matrix})) .$ If $(Y, X)$ are jointly normal, then the marginals are multivariate normal, i.e. $Y \sim MVN (μ_{Y}, Σ_{Y Y}), \quad X \sim MVN (μ_{X}, Σ_{X X}),$ and so are the conditional distributions, i.e. $Y ∣ X \sim MVN (μ_{Y ∣ X}, Σ_{Y Y ∣ X}), \quad X ∣ Y \sim MVN (μ_{X ∣ Y}, Σ_{X X ∣ Y}) .$ In this setup the conditional expectation of $Y$ given $X$ reads $\begin{aligned} E {Y ∣ X} & = μ_{Y ∣ X} = \\ & = μ_{Y} + Σ_{Y X} \cdot Σ_{X X}^{- 1} (X - μ_{X}) = \\ & = μ_{Y} - Σ_{Y X} \cdot Σ_{X X}^{- 1} μ_{X} + Σ_{Y X} \cdot Σ_{X X}^{- 1} X = \\ & = μ_{Y} - b_{Y ∣ X} μ_{X} + b_{Y ∣ X} X = \\ & = a_{Y ∣ X} + b_{Y ∣ X} X, \end{aligned}$ where $b_{Y ∣ X} = Σ_{Y X} \cdot Σ_{X X}^{- 1}$ and $a_{Y ∣ X} = μ_{Y} - b_{Y ∣ X} μ_{X}$ , and the conditional variance reads $V {Y ∣ X} = Σ_{Y Y ∣ X} = Σ_{Y Y} - Σ_{Y X} \cdot Σ_{X X}^{- 1} Σ_{X Y} .$ In this setup the parameters are:

Joint distribution, $θ = \{μ_{Y}, μ_{X}, Σ_{X X}, Σ_{X Y}, Σ_{Y Y}\}$ .
Conditional distribution, $λ_{1} = \{a_{Y ∣ X}, b_{Y ∣ X}, Σ_{Y Y ∣ X}\}$ .
Marginal distribution, $λ_{2} = \{μ_{X}, Σ_{X X}\}$ .

Note that both $λ_{1}$ and $λ_{2}$ are functions of $θ$ . In the Gaussian case it is possible to prove that $λ_{1}$ and $λ_{2}$ are free to vary, i.e. imposing restrictions on $λ_{1}$ does not impose restrictions on $λ_{2}$ . In general, if the parameters of interest are a function of the parameters of the conditional distribution only, i.e. $τ = f (λ_{1})$ , and $λ_{1}$ and $λ_{2}$ are free to vary, then inference can be carried out on the conditional model alone without loss of information. In this case we say that $X$ is weakly exogenous for $τ = f (λ_{1})$ .
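The mapping from the joint parameters $θ$ to the conditional parameters $λ_{1}$ in Example 14.1 can be sketched numerically with the standard formulas for the conditional Gaussian distribution. All numerical values below are hypothetical, with a scalar $Y$ ( $p = 1$ ) and a two-dimensional $X$ ( $k = 2$ ); the variable names are chosen for this example only.

```python
import numpy as np

# Hypothetical joint parameters (theta)
mu_y = np.array([1.0])
mu_x = np.array([0.0, 2.0])
S_yy = np.array([[2.0]])                 # p x p
S_yx = np.array([[0.5, 0.3]])            # p x k
S_xx = np.array([[1.0, 0.2],
                 [0.2, 1.5]])            # k x k

# Slope b = S_yx S_xx^{-1}; a linear solve replaces the explicit inverse
# (the transpose trick is valid because S_xx is symmetric)
b = np.linalg.solve(S_xx, S_yx.T).T
a = mu_y - b @ mu_x                      # intercept of the conditional mean
S_y_given_x = S_yy - b @ S_yx.T          # S_yy - S_yx S_xx^{-1} S_xy

def cond_mean(x):
    """Conditional expectation of Y given X = x."""
    return a + b @ x

print(a, b, S_y_given_x)
print(cond_mean(np.array([1.0, 1.0])))
```

Evaluating `cond_mean` at `mu_x` returns `mu_y`, and the conditional variance is smaller than `S_yy`, as expected when $Y$ and $X$ are correlated.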

14.3 Conditional expectation model

Let’s consider a very general conditional expectation model with $p = 1$ , of which the linear models are a special case. In matrix notation it can be written as: $\begin{matrix} (14.8) & y = E {y ∣ X} + u, \end{matrix}$ where the conditional expectation errors are defined as: $\begin{matrix} (14.9) & u = y - E {y ∣ X} . \end{matrix}$

Proposition 14.1 In a conditional expectation model as in Equation 14.8, the residuals $u$ , defined as in Equation 14.9, have zero unconditional expectation and zero covariance with the regressors $X$ , i.e. $E {u} = 0, \quad E {u X} = 0 .$ Moreover, the conditional expectation error is orthogonal to any transformation $g$ of the conditioning variables, i.e. $\begin{matrix} (14.10) & y = E {y ∣ X} + u ⟹ E {u g (X)} = 0 . \end{matrix}$

Proof: Proposition 14.1

Proof. Let’s start with the unconditional expectation of the residuals defined in Equation 14.9. By the tower property of conditional expectation, $\begin{aligned} E {u} & = E {y - E {y ∣ X}} = \\ & = E {y} - E {E {y ∣ X}} = \\ & = E {y} - E {y} = 0 . \end{aligned}$ Then, let’s compute the covariance between the residuals and the regressors. Since $C v {u, X} = E {u X} - E {u} E {X}$ , we have $E {u} = 0 ⟹ C v {u, X} = E {u X} .$ For simplicity let’s assume that $X$ can take only values in $\{0, 1\}$ . Applying the tower property of conditional expectation one obtains: $\begin{aligned} E {u X} & = E {E {u X ∣ X}} = \\ & = E {u X ∣ X = 0} P (X = 0) + E {u X ∣ X = 1} P (X = 1) = \\ & = E {u X ∣ X = 1} P (X = 1) . \end{aligned}$ Then, let’s substitute $u$ from Equation 14.9 and $X$ with $1$ , i.e. $\begin{aligned} E {u X} & = E {(y - E {y ∣ X}) X ∣ X = 1} P (X = 1) = \\ & = E {y ∣ X = 1} P (X = 1) - E {E {y ∣ X} ∣ X = 1} P (X = 1) = \\ & = E {y ∣ X = 1} P (X = 1) - E {y ∣ X = 1} P (X = 1) = 0 . \end{aligned}$ For a general transformation of the regressors as in Equation 14.10, the covariance is computed as: $\begin{aligned} E {u g (X)} & = E {E {u g (X) ∣ X}} = \\ & = E {g (X) E {u ∣ X}} = \\ & = E {g (X) E {y - E {y ∣ X} ∣ X}} = \\ & = E {g (X) [E {y ∣ X} - E {y ∣ X}]} = 0 . \end{aligned}$
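The binary-regressor argument used in the proof can be mirrored on simulated data. In the sketch below the data generating process, the sample size and the transformation `g` are arbitrary assumptions; the conditional expectation is replaced by its sample analogue, the group mean of `y` for each value of `x`, so the sample moments are zero by construction up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Binary regressor, X in {0, 1}
x = rng.integers(0, 2, size=n)

# Hypothetical DGP with a variance that depends on X
y = 1.0 + 2.0 * x + rng.normal(size=n) * (1 + x)

# Sample analogue of E{y | X}: mean of y within each group
cond_mean = np.array([y[x == 0].mean(), y[x == 1].mean()])

# Conditional expectation errors u = y - E{y | X}
u = y - cond_mean[x]

# An arbitrary transformation of the regressor
g = np.exp(x) + 3.0

# Sample analogues of E{u}, E{u X} and E{u g(X)}
print(u.mean(), (u * x).mean(), (u * g).mean())
```

The errors average to zero within each group, which is the sample counterpart of $E {u ∣ X} = 0$ ; any function of `x` is constant within a group, so its product with `u` averages to zero as well.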

14.4 Uniequational linear models

Let’s consider a uni-equational linear model, i.e. with $p = 1$ as in Equation 14.2. In compact matrix notation it is expressed as: $\begin{matrix} (14.11) & \underset{n \times 1}{y} = \underset{n \times k}{X} \underset{k \times 1}{b} + \underset{n \times 1}{u}, \end{matrix}$ where $b$ and $u$ represent the true parameters and residuals in the population. Let’s consider a sample of $n$ observations extracted from a population; then the matrix of the regressors $X$ reads as in Equation 13.1, while the vectors of the dependent variable and of the residuals read $\underset{n \times 1}{y} = (\begin{matrix} y_{1} \\ ⋮ \\ y_{n} \end{matrix}), \quad \underset{n \times 1}{u} = (\begin{matrix} u_{1} \\ ⋮ \\ u_{n} \end{matrix}) .$ Hence, the matrix of data $W$ is composed of: $\begin{matrix} (14.12) & (\begin{matrix} y & X \end{matrix}) = (\begin{matrix} y_{1} & x_{1, 1} & \dots & x_{1, k} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ y_{n} & x_{n, 1} & \dots & x_{n, k} \end{matrix}) . \end{matrix}$

14.4.1 Estimators of b

In general, the true population parameter $b$ is unknown and lives in the parameter space, i.e. $b \in Θ_{b} \subset \mathbb{R}^{k}$ . In the following section we will denote by $Q$ the estimator function, while $\hat{b}$ will denote one of its possible estimates.

Depending on the specification of the model, the function $Q$ takes as input the matrix of data and returns as output a vector of estimates, i.e. $Q : W ⟶ Θ_{b}, \quad \text{such that} \quad Q (W) = \hat{b} \in Θ_{b} .$ In this context, the fitted values $\hat{y}$ can be seen as a function of the estimate $\hat{b}$ , i.e. $\begin{matrix} (14.13) & \hat{y} (\hat{b}) = X \hat{b} . \end{matrix}$ Consequently, the fitted residuals $\hat{u}$ , which measure the discrepancies between the observed and the fitted values, are also a function of $\hat{b}$ , i.e. $\begin{matrix} (14.14) & \hat{u} (\hat{b}) = y - \hat{y} (\hat{b}) = y - X \hat{b} . \end{matrix}$
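To make the roles of $Q$ , $\hat{b}$ , $\hat{y}$ and $\hat{u}$ concrete, here is a minimal sketch in which $Q$ is taken to be a least squares fit. This choice is an assumption made for illustration, since the text does not fix a specific estimator at this point; the simulated data and the "true" vector `b` are also hypothetical.

```python
import numpy as np

def Q(W):
    """One possible estimator function: least squares on W = (y  X)."""
    y, X = W[:, 0], W[:, 1:]
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b_hat

# Simulated data from a hypothetical linear DGP
rng = np.random.default_rng(1)
n, k = 200, 3
X = rng.normal(size=(n, k))
b = np.array([0.5, -1.0, 2.0])          # "true" parameters (assumed)
u = rng.normal(scale=0.1, size=n)       # residuals
y = X @ b + u

W = np.column_stack([y, X])             # matrix of data (y  X)

b_hat = Q(W)                            # estimate of b
y_hat = X @ b_hat                       # fitted values
u_hat = y - y_hat                       # fitted residuals

print(b_hat)
```

A different estimator would only change the body of `Q` ; the fitted values and fitted residuals would still be computed from `b_hat` in the same way.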

Different optimal estimators of b

As we will see in Chapter 15, the assumptions on the variance of the residuals determine the optimal estimator of $b$ . In general, the residuals could be

homoskedastic: the residuals are uncorrelated and their variance is equal for each observation.
heteroskedastic: the residuals are uncorrelated and their variance is different for each observation.
autocorrelated: the residuals are correlated and their variance can be either equal or different for each observation.

As shown in Figure 14.1, depending on which of the three assumptions above holds, the optimal estimator of $b$ is obtained with Ordinary Least Squares (OLS) in the first case (homoskedastic residuals) and with Generalized Least Squares (GLS) in the second and third cases (heteroskedastic or autocorrelated residuals).

Figure 14.1: Different classes of estimator for linear models.

For example, if the residuals are correlated, then their conditional variance-covariance matrix reads $Σ = V {u ∣ X} = E {u u^{⊤} ∣ X},$ or more explicitly,
$\begin{matrix} (14.15) & \underset{n \times n}{Σ} = E {(\begin{matrix} u_{1}^{2} & u_{1} u_{2} & \dots & u_{1} u_{n} \\ u_{2} u_{1} & u_{2}^{2} & \dots & u_{2} u_{n} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ u_{n} u_{1} & u_{n} u_{2} & \dots & u_{n}^{2} \end{matrix}) ∣ X} = (\begin{matrix} σ_{1}^{2} & σ_{1, 2} & \dots & σ_{1, n} \\ σ_{2, 1} & σ_{2}^{2} & \dots & σ_{2, n} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ σ_{n, 1} & σ_{n, 2} & \dots & σ_{n}^{2} \end{matrix}) . \end{matrix}$ Since the matrix $Σ$ is symmetric, the number of distinct elements above (or below) the diagonal is $\binom{n}{2} = \frac{n (n - 1)}{2} .$ Hence, although $Σ$ has $n \times n$ elements, the number of unique values (free elements) is given by the $n$ variances plus the $\frac{n (n - 1)}{2}$ covariances, i.e. $\text{free elements} = n + \frac{n (n - 1)}{2} = \frac{n (n + 1)}{2} > n \quad \text{for } n > 1 .$
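The count of free elements can be verified on a small example. The sketch below builds an arbitrary symmetric matrix (the random construction is only a convenient way to obtain one) and counts the entries on and above the diagonal; the helper name `free_elements` is hypothetical.

```python
import numpy as np

def free_elements(n):
    """Number of distinct entries of a symmetric n x n matrix."""
    return n * (n + 1) // 2

n = 4
rng = np.random.default_rng(0)
A = rng.normal(size=(n, n))
Sigma = A @ A.T                        # symmetric matrix, illustrative only

# Entries on the diagonal plus those strictly above it
upper = Sigma[np.triu_indices(n)]

print(len(upper), free_elements(n))    # both equal 10 for n = 4
```

With $n = 4$ observations there are already 10 distinct variances and covariances, more than the number of observations.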