31 Linear Algebra
31.1 Vector multiplication
Consider a vector of this form,
\[ \underset{\color{red}{n \times 1}}{\mathbf{y}} = \begin{pmatrix} y_{1} \\ \vdots \\ y_{i} \\ \vdots \\ y_{n} \\ \end{pmatrix} \text{,} \] then \[ \underset{\color{red}{n \times 1}}{\mathbf{y}} \; \underset{\color{red}{1 \times n}}{\mathbf{y}^{\top}} = \underset{\color{red}{n \times n}}{\mathbf{y} \mathbf{y}^{\top}} = \begin{pmatrix} y_{1}^2 & \dots & y_{1} y_{i} & \dots & y_{1} y_{n} \\ \vdots & \ddots & \vdots & & \vdots \\ y_{i} y_{1} & \dots & y_{i}^2 & \dots & y_{i} y_{n} \\ \vdots & & \vdots & \ddots & \vdots \\ y_{n} y_{1} & \dots & y_{n} y_{i} & \dots & y_{n}^2 \end{pmatrix} \text{,} \] and \[ \underset{\color{red}{1 \times n}}{\mathbf{y}^{\top}} \; \underset{\color{red}{n \times 1}}{\mathbf{y}} = \sum_{i = 1}^{n} y_i^2 = \underset{\color{red}{1 \times 1}}{\mathbf{y}^{\top} \mathbf{y}} \text{.} \]
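Both products can be checked numerically. A minimal sketch with NumPy (the vector values are illustrative):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])   # a vector of length n = 3

outer = np.outer(y, y)          # y y^T: an n x n matrix with entries y_i y_j
inner = y @ y                   # y^T y: the scalar sum of squares

print(outer.shape)              # (3, 3)
print(inner)                    # 1 + 4 + 9 = 14.0
```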
31.2 Matrix multiplication
Consider a matrix of this form,
\[ \underset{\color{red}{n \times k}}{\mathbf{X}} = \begin{pmatrix} x_{11} & \dots & x_{1j} & \dots & x_{1k} \\ \vdots & \ddots & \vdots & & \vdots \\ x_{i1} & \dots & x_{ij} & \dots & x_{ik} \\ \vdots & & \vdots & \ddots & \vdots \\ x_{n1} & \dots & x_{nj} & \dots & x_{nk} \\ \end{pmatrix} \text{,} \] then \[ \underset{\color{red}{k \times k}}{\mathbf{M}} = \underset{\color{red}{n \times k}}{\mathbf{X}}^{\top} \; \underset{\color{red}{n \times k}}{\mathbf{X}} = \begin{pmatrix} \sum_{i=1}^{n} x_{i1}^2 & \dots & \sum_{i=1}^{n} x_{i1}x_{ij} & \dots & \sum_{i=1}^{n} x_{i1}x_{ik} \\ \vdots & \ddots & \vdots & & \vdots \\ \sum_{i=1}^{n} x_{ij} x_{i1} & \dots & \sum_{i=1}^{n} x_{ij}^2 & \dots & \sum_{i=1}^{n} x_{ij} x_{ik} \\ \vdots & & \vdots & \ddots & \vdots \\ \sum_{i=1}^{n} x_{ik} x_{i1} & \dots & \sum_{i=1}^{n} x_{ik} x_{ij} & \dots & \sum_{i=1}^{n} x_{ik}^2 \\ \end{pmatrix} \text{,} \] and \[ \underset{\color{red}{k \times 1}}{\mathbf{m}} = \underset{\color{red}{n \times k}}{\mathbf{X}}^{\top} \underset{\color{red}{n \times 1}}{\mathbf{y}} = \begin{pmatrix} \sum_{i=1}^{n} x_{i1} y_{i} \\ \vdots \\ \sum_{i=1}^{n} x_{ij} y_{i} \\ \vdots \\ \sum_{i=1}^{n} x_{ik} y_{i} \\ \end{pmatrix} \text{.} \]
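The cross-product forms \(\mathbf{X}^{\top}\mathbf{X}\) and \(\mathbf{X}^{\top}\mathbf{y}\) can be verified entry by entry; a sketch with a randomly generated design matrix (names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3
X = rng.normal(size=(n, k))     # n x k matrix
y = rng.normal(size=n)          # vector of length n

M = X.T @ X                     # k x k matrix of cross products
m = X.T @ y                     # vector of length k

# Entry (j, l) of M is sum_i x_{ij} x_{il}; entry j of m is sum_i x_{ij} y_i
assert np.isclose(M[0, 1], np.sum(X[:, 0] * X[:, 1]))
assert np.isclose(m[2], np.sum(X[:, 2] * y))
assert M.shape == (k, k) and m.shape == (k,)
```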
31.3 Special matrices
31.3.1 Basis vector
\[ \mathbf{e}_p^{\top} = \begin{pmatrix} 1 & 0 & \dots & 0 \end{pmatrix} \text{.} \tag{31.1}\] The standard basis vector of length \(p\), with a 1 in the first position and 0 elsewhere.
31.3.2 Matrix of ones
In mathematics, a matrix of ones (also called an all-ones matrix) is a matrix where every entry equals 1, i.e. \[ \begin{aligned} \mathbf{J}_{2} {} & = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \quad\quad\; \mathbf{J}_{3} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \\ \mathbf{J}_{3,2} & = \begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{pmatrix} \quad\quad \mathbf{J}_{2,3} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \end{aligned} \tag{31.2}\] In general, when two indices are provided, e.g., \(\mathbf{J}_{3,2}\), the first indicates the number of rows and the second the number of columns. When only one index is given, e.g., \(\mathbf{J}_3\), it denotes a square matrix of size \(3 \times 3\). Some basic matrix operations include:
- \(\mathbf{J}_{n,1} \cdot \mathbf{J}_{1,n} = \mathbf{J}_{n}\)
- \(\mathbf{J}_{1,n} \cdot \mathbf{J}_{n, 1} = n \, \mathbf{J}_{1} = n\)
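Both identities follow directly from the multiplication rule; a quick numerical check with NumPy:

```python
import numpy as np

n = 4
J_n1 = np.ones((n, 1))          # column of ones, J_{n,1}
J_1n = np.ones((1, n))          # row of ones, J_{1,n}

# J_{n,1} J_{1,n} = J_n, the n x n all-ones matrix
assert np.array_equal(J_n1 @ J_1n, np.ones((n, n)))

# J_{1,n} J_{n,1} is a 1 x 1 matrix holding the scalar n
assert (J_1n @ J_n1).item() == n
```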
31.3.3 Identity matrix
In linear algebra, the identity matrix of size \(n\) is the \(n \times n\) square matrix with ones on the main diagonal and zeros elsewhere, i.e. \[ \mathbf{I}_{1} = \begin{pmatrix} 1 \end{pmatrix}, \; \mathbf{I}_{2} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \; \mathbf{I}_{3} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \; \mathbf{I}_{4} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \text{.} \tag{31.3}\]
31.4 Determinant
The determinant is a scalar value that can be computed from a square matrix. It provides important information about the matrix, such as whether it is invertible, its volume-scaling factor in linear transformations, and the linear dependence of its rows or columns. For a matrix \(\mathbf{A} \in \mathbb{R}^{n \times n}\), the determinant is denoted \(\det(\mathbf{A})\). Consider two matrices \(\mathbf{A}_{n\times n}\) and \(\mathbf{B}_{n\times n}\), then the determinant satisfies some properties.
- Scalar: \(\det(a \mathbf{A}) = a^n \det(\mathbf{A})\) for \(a \in \mathbb{R}\).
- Transpose: \(\det(\mathbf{A}^{\top}) = \det(\mathbf{A})\).
- Multiplication: \(\det(\mathbf{A} \mathbf{B}) = \det(\mathbf{A}) \det(\mathbf{B})\).
- Inverse: \(\det(\mathbf{A}^{-1}) = \frac{1}{\det(\mathbf{A})}\).
- Rank: if
- \(\det(\mathbf{A}) \neq 0\), then \(\operatorname{rank}(\mathbf{A}) = n\) (full rank).
- \(\det(\mathbf{A}) = 0\), then \(\operatorname{rank}(\mathbf{A}) < n\).
The determinant of an \(n \times n\) matrix can be computed by expanding along any row or column. Expanding along the \(i\)-th row gives: \[ \det (\mathbf{A}) = \sum_{j = 1}^{n} (-1)^{i+j} a_{ij} \det(\mathbf{M}_{ij}) \text{,} \tag{31.4}\] where \(\mathbf{M}_{ij}\) is the sub-matrix obtained by deleting the \(i\)-th row and the \(j\)-th column. For example, for a \(2 \times 2\) matrix the determinant simplifies to a well-known formula: \[ \mathbf{A} = \begin{pmatrix} {\color{red}{a}} & {\color{blue}{b}} \\ {\color{blue}{c}} & {\color{red}{d}} \end{pmatrix} \implies \det (\mathbf{A}) = {\color{red}{a}} {\color{red}{d}} - {\color{blue}{b}} {\color{blue}{c}} \text{.} \tag{31.5}\]
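The listed properties of the determinant can all be verified numerically; a sketch with random matrices (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
a, n = 2.0, 3

# Scalar rule: det(a A) = a^n det(A)
assert np.isclose(np.linalg.det(a * A), a**n * np.linalg.det(A))
# Transpose: det(A^T) = det(A)
assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))
# Multiplication: det(AB) = det(A) det(B)
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
# Inverse: det(A^{-1}) = 1 / det(A)
assert np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / np.linalg.det(A))
```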
Let’s consider a generic \(3 \times 3\) matrix, i.e. \[ \mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \text{,} \] then, let’s fix the row \(i = 1\) and develop the Laplace expansion (Equation 31.4), i.e. \[ \begin{aligned} \det (\mathbf{A}) & {} = (-1)^{1+1} a_{11} \det(\mathbf{M}_{11}) + (-1)^{1+2} a_{12} \det(\mathbf{M}_{12}) + (-1)^{1+3} a_{13} \det(\mathbf{M}_{13}) = \\ & = a_{11} \det(\mathbf{M}_{11}) - a_{12} \det(\mathbf{M}_{12}) + a_{13} \det(\mathbf{M}_{13}) \end{aligned} \tag{31.6}\] where the sub-matrices \(\mathbf{M}_{11}\), \(\mathbf{M}_{12}\) and \(\mathbf{M}_{13}\) read explicitly as: \[ \mathbf{M}_{11} = \begin{pmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{pmatrix}, \quad \mathbf{M}_{12} = \begin{pmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{pmatrix}, \quad \mathbf{M}_{13} = \begin{pmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix} \text{.} \] The determinants of these \(2\times2\) matrices are easily computed as: \[ \begin{aligned} {} & \det(\mathbf{M}_{11}) = \det \begin{pmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{pmatrix} = a_{22} a_{33} - a_{23} a_{32} \\ & \det(\mathbf{M}_{12}) = \det \begin{pmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{pmatrix} = a_{21} a_{33} - a_{23} a_{31} \\ & \det(\mathbf{M}_{13}) = \det \begin{pmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix} = a_{21} a_{32} - a_{22} a_{31} \end{aligned} \tag{31.7}\] Finally, coming back to Equation 31.6 and substituting the results of Equation 31.7, one obtains: \[ \det(\mathbf{A}) = a_{11} (a_{22} a_{33} - a_{23} a_{32}) - a_{12} (a_{21} a_{33} - a_{23} a_{31}) + a_{13} (a_{21} a_{32} - a_{22} a_{31}) \text{.} \]
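The expansion above can be checked against a library routine; a sketch with an arbitrary \(3 \times 3\) matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Laplace expansion along the first row (Equation 31.6)
det_laplace = (A[0, 0] * (A[1, 1] * A[2, 2] - A[1, 2] * A[2, 1])
             - A[0, 1] * (A[1, 0] * A[2, 2] - A[1, 2] * A[2, 0])
             + A[0, 2] * (A[1, 0] * A[2, 1] - A[1, 1] * A[2, 0]))

print(det_laplace)                              # 8.0
assert np.isclose(det_laplace, np.linalg.det(A))
```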
31.5 Trace
The trace of a square matrix \(\mathbf{A} \in \mathbb{R}^{n \times n}\) is the sum of its diagonal elements. It is also equal to the sum of the eigenvalues \(\lambda_i\) of \(\mathbf{A}\), counted with algebraic multiplicity, i.e. \[ \operatorname{tr}(\mathbf{A}) = \sum_{i = 1}^{n} a_{ii} = \sum_{i = 1}^{n} \lambda_{i} \text{.} \tag{31.8}\]
Some properties of the trace operator include:
- \(\operatorname{tr}(\mathbf{A}^{\top}) = \operatorname{tr}(\mathbf{A})\).
- \(\operatorname{tr}(\mathbf{A} + \mathbf{B}) = \operatorname{tr}(\mathbf{A}) + \operatorname{tr}(\mathbf{B})\).
- \(\operatorname{tr}(a \mathbf{A}) = a \cdot \operatorname{tr}(\mathbf{A})\) for \(a \in \mathbb{R}\).
- \(\operatorname{tr}(\mathbf{A}^{k}) = \sum_{i = 1}^{n} \lambda_{i}^k\) for integer \(k\), where \(\lambda_i\) is the \(i\)-th eigenvalue of \(\mathbf{A}\).
- \(\operatorname{tr}(\mathbf{A}^{-1}) = \sum_{i = 1}^{n} \frac{1}{\lambda_{i}}\).
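These eigenvalue identities can be verified numerically; a sketch with a symmetric matrix, whose eigenvalues are guaranteed real:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])            # symmetric, so eigenvalues are real

lam = np.linalg.eigvalsh(A)           # eigenvalues of a symmetric matrix

# tr(A) = sum of diagonal elements = sum of eigenvalues
assert np.isclose(np.trace(A), lam.sum())
# tr(A^2) = sum of squared eigenvalues
assert np.isclose(np.trace(A @ A), np.sum(lam**2))
# tr(A^{-1}) = sum of reciprocal eigenvalues (A must be invertible)
assert np.isclose(np.trace(np.linalg.inv(A)), np.sum(1.0 / lam))
```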
31.6 Derivatives
31.6.1 Gradient
Let \(f\) be a function such that \(f: \mathbb{R}^n \to \mathbb{R}\), namely \[ f: \underset{\mathbf{x} \in \mathbb{R}^n}{\underbrace{\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}}} \to f(\mathbf{x}) \] then its gradient is a function \(\nabla_{f}: \mathbb{R}^n \to \mathbb{R}^n\) defined as \[ \nabla_{f}: \underset{\mathbf{x} \in \mathbb{R}^n}{\underbrace{\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}}} \to \begin{pmatrix} \frac{\partial f}{\partial x_1} (\mathbf{x}) \\ \frac{\partial f}{\partial x_2} (\mathbf{x}) \\ \vdots \\ \frac{\partial f}{\partial x_n} (\mathbf{x}) \end{pmatrix} \tag{31.9}\] and evaluated at the vector \(\mathbf{x} = (x_1, \dots, x_n)\).
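The gradient can be approximated without analytic derivatives; a minimal sketch using central finite differences (function and step size are illustrative):

```python
import numpy as np

def f(x):
    # example scalar field f(x) = x1^2 + 3 x2
    return x[0]**2 + 3.0 * x[1]

def grad_fd(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient (Equation 31.9)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, 2.0])
g = grad_fd(f, x)        # analytic gradient is (2 x1, 3) = (2.0, 3.0)
```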
31.6.2 Jacobian
Let \(\mathbf{f}\) be a multivariate function from \(\mathbb{R}^n \to \mathbb{R}^m\), i.e. \[ \mathbf{f}: \underset{\mathbf{x} \in \mathbb{R}^n}{\underbrace{\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}}} \to \underset{\mathbf{f}(\mathbf{x}) \in \mathbb{R}^m}{\underbrace{\begin{pmatrix} f_1(\mathbf{x}) \\ \vdots \\ f_m(\mathbf{x}) \end{pmatrix}}} \] then its Jacobian is a function \(\mathbf{J}_{f}: \mathbb{R}^n \to \mathbb{R}^{m \times n}\) that takes as input a vector in \(\mathbb{R}^n\) and returns as output an \(m \times n\) matrix. More precisely: \[ \mathbf{J}_{f}(\mathbf{x}) = \begin{pmatrix} \nabla_{f_1}^{\top}(\mathbf{x}) \\ \vdots \\ \nabla_{f_m}^{\top}(\mathbf{x}) \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(\mathbf{x}) & \frac{\partial f_1}{\partial x_2}(\mathbf{x}) & \dots & \frac{\partial f_1}{\partial x_n}(\mathbf{x}) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(\mathbf{x}) & \frac{\partial f_m}{\partial x_2}(\mathbf{x}) & \dots & \frac{\partial f_m}{\partial x_n}(\mathbf{x}) \\ \end{pmatrix} \tag{31.10}\]
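The same finite-difference idea extends to vector-valued functions, filling the Jacobian one column at a time; a sketch with an illustrative map \(\mathbb{R}^2 \to \mathbb{R}^3\):

```python
import numpy as np

def fvec(x):
    # example map f: R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1]**2])

def jacobian_fd(f, x, h=1e-6):
    """m x n Jacobian (Equation 31.10) via central differences,
    one column per input coordinate."""
    m = f(x).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = np.array([0.0, 2.0])
J = jacobian_fd(fvec, x)
# analytic Jacobian at (0, 2):
# [[x2, x1], [cos x1, 0], [0, 2 x2]] = [[2, 0], [1, 0], [0, 4]]
assert np.allclose(J, [[2.0, 0.0], [1.0, 0.0], [0.0, 4.0]], atol=1e-4)
```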
31.7 Lagrange Multipliers
In order to solve a constrained minimization problem, one can minimize the Lagrangian, i.e. \[ L(\boldsymbol{\theta}, \boldsymbol{\lambda}) = f(\boldsymbol{\theta}) - \boldsymbol{\lambda}^{\top} g(\boldsymbol{\theta}) \text{,} \tag{31.11}\] where \(\boldsymbol{\lambda}\) is the vector of the Lagrange multipliers. In fact, minimizing \(L(\boldsymbol{\theta}, \boldsymbol{\lambda})\) is equivalent to finding the value of \(\boldsymbol{\theta}\) that minimizes \(f(\boldsymbol{\theta})\) under the constraint \(g(\boldsymbol{\theta}) = 0\). It is possible to prove that the minimum is found as: \[ \hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\text{argmin}}\, L(\boldsymbol{\theta}, \boldsymbol{\lambda}) \iff \begin{cases} \partial_{\boldsymbol{\theta}} L(\hat{\boldsymbol{\theta}} , \boldsymbol{\lambda}) = 0 & (A) \\ \partial_{\boldsymbol{\lambda}} L(\hat{\boldsymbol{\theta}} , \boldsymbol{\lambda}) = 0 & (B) \end{cases} \tag{31.12}\]
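As a worked illustration (the objective and constraint below are my own example, not from the text): minimize \(f(\theta_1, \theta_2) = \theta_1^2 + \theta_2^2\) subject to \(g(\theta_1, \theta_2) = \theta_1 + \theta_2 - 1 = 0\). The stationarity conditions (A) and (B) of Equation 31.12 are linear here, so the solution can be read off from one linear solve:

```python
import numpy as np

# L = t1^2 + t2^2 - lambda * (t1 + t2 - 1); setting its partial
# derivatives to zero gives the linear system K @ (t1, t2, lambda) = b:
K = np.array([[2.0, 0.0, -1.0],   # dL/dt1     = 2 t1 - lambda        = 0
              [0.0, 2.0, -1.0],   # dL/dt2     = 2 t2 - lambda        = 0
              [1.0, 1.0,  0.0]])  # dL/dlambda = -(t1 + t2 - 1)       = 0
b = np.array([0.0, 0.0, 1.0])

t1, t2, lam = np.linalg.solve(K, b)
# minimum at t = (0.5, 0.5) with multiplier lambda = 1.0
```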