31  Gaussian mixture

Setup
library(dplyr)
# required for figures 
library(ggplot2)
library(gridExtra)
# required to render LaTeX 
library(backports)
library(latex2exp)
# required to render tables
library(knitr)
library(kableExtra)
# mixture density and distribution (dmixnorm, pmixnorm) used below
library(extraDistr)
library(solarr)
# Random seed 
set.seed(1)

Let's consider a linear combination of a Bernoulli and two normal random variables, all assumed to be independent, i.e.
$$X_t = B_t X_{1,t} + (1 - B_t) X_{0,t}, \tag{31.1}$$
where $B_t$ is a Bernoulli random variable, $B_t \sim \mathcal{B}(p)$, and for the $i$-th component $X_{i,t} = \mu_i + \sigma_i Z_{i,t}$, where $Z_{i,t}$ is a standard Normal random variable. In compact form, a Gaussian mixture with two components is denoted as $X_t \sim \mathcal{GM}(\mu_1, \mu_0, \sigma_1^2, \sigma_0^2, p)$.
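The simulation chunk behind Figure 31.1 is not shown here; a minimal sketch of Equation (31.1) with the same parameter values could look as follows (the names par and Xt are illustrative, matching the objects reused later in the maximum likelihood example):

# True parameters (as in Figure 31.1)
p <- 0.5; mu1 <- -2; mu0 <- 2; sd1 <- 1; sd0 <- 1
par <- c(mu1, mu0, sd1, sd0, p)
# Simulate n draws from the two-component mixture (Equation 31.1)
n <- 5000
B  <- rbinom(n, size = 1, prob = p)   # Bernoulli switch
X1 <- rnorm(n, mean = mu1, sd = sd1)  # first component
X0 <- rnorm(n, mean = mu0, sd = sd0)  # second component
Xt <- B*X1 + (1 - B)*X0               # Gaussian mixture sample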

(a) Simulated mixture sample.
(b) Simulated (black) and true (red) density.
(c) Simulated (black) and true (red) distribution.
Figure 31.1: Gaussian mixture simulation and density function with true parameters $\mu_1 = -2$, $\mu_2 = 2$, $\sigma_1 = 1$, $\sigma_2 = 1$ and $p = 0.5$.

31.1 Distribution and density

Proposition 31.1 The distribution function of a Gaussian mixture random variable is a weighted sum of the distributions of the components, i.e.:
$$F_X(x) = p\, F_{X_1}(x) + (1 - p)\, F_{X_0}(x). \tag{31.2}$$
Taking the derivative, it can easily be shown that the density function reads:
$$f_X(x) = p\, f_{X_1}(x) + (1 - p)\, f_{X_0}(x). \tag{31.3}$$
In general,
$$f_X(x) = \frac{p}{\sigma_1}\, \phi\left(\frac{x - \mu_1}{\sigma_1}\right) + \frac{1 - p}{\sigma_0}\, \phi\left(\frac{x - \mu_0}{\sigma_0}\right), \tag{31.4}$$
where $\Phi$ is the cumulative distribution function and $\phi$ the density function of a standard normal random variable. An implementation of the density and distribution of a Gaussian mixture is available in the R package extraDistr: dmixnorm for the density and pmixnorm for the distribution.

Proof. From the formal definition of the distribution function of a random variable $X$,
$$F_X(x) = P(X \le x) = E\{\mathbb{1}_{\{X \le x\}}\}.$$
If $X$ is a Gaussian mixture with two components, we can apply the law of total expectation and condition on $B$, i.e.
$$\begin{aligned}
F_X(x) &= E\{E\{\mathbb{1}_{\{X \le x\}} \mid B\}\} \\
&= E\{\mathbb{1}_{\{X \le x\}} \mid B = 1\}\, P(B = 1) + E\{\mathbb{1}_{\{X \le x\}} \mid B = 0\}\, P(B = 0) \\
&= p\, P(X_1 \le x) + (1 - p)\, P(X_0 \le x).
\end{aligned}$$
Hence, standardizing the Normal random variables one obtains
$$F_X(x) = p\, \Phi\left(\frac{x - \mu_1}{\sigma_1}\right) + (1 - p)\, \Phi\left(\frac{x - \mu_0}{\sigma_0}\right),$$
where $\Phi$ denotes the distribution function of a standard normal. Knowing that $f_X(x) = \frac{d F_X(x)}{dx}$ and that $\phi(x) = \frac{d \Phi(x)}{dx}$, where $\phi$ is the density function of a standard normal, we obtain the result, i.e.
$$f_X(x) = \frac{p}{\sigma_1}\, \phi\left(\frac{x - \mu_1}{\sigma_1}\right) + \frac{1 - p}{\sigma_0}\, \phi\left(\frac{x - \mu_0}{\sigma_0}\right).$$
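As a quick sanity check of Equations (31.2) and (31.4), the weighted sums above can be compared with dmixnorm and pmixnorm from extraDistr (a minimal sketch; parameter values as in Figure 31.1):

# Mixture parameters
mu <- c(-2, 2); sd <- c(1, 1); p <- 0.5
x <- 0.3
# Density: weighted sum of the component densities (Equation 31.4)
f_manual <- p*dnorm(x, mu[1], sd[1]) + (1 - p)*dnorm(x, mu[2], sd[2])
f_pkg    <- dmixnorm(x, mean = mu, sd = sd, alpha = c(p, 1 - p))
# Distribution: weighted sum of the component distributions (Equation 31.2)
F_manual <- p*pnorm(x, mu[1], sd[1]) + (1 - p)*pnorm(x, mu[2], sd[2])
F_pkg    <- pmixnorm(x, mean = mu, sd = sd, alpha = c(p, 1 - p))
c(density = all.equal(f_manual, f_pkg), distribution = all.equal(F_manual, F_pkg))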

31.2 Moment generating function

Proposition 31.2 The moment generating function of a Gaussian mixture random variable (31.1) in $u$ reads:
$$M_X(u) = p\, M_{X_1}(u) + (1 - p)\, M_{X_0}(u),$$
where for a generic $i \in \{0, 1\}$, $M_{X_i}(u)$ is the moment generating function of a Gaussian random variable with mean $\mu_i$ and variance $\sigma_i^2$, i.e.
$$M_{X_i}(u) = \exp\left\{\mu_i u + \frac{u^2 \sigma_i^2}{2}\right\}.$$

Proof. Applying the definition of moment generating function, and noting that since $B \in \{0, 1\}$ we have $e^{uX} = B\, e^{u X_1} + (1 - B)\, e^{u X_0}$, independence and the linearity of the expectation give:
$$\begin{aligned}
M_X(u) = E\{e^{uX}\} &= p\, E\{e^{u X_1}\} + (1 - p)\, E\{e^{u X_0}\} \\
&= p\, M_{X_1}(u) + (1 - p)\, M_{X_0}(u).
\end{aligned}$$
Hence, the moment generating function of $X$ is a linear combination of the moment generating functions of the two components.
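A minimal Monte Carlo sketch of Proposition 31.2: the closed-form mixture MGF is compared with the sample average of $e^{uX}$ (parameter values as in Figure 31.1, $u$ chosen arbitrarily):

# Monte Carlo check of the mixture MGF
set.seed(1)
mu <- c(-2, 2); sd <- c(1, 1); p <- 0.5; u <- 0.4
B <- rbinom(1e5, 1, p)
X <- B*rnorm(1e5, mu[1], sd[1]) + (1 - B)*rnorm(1e5, mu[2], sd[2])
# Closed form: weighted sum of the Gaussian MGFs
M_theory <- p*exp(mu[1]*u + 0.5*u^2*sd[1]^2) + (1 - p)*exp(mu[2]*u + 0.5*u^2*sd[2]^2)
# Simulated estimate of E{exp(uX)}
M_mc <- mean(exp(u*X))
c(theory = M_theory, monte_carlo = M_mc)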

31.3 Esscher transform

Proposition 31.3 The Esscher transform of a Gaussian mixture random variable reads:
$$\mathcal{E}_\theta\{f_X\}(x) = p_1(\theta)\, f_{X_1}(x; \theta) + p_0(\theta)\, f_{X_0}(x; \theta),$$
where for $i \in \{0, 1\}$:
$$f_{X_i}(x; \theta) = \frac{1}{\sigma_i}\, \phi\left(\frac{x - \mu_i - \theta\sigma_i^2}{\sigma_i}\right),$$
and the distorted probabilities are defined as:
$$p_1(\theta) = p\, \frac{M_{X_1}(\theta)}{M_X(\theta)}, \qquad p_0(\theta) = 1 - p_1(\theta).$$

Proof. In general, the Esscher transform of a density function $f_X$ is defined as:
$$\mathcal{E}_\theta\{f_X\}(x) = \frac{e^{\theta x} f_X(x)}{M_X(\theta)} = \frac{e^{\theta x} f_X(x)}{\int_{-\infty}^{+\infty} e^{\theta y} f_X(y)\, dy}.$$
Substituting the density function of a Gaussian mixture, one obtains:
$$\mathcal{E}_\theta\{f_X\}(x) = \frac{e^{\theta x}\left(p\, f_{X_1}(x) + (1 - p)\, f_{X_0}(x)\right)}{M_X(\theta)}.$$
Let's focus on the numerator and consider the $i$-th component for $i \in \{0, 1\}$, writing its density explicitly, i.e.
$$e^{\theta x} f_{X_i}(x) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left\{-\frac{(x - \mu_i)^2}{2\sigma_i^2} + \theta x\right\}.$$
Let's expand the exponent of the exponential term for the $i$-th component, i.e.
$$\theta x - \frac{(x - \mu)^2}{2\sigma^2} = \theta x - \frac{x^2 - 2\mu x + \mu^2}{2\sigma^2} = -\frac{x^2}{2\sigma^2} + \left(\theta + \frac{\mu}{\sigma^2}\right)x - \frac{\mu^2}{2\sigma^2}.$$
Let's denote $A = \theta + \frac{\mu}{\sigma^2}$ and complete the square:
$$-\frac{x^2}{2\sigma^2} + A x = -\frac{1}{2\sigma^2}\left[x^2 - 2\sigma^2 A x + A^2\sigma^4\right] + \frac{A^2\sigma^2}{2} = -\frac{(x - A\sigma^2)^2}{2\sigma^2} + \frac{A^2\sigma^2}{2}.$$
Therefore,
$$\begin{aligned}
\theta x - \frac{(x - \mu_i)^2}{2\sigma_i^2} &= -\frac{(x - \mu_i - \theta\sigma_i^2)^2}{2\sigma_i^2} - \frac{\mu_i^2}{2\sigma_i^2} + \frac{(\theta\sigma_i^2 + \mu_i)^2}{2\sigma_i^2} \\
&= -\frac{(x - \mu_i - \theta\sigma_i^2)^2}{2\sigma_i^2} - \frac{\mu_i^2}{2\sigma_i^2} + \frac{\theta^2\sigma_i^2}{2} + \frac{\mu_i^2}{2\sigma_i^2} + \theta\mu_i \\
&= -\frac{(x - \mu_i - \theta\sigma_i^2)^2}{2\sigma_i^2} + \frac{\theta^2\sigma_i^2}{2} + \theta\mu_i.
\end{aligned}$$
Hence, adding and subtracting inside the exponential one obtains
$$\begin{aligned}
e^{\theta x} f_{X_i}(x) &= \frac{1}{\sqrt{2\pi}\sigma_i} \exp\left\{-\frac{(x - \mu_i)^2}{2\sigma_i^2} + \theta x - \mu_i\theta - \frac{\theta^2\sigma_i^2}{2}\right\} \exp\left\{\mu_i\theta + \frac{\theta^2\sigma_i^2}{2}\right\} \\
&= \exp\left\{\mu_i\theta + \frac{\theta^2\sigma_i^2}{2}\right\} \frac{1}{\sqrt{2\pi}\sigma_i} \exp\left\{-\frac{(x - \mu_i - \theta\sigma_i^2)^2}{2\sigma_i^2}\right\} \\
&= M_{X_i}(\theta)\, f_{X_i}(x; \theta),
\end{aligned}$$
where
$$f_{X_i}(x; \theta) = \frac{1}{\sigma_i}\, \phi\left(\frac{x - \mu_i - \theta\sigma_i^2}{\sigma_i}\right).$$
Hence, the Esscher density reads
$$\mathcal{E}_\theta\{f_X\}(x) = \frac{p\, M_{X_1}(\theta)\, f_{X_1}(x; \theta) + (1 - p)\, M_{X_0}(\theta)\, f_{X_0}(x; \theta)}{M_X(\theta)}.$$
Collecting the component moment generating functions and the denominator (the mgf of the Gaussian mixture in $\theta$), we define the distorted probabilities
$$p_1(\theta) = p\, \frac{M_{X_1}(\theta)}{M_X(\theta)}, \qquad p_0(\theta) = (1 - p)\, \frac{M_{X_0}(\theta)}{M_X(\theta)},$$
which sum to one, giving the result.
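A minimal numerical sketch of Proposition 31.3 (the helper esscher_gm below is illustrative, not part of any package): it builds the distorted probability $p_1(\theta)$, shifts the component means by $\theta\sigma_i^2$, and checks that the transformed density still integrates to one.

# Esscher transform of a two-component Gaussian mixture (illustrative helper)
esscher_gm <- function(x, theta, mu, sd, p){
  # Component MGFs evaluated in theta
  M1 <- exp(mu[1]*theta + 0.5*theta^2*sd[1]^2)
  M0 <- exp(mu[2]*theta + 0.5*theta^2*sd[2]^2)
  # Mixture MGF and distorted probability
  M  <- p*M1 + (1 - p)*M0
  p1 <- p*M1/M
  # Transformed mixture density: component means shifted by theta*sigma^2
  p1*dnorm(x, mu[1] + theta*sd[1]^2, sd[1]) +
    (1 - p1)*dnorm(x, mu[2] + theta*sd[2]^2, sd[2])
}
# The transformed density integrates to one
integrate(esscher_gm, -Inf, Inf, theta = 0.5, mu = c(-2, 2), sd = c(1, 1), p = 0.5)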

(a) Esscher densities for different values of $\theta \in [-1, 1]$, with $\theta = 0$ in red.
(b) Distorted mixture probability for different values of $\theta \in [-1, 1]$, with $\theta = 0$ in red.
(c) Distorted mixture means for different values of $\theta \in [-1, 1]$, with $\theta = 0$ in red.
Figure 31.2: Esscher transform of a Gaussian mixture.

31.4 Moments

Proposition 31.4 The expectation of a Gaussian mixture random variable (31.1) reads:
$$E\{X\} = p\,\mu_1 + (1 - p)\,\mu_2, \tag{31.5}$$
and the second moment:
$$E\{X^2\} = p\,(\mu_1^2 + \sigma_1^2) + (1 - p)\,(\mu_2^2 + \sigma_2^2). \tag{31.6}$$
Hence, the variance reads:
$$V\{X\} = p(1 - p)(\mu_1 - \mu_2)^2 + \sigma_1^2\, p + \sigma_2^2\,(1 - p). \tag{31.7}$$

Proof. Given that $X_1$, $X_0$ and $B$ are independent, the expectation is computed as:
$$\begin{aligned}
E\{X\} &= E\{E\{X \mid B\}\} \\
&= E\{X \mid B = 1\}\, P(B = 1) + E\{X \mid B = 0\}\, P(B = 0) \\
&= p\, E\{X_1\} + (1 - p)\, E\{X_0\} \\
&= p\,\mu_1 + (1 - p)\,\mu_2.
\end{aligned}$$
The second moment is computed similarly to the first one, i.e.
$$\begin{aligned}
E\{X^2\} &= E\{E\{X^2 \mid B\}\} \\
&= E\{X^2 \mid B = 1\}\, P(B = 1) + E\{X^2 \mid B = 0\}\, P(B = 0) \\
&= E\{B\}\, E\{X_1^2\} + E\{1 - B\}\, E\{X_0^2\} \\
&= p\, E\{X_1^2\} + (1 - p)\, E\{X_0^2\} \\
&= p\,(\mu_1^2 + \sigma_1^2) + (1 - p)\,(\mu_2^2 + \sigma_2^2).
\end{aligned}$$
The variance, by definition, is given by:
$$V\{X\} = E\{X^2\} - E\{X\}^2,$$
where the squared first moment is
$$E\{X\}^2 = \left(p\,\mu_1 + (1 - p)\,\mu_2\right)^2 = p^2\mu_1^2 + (1 - p)^2\mu_2^2 + 2p(1 - p)\mu_1\mu_2. \tag{31.8}$$
Hence the variance,
$$\begin{aligned}
V\{X\} &= p\,(\mu_1^2 + \sigma_1^2) + (1 - p)\,(\mu_2^2 + \sigma_2^2) - p^2\mu_1^2 - (1 - p)^2\mu_2^2 - 2p(1 - p)\mu_1\mu_2 \\
&= p\,\mu_1^2 + p\,\sigma_1^2 + \mu_2^2 + \sigma_2^2 - p\,\mu_2^2 - p\,\sigma_2^2 - p^2\mu_1^2 - (1 - p)^2\mu_2^2 - 2p(1 - p)\mu_1\mu_2 \\
&= p(1 - p)\,\mu_1^2 + p\,\sigma_1^2 + (1 - p)\,\sigma_2^2 + p(1 - p)\,\mu_2^2 - 2p(1 - p)\,\mu_1\mu_2 \\
&= p(1 - p)\left(\mu_1^2 + \mu_2^2 - 2\mu_1\mu_2\right) + p\,\sigma_1^2 + (1 - p)\,\sigma_2^2 \\
&= p(1 - p)(\mu_1 - \mu_2)^2 + p\,\sigma_1^2 + (1 - p)\,\sigma_2^2.
\end{aligned}$$
Equivalently, with the law of total variance:
$$V\{X\} = V\{E\{X \mid B\}\} + E\{V\{X \mid B\}\},$$
where
$$E\{V\{X \mid B\}\} = E\{\sigma_1^2 B + \sigma_0^2 (1 - B)\} = \sigma_1^2\, p + \sigma_0^2\,(1 - p).$$
Then,
$$E\{X \mid B\} = \mu_1 B + \mu_0 (1 - B) = \mu_0 + (\mu_1 - \mu_0) B,$$
and therefore
$$V\{E\{X \mid B\}\} = V\{\mu_0 + (\mu_1 - \mu_0) B\} = (\mu_1 - \mu_0)^2\, V\{B\} = (\mu_1 - \mu_0)^2\, p(1 - p).$$
The total variance:
$$V\{X\} = V\{E\{X \mid B\}\} + E\{V\{X \mid B\}\} = (\mu_1 - \mu_0)^2\, p(1 - p) + \sigma_1^2\, p + \sigma_0^2\,(1 - p).$$
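As a quick check of Proposition 31.4 (a minimal sketch on simulated data), the theoretical mean and variance can be compared with their sample counterparts:

# Compare Equations (31.5) and (31.7) with sample moments
set.seed(1)
mu1 <- -2; mu2 <- 2; sd1 <- 1; sd2 <- 1; p <- 0.5
B <- rbinom(1e6, 1, p)
X <- B*rnorm(1e6, mu1, sd1) + (1 - B)*rnorm(1e6, mu2, sd2)
m_th <- p*mu1 + (1 - p)*mu2
v_th <- p*(1 - p)*(mu1 - mu2)^2 + p*sd1^2 + (1 - p)*sd2^2
c(mean_th = m_th, mean_emp = mean(X), var_th = v_th, var_emp = var(X))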

31.4.1 Special Cases

Proposition 31.5 If the random variable $X$ (31.1) is centered in zero, i.e. $E\{X\} = p\,\mu_1 + (1 - p)\,\mu_0 = 0$, then the following identity holds:
$$(\mu_1 - \mu_0)^2\, p(1 - p) = p\,\mu_1^2 + (1 - p)\,\mu_0^2.$$

Proof. Let's show that the two expressions
$$LHS = p(1 - p)(\mu_1 - \mu_0)^2, \qquad RHS = p\,\mu_1^2 + (1 - p)\,\mu_0^2$$
are equivalent under the constraint
$$E\{X\} = p\,\mu_1 + (1 - p)\,\mu_0 = 0.$$
First, note that if the mixture is centered the ratio between $\mu_1$ and $\mu_0$ is constant, i.e.
$$\mu_1 p + \mu_0 (1 - p) = 0 \iff \mu_1 = -\mu_0\, r, \qquad r = \frac{1 - p}{p}.$$
Let's now expand the LHS and substitute the relation between $\mu_1$ and $\mu_0$:
$$\begin{aligned}
LHS &= p(1 - p)(\mu_1 - \mu_0)^2 \\
&= p(1 - p)\,\mu_1^2 - 2p(1 - p)\,\mu_1\mu_0 + p(1 - p)\,\mu_0^2 \\
&= p(1 - p)\left(r^2\mu_0^2 + 2r\mu_0^2 + \mu_0^2\right) \\
&= p(1 - p)\,\mu_0^2\,(r^2 + 2r + 1) \\
&= p(1 - p)\,\mu_0^2\,(r + 1)^2.
\end{aligned}$$
Then, we note that
$$r + 1 = \frac{1 - p}{p} + 1 = \frac{1 - p + p}{p} = \frac{1}{p} \implies LHS = \mu_0^2\,\frac{1 - p}{p}.$$
Now, let's consider the RHS:
$$RHS = p\,\mu_1^2 + (1 - p)\,\mu_0^2 = p\, r^2\mu_0^2 + (1 - p)\,\mu_0^2 = \mu_0^2\left(p\, r^2 + 1 - p\right) = \mu_0^2\,\frac{1 - p}{p},$$
since
$$p\, r^2 + 1 - p = \frac{(1 - p)^2}{p} + (1 - p) = (1 - p)\,\frac{1 - p + p}{p} = \frac{1 - p}{p}.$$
Hence, the RHS and LHS are equal.
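A one-line numerical check of the identity (the parameter values are arbitrary; $\mu_1$ is set so that the mixture is centered):

# Centered mixture: mu1 chosen such that p*mu1 + (1-p)*mu0 = 0
p <- 0.3; mu0 <- 1.5; mu1 <- -mu0*(1 - p)/p
c(lhs = p*(1 - p)*(mu1 - mu0)^2, rhs = p*mu1^2 + (1 - p)*mu0^2)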

31.4.2 Central moments

Proposition 31.6 The second central moment of a Gaussian mixture reads:
$$\kappa_2\{X\} = (\delta_1^2 + \sigma_1^2)\, p + (\delta_0^2 + \sigma_0^2)\,(1 - p) = V\{X\},$$
where for $i \in \{0, 1\}$, $\delta_i = \mu_i - E\{X\}$.

Proof. Developing the squares:
$$\delta_1^2 = \mu_1^2 + E\{X\}^2 - 2\mu_1 E\{X\}, \qquad \delta_0^2 = \mu_0^2 + E\{X\}^2 - 2\mu_0 E\{X\},$$
and summing:
$$\begin{aligned}
\delta_1^2\, p + \delta_0^2\,(1 - p) &= \left(\mu_1^2 p + \mu_0^2 (1 - p)\right) + E\{X\}^2 - 2\left(\mu_1 p + \mu_0 (1 - p)\right) E\{X\} \\
&= \left(\mu_1^2 p + \mu_0^2 (1 - p)\right) - E\{X\}^2.
\end{aligned}$$
Thus, substituting this result into the initial expression one obtains
$$\kappa_2\{X\} = \mu_1^2 p + \mu_0^2 (1 - p) + \sigma_1^2 p + \sigma_0^2 (1 - p) - E\{X\}^2 = E\{X^2\} - E\{X\}^2 = V\{X\}.$$
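A quick numerical check of Proposition 31.6 against the variance formula (31.7), with arbitrary parameters:

# kappa_2 computed from the deltas equals the mixture variance
mu1 <- -2; mu0 <- 2; sd1 <- 1; sd0 <- 1.5; p <- 0.3
m  <- p*mu1 + (1 - p)*mu0
d1 <- mu1 - m; d0 <- mu0 - m
k2 <- (d1^2 + sd1^2)*p + (d0^2 + sd0^2)*(1 - p)
v  <- p*(1 - p)*(mu1 - mu0)^2 + p*sd1^2 + (1 - p)*sd0^2
all.equal(k2, v)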

31.5 Estimation

31.5.1 Maximum likelihood

Minimizing the negative log-likelihood gives an estimate of the parameters, i.e.
$$\underset{\mu_1, \mu_2, \sigma_1, \sigma_2, p}{\arg\min}\left\{-\sum_{i=1}^{t} \log f_X(x_i)\right\},$$
or, equivalently, maximizing the log-likelihood, i.e.
$$\underset{\mu_1, \mu_2, \sigma_1, \sigma_2, p}{\arg\max}\left\{\sum_{i=1}^{t} \log f_X(x_i)\right\}.$$

Maximum likelihood for Gaussian mixture
# Initialize parameters: perturb the true parameter vector 'par'
# (defined together with the simulated sample Xt used in Figure 31.1)
init_params <- par*runif(5, 0.3, 1.1)
# Log-likelihood function (to be maximized)
loss <- function(params, x){
  # Parameters
  mu1 = params[1]
  mu2 = params[2]
  sd1 = params[3]
  sd2 = params[4]
  p = params[5]
  # Ensure that the probability is in (0,1) and the std. deviations are positive
  if(p > 0.99 | p < 0.01 | sd1 < 0 | sd2 < 0){
    return(-1e10) # large penalty, since the function is maximized
  }
  # Mixture log-density (dmixnorm from extraDistr)
  loglik <- function(x) log(dmixnorm(x, c(mu1, mu2), c(sd1, sd2), c(p, 1-p)))
  # Log-likelihood
  sum(loglik(x), na.rm = TRUE)
}
# Optimal parameters
# fnscale = -1 to maximize (equivalently, minimize the negative log-likelihood)
ml_estimate <- optim(par = init_params, loss, 
                     x = Xt, control = list(maxit = 500000, fnscale = -1))
Table 31.1: Maximum likelihood estimates for a Gaussian Mixture.
Parameter True Estimate Log-lik Bias
μ1 -2.0 -1.9905803 -10332.56 0.0094197
μ2 2.0 2.0006381 -10332.56 0.0006381
σ1 1.0 1.0457653 -10332.56 0.0457653
σ2 1.0 1.0033373 -10332.56 0.0033373
p 0.5 0.4937459 -10332.56 -0.0062541

31.5.2 Moments matching

Let's fix the parameters of the first component, namely $\mu_1$ and $\sigma_1^2$, and a certain probability $p$. Then let's compute the sample estimate of the expectation of $X_t$, i.e.
$$\hat{E}\{X\} = \frac{1}{t}\sum_{i=1}^{t} x_i = \hat{\mu},$$
and the sample variance:
$$\hat{V}\{X\} = \frac{1}{t}\sum_{i=1}^{t} (x_i - \hat{\mu})^2 = \hat{\sigma}^2.$$
In order to obtain an estimate of the second component such that the Gaussian mixture moments match exactly the sample estimates, we solve the following system for $\mu_2$ and $\sigma_2^2$:
$$\begin{cases}
\hat{\mu} = p\,\mu_1 + (1 - p)\,\mu_2 \\
\hat{\sigma}^2 = p(1 - p)(\mu_1 - \mu_2)^2 + \sigma_1^2\, p + \sigma_2^2\,(1 - p)
\end{cases}$$
which leads to a unique solution, i.e.
$$\mu_2 = \frac{\hat{\mu} - p\,\mu_1}{1 - p}, \qquad \sigma_2^2 = \frac{\hat{\sigma}^2 - p\,\sigma_1^2}{1 - p} - p\,(\mu_1 - \mu_2)^2.$$
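A minimal sketch of this moment-matching step (the helper match_moments_gm is illustrative, not part of any package): given the fixed first component and probability, it returns the second-component parameters that reproduce the sample mean and variance.

# Moment matching for the second component, with (mu1, sd1, p) fixed
match_moments_gm <- function(x, mu1, sd1, p){
  mu_hat <- mean(x)
  s2_hat <- mean((x - mu_hat)^2)
  mu2  <- (mu_hat - p*mu1)/(1 - p)
  s2_2 <- (s2_hat - p*sd1^2)/(1 - p) - p*(mu1 - mu2)^2
  c(mu2 = mu2, sd2 = sqrt(s2_2))
}
# Example on a simulated mixture: estimates should be close to (2, 1)
set.seed(1)
B <- rbinom(1e5, 1, 0.5)
x <- B*rnorm(1e5, -2, 1) + (1 - B)*rnorm(1e5, 2, 1)
match_moments_gm(x, mu1 = -2, sd1 = 1, p = 0.5)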

31.5.3 EM

To classify an empirical series into two groups (Bernoulli $B_t = 0$ and $B_t = 1$), such that the empirical mean and variance of each group match the theoretical properties of the two underlying normal distributions, we can use an Expectation-Maximization (EM) algorithm. Table 31.2 summarizes the steps and formulas used in the EM routine.

Table 31.2: EM algorithm routine
Step Description
Initialization Initialize the responsibilities and the parameters.
1. E-step Calculate the responsibilities for each data point as:
$$\gamma_{i1} = \frac{p\, f(x_i \mid \mu_1, \sigma_1)}{p\, f(x_i \mid \mu_1, \sigma_1) + (1 - p)\, f(x_i \mid \mu_2, \sigma_2)}, \qquad \gamma_{i2} = \frac{(1 - p)\, f(x_i \mid \mu_2, \sigma_2)}{p\, f(x_i \mid \mu_1, \sigma_1) + (1 - p)\, f(x_i \mid \mu_2, \sigma_2)}.$$
Compute $n_1 = \sum_{i=1}^{n} \gamma_{i1}$ and $n_2 = \sum_{i=1}^{n} \gamma_{i2}$.
2. M-step Update the parameters using the calculated responsibilities:
Means: $\mu_1 = \frac{1}{n_1}\sum_{i=1}^{n} \gamma_{i1} x_i$, $\mu_2 = \frac{1}{n_2}\sum_{i=1}^{n} \gamma_{i2} x_i$.
Variances: $\sigma_1^2 = \frac{1}{n_1 - 1}\sum_{i=1}^{n} \gamma_{i1}(x_i - \mu_1)^2$, $\sigma_2^2 = \frac{1}{n_2 - 1}\sum_{i=1}^{n} \gamma_{i2}(x_i - \mu_2)^2$.
Bernoulli probability: $p = \frac{n_1}{n}$.
3. Log-likelihood Calculate the log-likelihood for the convergence check.
4. Check convergence Stop if the change in log-likelihood is below a threshold, otherwise go back to step 1.
Output Series of Bernoulli $B_t$ and the optimal parameters $\{\mu_1, \mu_2, \sigma_1, \sigma_2, p\}$.
Gaussian Mixture Estimation
# ============================== Inputs ==============================
n <- 5000
# Theoretical parameters
p <- 0.5  # Probability of Bernoulli
# Parameters for the normal distributions
mu1 <- 0  # Mean of the first normal distribution
sigma1 <- 1  # Standard deviation of the first normal distribution
mu2 <- 5  # Mean of the second normal distribution
sigma2 <- 2  # Standard deviation of the second normal distribution
true_params <- c(mu1, mu2, sigma1, sigma2, p)
# ======================================================================
# Generate a random sample 
set.seed(123)  
# Generate empirical data
B <- rbinom(n, size = 1, prob = p)
Z1 <- rnorm(n, mean = mu1, sd = sigma1)
Z2 <- rnorm(n, mean = mu2, sd = sigma2)
x <- B*Z1 + (1-B)*Z2
# ======================================================================
# Perturb the true parameters 
par <- init_params <- true_params*runif(length(true_params), min = 0.8, max = 1.2)
abstol <- 1e-6 # absolute threshold tolerance for convergence 
maxit <- 1000 # maximum iteration 
match_moments = FALSE # match empirical moments for second distribution
# ======================================================================
# Number of observations 
n_ <- length(x)
# Empirical expectation 
e_x_emp = mean(x)
# Empirical variance 
v_x_emp = var(x)
# Empirical std. deviation  
sd_x_emp = sqrt(v_x_emp)

# Initialization 
log_likelihood <- 0
previous_log_likelihood <- -Inf
responsibilities <- matrix(0, nrow = n, ncol = 2)
previous_params <- par
# EM Routine 
for (iteration in 1:maxit) {
  # 1. E-step: Calculate the responsibilities
  for (i in 1:n_) {
    responsibilities[i, 1] <- previous_params[5] * dnorm(x[i], previous_params[1], previous_params[3])
    responsibilities[i, 2] <- (1 - previous_params[5]) * dnorm(x[i], previous_params[2], previous_params[4])
  }
  # Normalize the probabilities by row
  responsibilities <- responsibilities/rowSums(responsibilities)
  
  # 2. M-step: Update the parameters
  n1 <- sum(responsibilities[, 1])
  n2 <- sum(responsibilities[, 2])
  # A) First component 
  ## Means 
  mu1 <- sum(responsibilities[, 1] * x) / n1 # means 
  ## Std. deviations 
  sigma1 <- sqrt(sum(responsibilities[, 1] * (x - mu1)^2)/(n1-1)) # std. deviation  
  # B) Second component  
  if (match_moments) {
    # Match moments approach for (mu2, sigma2)
    # Means 
    mu2 <- (e_x_emp - p * mu1) / (1 - p)
    # Std- deviations
    sigma2 <- sqrt((v_x_emp - p * sigma1^2) / (1 - p) - p * (mu1 - mu2)^2)
  } else {
    # Means
    mu2 <- sum(responsibilities[, 2] * x) / n2 
    # Std. deviations
    sigma2 <- sqrt(sum(responsibilities[, 2] * (x - mu2)^2) / (n2 - 1))  
  }
  # C) Bernoulli probability 
  p <- n1 / n_

  # 3. Calculate the log-likelihood of the mixture
  log_likelihood <- sum(log(p * dnorm(x, mu1, sigma1) + (1 - p) * dnorm(x, mu2, sigma2)))
  
  # 4. Check for convergence
  if (abs(log_likelihood - previous_log_likelihood) < abstol) {
    break
  }
  # Update log-likelihood
  previous_log_likelihood <- log_likelihood
  # Update parameters 
  previous_params <- c(mu1, mu2, sigma1, sigma2, p)
}
# ======================================================================
# Output 
# Mixture classification
B_hat <- ifelse(responsibilities[, 1] > responsibilities[, 2], 1, 0)
x1_hat <- x[B_hat == 1]
x2_hat <- x[B_hat == 0]
# Optimal parameters (from the last M-step update)
par <- c(mu1, mu2, sigma1, sigma2, p)
# Theoretical moments 
e_x_th <- par[1]*par[5] + par[2]*(1-par[5])
sd_x_th <- sqrt(par[5]*(1-par[5])*(par[1] - par[2])^2 + par[3]^2*par[5] + par[4]^2*(1-par[5]))
# Empirical parameters 
par_hat <- c(mean(x1_hat), mean(x2_hat), sd(x1_hat), sd(x2_hat), mean(B_hat))
# Empirical moments 
e_x_hat <- par_hat[1]*par_hat[5] + par_hat[2]*(1-par_hat[5])
sd_x_hat <- sqrt(par_hat[5]*(1-par_hat[5])*(par_hat[1] - par_hat[2])^2 + par_hat[3]^2*par_hat[5] + par_hat[4]^2*(1-par_hat[5]))
# ======================================================================
# Dataset containing classified series 
data = dplyr::tibble(B = B_hat, x = x, x1 = x*B, x2 = x*(1-B))
# Dataset containing initial, optimal, empirical and true parameters 
params = tibble(param = c("mu1","mu2","sd1","sd2","p"), 
                init = init_params,
                opt = par, 
                hat = par_hat,
                true = true_params)
# Dataset containing empirical and theoretical moments series 
moments = tibble(statistic = c("Mean", "Std. Dev"), 
                 emp = c(e_x_emp, sd_x_emp), 
                 opt = c(e_x_th, sd_x_th), 
                 hat = c(e_x_hat, sd_x_hat))
Table 31.3: Estimated Gaussian Mixture moments with EM
statistic emp opt hat
Mean 2.516087 2.516087 2.516087
Std. Dev 2.958034 2.957904 2.957879
Figure 31.3: Classified simulated series with EM

31.5.4 Matrix moments matching

Proposition 31.7 Any finite K-component Gaussian mixture with finite moments admits a more parsimonious moment-matching approximation with a two-component Gaussian mixture. We start with the parameters of the mixture at time $t+h$ and adjust them to match the first three moments of the multinomial mixture. The procedure ensures that the resulting distribution has
$$E\{U_{t+h} \mid \mathcal{F}_t\} = M(t,0,h), \qquad V\{U_{t+h} \mid \mathcal{F}_t\} = S(t,0,h), \qquad E\{U_{t+h}^3 \mid \mathcal{F}_t\} = \Omega(t,0,h),$$
where
$$\Omega(t,0,h) = E\{U_{t+h}^3 \mid \mathcal{F}_t\} = \sum_{j=0}^{h-1} E\{\psi_{t+h-j}^3 \mid \mathcal{F}_t\}\, E\{u_{t+h-j}^3\}.$$
In general, the variance and the skewness do not converge to a constant for all $t$: they depend on the skewness from the starting point at $t+1$ up to the ending point at $t+h$, and need to be recomputed each time. The means $\mu_{1,t+h}$ and $\mu_{0,t+h}$ of the resulting mixture are fixed as in Equation 31.11 below, while the variances are adjusted to:
$$\sigma_{1,t+h}^2 = \frac{3(1 - p_{t+h})\mu_{0,t+h}\left(\Sigma_{t+h} - p_{t+h}\mu_{1,t+h}^2 - (1 - p_{t+h})\mu_{0,t+h}^2\right) - (1 - p_{t+h})\left(\Omega_{t+h} - p_{t+h}\mu_{1,t+h}^3 - (1 - p_{t+h})\mu_{0,t+h}^3\right)}{3 p_{t+h}(1 - p_{t+h})\left(\mu_{0,t+h} - \mu_{1,t+h}\right)} \tag{31.9}$$
and
$$\sigma_{0,t+h}^2 = \frac{p_{t+h}\left(\Omega_{t+h} - p_{t+h}\mu_{1,t+h}^3 - (1 - p_{t+h})\mu_{0,t+h}^3\right) - 3 p_{t+h}\mu_{1,t+h}\left(\Sigma_{t+h} - p_{t+h}\mu_{1,t+h}^2 - (1 - p_{t+h})\mu_{0,t+h}^2\right)}{3 p_{t+h}(1 - p_{t+h})\left(\mu_{0,t+h} - \mu_{1,t+h}\right)}. \tag{31.10}$$

Proof. Let's consider the following approach. We start by fixing the mixture probabilities at time $t+h$, i.e. $p_{t+h}$. Then, we consider the means and variances of the mixture at time $t+h$ free to vary, and we adjust them to match the first three central moments of the true multinomial mixture with a two-component Gaussian mixture. More precisely, let's define:
$$\mu_{i,t+h} = \mu_{Y_i}(t,h), \qquad \sigma_{i,t+h}^2 = \sigma_{Y_i}^2(t,h). \tag{31.11}$$
Recalling the central moments as in ?eq-proof-ut-moments, with the parameters defined as in (31.11) the resulting expected value and variance already match the exact expectation and variance of the multinomial mixture. To improve the match between the two distributions, one can also make the third central moment explicit, i.e.
$$\omega_{t+h} = \left(3\sigma_{1,t+h}^2\mu_{1,t+h} + \mu_{1,t+h}^3\right) p_{t+h} + \left(3\sigma_{0,t+h}^2\mu_{0,t+h} + \mu_{0,t+h}^3\right)(1 - p_{t+h}).$$
In this way, one can set up a system to adjust the variances $\sigma_{i,t+h}^2$ such that the second and third central moments of the two-component Gaussian mixture match the ones of the multinomial mixture. More precisely, the third central moment of $y_{t+h}$ reads:
$$\kappa_3\{y_{t+h} \mid \mathcal{F}_t\} = \sum_{j=s}^{h-1} E\{\psi_{t+h-j}^3 \mid \mathcal{F}_t\}\, E\{u_{t+h-j}^3\}.$$
Let's denote the target moments as:
$$\Sigma_{t+h} = \kappa_2\{y_{t+h} \mid \mathcal{F}_t\}, \qquad \Omega_{t+h} = \kappa_3\{y_{t+h} \mid \mathcal{F}_t\},$$
and let's represent the system in matrix form:
$$\underbrace{\begin{pmatrix} p_{t+h} & 1 - p_{t+h} \\ 3 p_{t+h}\mu_{1,t+h} & 3(1 - p_{t+h})\mu_{0,t+h} \end{pmatrix}}_{D}
\underbrace{\begin{pmatrix} \sigma_{1,t+h}^2 \\ \sigma_{0,t+h}^2 \end{pmatrix}}_{\Sigma}
=
\underbrace{\begin{pmatrix} \Sigma_{t+h} - p_{t+h}\mu_{1,t+h}^2 - (1 - p_{t+h})\mu_{0,t+h}^2 \\ \Omega_{t+h} - p_{t+h}\mu_{1,t+h}^3 - (1 - p_{t+h})\mu_{0,t+h}^3 \end{pmatrix}}_{G}.$$
The solution of the system has the form $D\Sigma = G \iff \Sigma = D^{-1}G$. The determinant of the matrix $D$ is different from zero only if $\mu_{1,t+h} \neq \mu_{0,t+h}$, i.e.
$$\det(D) = 3 p_{t+h}(1 - p_{t+h})\left(\mu_{0,t+h} - \mu_{1,t+h}\right).$$
By applying Cramer's rule the system can be solved explicitly for $i \in \{0, 1\}$, i.e.
$$\sigma_{i,t+h}^2 = \frac{\det(D_i)}{\det(D)}, \tag{31.12}$$
where $D_1$ is obtained by replacing the first column of $D$ with $G$, i.e.
$$D_1 = \begin{pmatrix} \Sigma_{t+h} - p_{t+h}\mu_{1,t+h}^2 - (1 - p_{t+h})\mu_{0,t+h}^2 & 1 - p_{t+h} \\ \Omega_{t+h} - p_{t+h}\mu_{1,t+h}^3 - (1 - p_{t+h})\mu_{0,t+h}^3 & 3(1 - p_{t+h})\mu_{0,t+h} \end{pmatrix}.$$
Then:
$$\det(D_1) = 3(1 - p_{t+h})\mu_{0,t+h}\left(\Sigma_{t+h} - p_{t+h}\mu_{1,t+h}^2 - (1 - p_{t+h})\mu_{0,t+h}^2\right) - (1 - p_{t+h})\left(\Omega_{t+h} - p_{t+h}\mu_{1,t+h}^3 - (1 - p_{t+h})\mu_{0,t+h}^3\right).$$
Similarly, $D_0$ is obtained by replacing the second column of $D$ with $G$, i.e.
$$D_0 = \begin{pmatrix} p_{t+h} & \Sigma_{t+h} - p_{t+h}\mu_{1,t+h}^2 - (1 - p_{t+h})\mu_{0,t+h}^2 \\ 3 p_{t+h}\mu_{1,t+h} & \Omega_{t+h} - p_{t+h}\mu_{1,t+h}^3 - (1 - p_{t+h})\mu_{0,t+h}^3 \end{pmatrix}.$$
Then:
$$\det(D_0) = p_{t+h}\left(\Omega_{t+h} - p_{t+h}\mu_{1,t+h}^3 - (1 - p_{t+h})\mu_{0,t+h}^3\right) - 3 p_{t+h}\mu_{1,t+h}\left(\Sigma_{t+h} - p_{t+h}\mu_{1,t+h}^2 - (1 - p_{t+h})\mu_{0,t+h}^2\right).$$
Substituting into (31.12) and simplifying, one obtains the explicit solutions of the system.
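A minimal sketch of the variance-matching step in Equations (31.9) and (31.10) (the helper match_variances_gm is illustrative; for the check, the targets Sigma and Omega are taken as the raw second and third moments of a known two-component mixture, so the true variances are recovered exactly):

# Solve the 2x2 system D %*% c(s1sq, s0sq) = G by Cramer's rule
match_variances_gm <- function(Sigma, Omega, mu1, mu0, p){
  stopifnot(mu1 != mu0)  # det(D) = 0 when the component means coincide
  G1 <- Sigma - p*mu1^2 - (1 - p)*mu0^2
  G2 <- Omega - p*mu1^3 - (1 - p)*mu0^3
  detD <- 3*p*(1 - p)*(mu0 - mu1)
  s1sq <- (3*(1 - p)*mu0*G1 - (1 - p)*G2)/detD  # Equation (31.9)
  s0sq <- (p*G2 - 3*p*mu1*G1)/detD              # Equation (31.10)
  c(sigma1_sq = s1sq, sigma0_sq = s0sq)
}
# Target moments of a known two-component mixture
mu1 <- -2; mu0 <- 2; s1 <- 1; s0 <- 1.5; p <- 0.4
Sigma <- p*(mu1^2 + s1^2) + (1 - p)*(mu0^2 + s0^2)              # second moment
Omega <- p*(mu1^3 + 3*mu1*s1^2) + (1 - p)*(mu0^3 + 3*mu0*s0^2)  # third moment
match_variances_gm(Sigma, Omega, mu1, mu0, p)  # recovers (1, 2.25)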