31  Gaussian mixture

Setup
library(dplyr)
# required for figures 
library(ggplot2)
library(gridExtra)
# required to render LaTeX 
library(backports)
library(latex2exp)
# required to render tables
library(knitr)
library(kableExtra)
# mixture density and distribution (dmixnorm, pmixnorm) used below
library(extraDistr)
library(solarr)
# Random seed 
set.seed(1)

Let's consider a linear combination of a Bernoulli and two normal random variables, all assumed to be independent, i.e.
$$X_t = B_t X_{1,t} + (1 - B_t) X_{0,t}, \tag{31.1}$$
where $B_t$ is a Bernoulli random variable, $B_t \sim \mathcal{B}(p)$, and for the $i$-th component $X_{i,t} = \mu_i + \sigma_i Z_{i,t}$, where $Z_{i,t}$ is a standard Normal random variable. In compact form, a Gaussian mixture with two components is denoted as $X_t \sim \mathcal{GM}(\mu_1, \mu_0, \sigma_1^2, \sigma_0^2, p)$.
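The simulation chunk behind Figure 31.1 is not shown here; a minimal sketch of Equation (31.1) with the same parameter values could look as follows (the names par and Xt are illustrative, matching the objects reused later in the maximum likelihood example):

# True parameters (as in Figure 31.1)
p <- 0.5; mu1 <- -2; mu0 <- 2; sd1 <- 1; sd0 <- 1
par <- c(mu1, mu0, sd1, sd0, p)
# Simulate n draws from the two-component mixture (Equation 31.1)
n <- 5000
B  <- rbinom(n, size = 1, prob = p)   # Bernoulli switch
X1 <- rnorm(n, mean = mu1, sd = sd1)  # first component
X0 <- rnorm(n, mean = mu0, sd = sd0)  # second component
Xt <- B*X1 + (1 - B)*X0               # Gaussian mixture sample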

(a) Simulated mixture sample.
(b) Simulated (black) and true (red) density.
(c) Simulated (black) and true (red) distribution.
Figure 31.1: Gaussian mixture simulation and density function with true parameters $\mu_1 = -2$, $\mu_2 = 2$, $\sigma_1 = 1$, $\sigma_2 = 1$ and $p = 0.5$.

31.1 Distribution and density

Proposition 31.1 The distribution function of a Gaussian mixture random variable is a weighted sum of the distributions of the components, i.e.:
$$F_X(x) = p\, F_{X_1}(x) + (1 - p)\, F_{X_0}(x). \tag{31.2}$$
Taking the derivative, it can easily be shown that the density function reads:
$$f_X(x) = p\, f_{X_1}(x) + (1 - p)\, f_{X_0}(x). \tag{31.3}$$
In general,
$$f_X(x) = \frac{p}{\sigma_1}\, \phi\left(\frac{x - \mu_1}{\sigma_1}\right) + \frac{1 - p}{\sigma_0}\, \phi\left(\frac{x - \mu_0}{\sigma_0}\right), \tag{31.4}$$
where $\Phi$ is the cumulative distribution function and $\phi$ the density function of a standard normal random variable. An implementation of the density and distribution of a Gaussian mixture is available in the R package extraDistr: dmixnorm for the density and pmixnorm for the distribution.

Proof. From the formal definition of the distribution function of a random variable $X$,
$$F_X(x) = P(X \le x) = E\{\mathbb{1}_{\{X \le x\}}\}.$$
If $X$ is a Gaussian mixture with two components, we can apply the law of total expectation and condition on $B$, i.e.
$$\begin{aligned}
F_X(x) &= E\{E\{\mathbb{1}_{\{X \le x\}} \mid B\}\} \\
&= E\{\mathbb{1}_{\{X \le x\}} \mid B = 1\}\, P(B = 1) + E\{\mathbb{1}_{\{X \le x\}} \mid B = 0\}\, P(B = 0) \\
&= p\, P(X_1 \le x) + (1 - p)\, P(X_0 \le x).
\end{aligned}$$
Hence, standardizing the Normal random variables one obtains
$$F_X(x) = p\, \Phi\left(\frac{x - \mu_1}{\sigma_1}\right) + (1 - p)\, \Phi\left(\frac{x - \mu_0}{\sigma_0}\right),$$
where $\Phi$ denotes the distribution function of a standard normal. Knowing that $f_X(x) = \frac{d F_X(x)}{dx}$ and that $\phi(x) = \frac{d \Phi(x)}{dx}$, where $\phi$ is the density function of a standard normal, we obtain the result, i.e.
$$f_X(x) = \frac{p}{\sigma_1}\, \phi\left(\frac{x - \mu_1}{\sigma_1}\right) + \frac{1 - p}{\sigma_0}\, \phi\left(\frac{x - \mu_0}{\sigma_0}\right).$$
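As a quick sanity check of Equations (31.2) and (31.4), the weighted sums above can be compared with dmixnorm and pmixnorm from extraDistr (a minimal sketch; parameter values as in Figure 31.1):

# Mixture parameters
mu <- c(-2, 2); sd <- c(1, 1); p <- 0.5
x <- 0.3
# Density: weighted sum of the component densities (Equation 31.4)
f_manual <- p*dnorm(x, mu[1], sd[1]) + (1 - p)*dnorm(x, mu[2], sd[2])
f_pkg    <- dmixnorm(x, mean = mu, sd = sd, alpha = c(p, 1 - p))
# Distribution: weighted sum of the component distributions (Equation 31.2)
F_manual <- p*pnorm(x, mu[1], sd[1]) + (1 - p)*pnorm(x, mu[2], sd[2])
F_pkg    <- pmixnorm(x, mean = mu, sd = sd, alpha = c(p, 1 - p))
c(density = all.equal(f_manual, f_pkg), distribution = all.equal(F_manual, F_pkg))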

31.2 Moment generating function

Proposition 31.2 The moment generating function of a Gaussian mixture random variable (31.1) in $u$ reads:
$$M_X(u) = p\, M_{X_1}(u) + (1 - p)\, M_{X_0}(u),$$
where for a generic $i \in \{0, 1\}$, $M_{X_i}(u)$ is the moment generating function of a Gaussian random variable with mean $\mu_i$ and variance $\sigma_i^2$, i.e.
$$M_{X_i}(u) = \exp\left\{\mu_i u + \frac{u^2 \sigma_i^2}{2}\right\}.$$

Proof. Applying the definition of moment generating function, and noting that since $B \in \{0, 1\}$ we have $e^{uX} = B\, e^{u X_1} + (1 - B)\, e^{u X_0}$, independence and the linearity of the expectation give:
$$\begin{aligned}
M_X(u) = E\{e^{uX}\} &= p\, E\{e^{u X_1}\} + (1 - p)\, E\{e^{u X_0}\} \\
&= p\, M_{X_1}(u) + (1 - p)\, M_{X_0}(u).
\end{aligned}$$
Hence, the moment generating function of $X$ is a linear combination of the moment generating functions of the two components.
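A minimal Monte Carlo sketch of Proposition 31.2: the closed-form mixture MGF is compared with the sample average of $e^{uX}$ (parameter values as in Figure 31.1, $u$ chosen arbitrarily):

# Monte Carlo check of the mixture MGF
set.seed(1)
mu <- c(-2, 2); sd <- c(1, 1); p <- 0.5; u <- 0.4
B <- rbinom(1e5, 1, p)
X <- B*rnorm(1e5, mu[1], sd[1]) + (1 - B)*rnorm(1e5, mu[2], sd[2])
# Closed form: weighted sum of the Gaussian MGFs
M_theory <- p*exp(mu[1]*u + 0.5*u^2*sd[1]^2) + (1 - p)*exp(mu[2]*u + 0.5*u^2*sd[2]^2)
# Simulated estimate of E{exp(uX)}
M_mc <- mean(exp(u*X))
c(theory = M_theory, monte_carlo = M_mc)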

31.3 Esscher transform

Proposition 31.3 The Esscher transform of a Gaussian mixture random variable reads:
$$\mathcal{E}_\theta\{f_X\}(x) = p_1(\theta)\, f_{X_1}(x; \theta) + p_0(\theta)\, f_{X_0}(x; \theta),$$
where for $i \in \{0, 1\}$:
$$f_{X_i}(x; \theta) = \frac{1}{\sigma_i}\, \phi\left(\frac{x - \mu_i - \theta\sigma_i^2}{\sigma_i}\right),$$
and the distorted probabilities are defined as:
$$p_1(\theta) = p\, \frac{M_{X_1}(\theta)}{M_X(\theta)}, \qquad p_0(\theta) = 1 - p_1(\theta).$$

Proof. In general, the Esscher transform of a density function $f_X$ is defined as:
$$\mathcal{E}_\theta\{f_X\}(x) = \frac{e^{\theta x} f_X(x)}{M_X(\theta)} = \frac{e^{\theta x} f_X(x)}{\int_{-\infty}^{+\infty} e^{\theta y} f_X(y)\, dy}.$$
Substituting the density function of a Gaussian mixture, one obtains:
$$\mathcal{E}_\theta\{f_X\}(x) = \frac{e^{\theta x}\left(p\, f_{X_1}(x) + (1 - p)\, f_{X_0}(x)\right)}{M_X(\theta)}.$$
Let's focus on the numerator and consider the $i$-th component for $i \in \{0, 1\}$, writing its density explicitly, i.e.
$$e^{\theta x} f_{X_i}(x) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left\{-\frac{(x - \mu_i)^2}{2\sigma_i^2} + \theta x\right\}.$$
Let's expand the exponent of the exponential term for the $i$-th component, i.e.
$$\theta x - \frac{(x - \mu)^2}{2\sigma^2} = \theta x - \frac{x^2 - 2\mu x + \mu^2}{2\sigma^2} = -\frac{x^2}{2\sigma^2} + \left(\theta + \frac{\mu}{\sigma^2}\right)x - \frac{\mu^2}{2\sigma^2}.$$
Let's denote $A = \theta + \frac{\mu}{\sigma^2}$ and complete the square:
$$-\frac{x^2}{2\sigma^2} + A x = -\frac{1}{2\sigma^2}\left[x^2 - 2\sigma^2 A x + A^2\sigma^4\right] + \frac{A^2\sigma^2}{2} = -\frac{(x - A\sigma^2)^2}{2\sigma^2} + \frac{A^2\sigma^2}{2}.$$
Therefore,
$$\begin{aligned}
\theta x - \frac{(x - \mu_i)^2}{2\sigma_i^2} &= -\frac{(x - \mu_i - \theta\sigma_i^2)^2}{2\sigma_i^2} - \frac{\mu_i^2}{2\sigma_i^2} + \frac{(\theta\sigma_i^2 + \mu_i)^2}{2\sigma_i^2} \\
&= -\frac{(x - \mu_i - \theta\sigma_i^2)^2}{2\sigma_i^2} - \frac{\mu_i^2}{2\sigma_i^2} + \frac{\theta^2\sigma_i^2}{2} + \frac{\mu_i^2}{2\sigma_i^2} + \theta\mu_i \\
&= -\frac{(x - \mu_i - \theta\sigma_i^2)^2}{2\sigma_i^2} + \frac{\theta^2\sigma_i^2}{2} + \theta\mu_i.
\end{aligned}$$
Hence, adding and subtracting inside the exponential one obtains
$$\begin{aligned}
e^{\theta x} f_{X_i}(x) &= \frac{1}{\sqrt{2\pi}\sigma_i} \exp\left\{-\frac{(x - \mu_i)^2}{2\sigma_i^2} + \theta x - \mu_i\theta - \frac{\theta^2\sigma_i^2}{2}\right\} \exp\left\{\mu_i\theta + \frac{\theta^2\sigma_i^2}{2}\right\} \\
&= \exp\left\{\mu_i\theta + \frac{\theta^2\sigma_i^2}{2}\right\} \frac{1}{\sqrt{2\pi}\sigma_i} \exp\left\{-\frac{(x - \mu_i - \theta\sigma_i^2)^2}{2\sigma_i^2}\right\} \\
&= M_{X_i}(\theta)\, f_{X_i}(x; \theta),
\end{aligned}$$
where
$$f_{X_i}(x; \theta) = \frac{1}{\sigma_i}\, \phi\left(\frac{x - \mu_i - \theta\sigma_i^2}{\sigma_i}\right).$$
Hence, the Esscher density reads
$$\mathcal{E}_\theta\{f_X\}(x) = \frac{p\, M_{X_1}(\theta)\, f_{X_1}(x; \theta) + (1 - p)\, M_{X_0}(\theta)\, f_{X_0}(x; \theta)}{M_X(\theta)}.$$
Collecting the component moment generating functions and the denominator (the mgf of the Gaussian mixture in $\theta$), we define the distorted probabilities
$$p_1(\theta) = p\, \frac{M_{X_1}(\theta)}{M_X(\theta)}, \qquad p_0(\theta) = (1 - p)\, \frac{M_{X_0}(\theta)}{M_X(\theta)},$$
which sum to one, giving the result.
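A minimal numerical sketch of Proposition 31.3 (the helper esscher_gm below is illustrative, not part of any package): it builds the distorted probability $p_1(\theta)$, shifts the component means by $\theta\sigma_i^2$, and checks that the transformed density still integrates to one.

# Esscher transform of a two-component Gaussian mixture (illustrative helper)
esscher_gm <- function(x, theta, mu, sd, p){
  # Component MGFs evaluated in theta
  M1 <- exp(mu[1]*theta + 0.5*theta^2*sd[1]^2)
  M0 <- exp(mu[2]*theta + 0.5*theta^2*sd[2]^2)
  # Mixture MGF and distorted probability
  M  <- p*M1 + (1 - p)*M0
  p1 <- p*M1/M
  # Transformed mixture density: component means shifted by theta*sigma^2
  p1*dnorm(x, mu[1] + theta*sd[1]^2, sd[1]) +
    (1 - p1)*dnorm(x, mu[2] + theta*sd[2]^2, sd[2])
}
# The transformed density integrates to one
integrate(esscher_gm, -Inf, Inf, theta = 0.5, mu = c(-2, 2), sd = c(1, 1), p = 0.5)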

(a) Esscher densities for different values of $\theta \in [-1, 1]$, with $\theta = 0$ in red.
(b) Distorted mixture probability for different values of $\theta \in [-1, 1]$, with $\theta = 0$ in red.
(c) Distorted mixture means for different values of $\theta \in [-1, 1]$, with $\theta = 0$ in red.
Figure 31.2: Esscher transform of a Gaussian mixture.

31.4 Moments

Proposition 31.4 The expectation of a Gaussian mixture random variable (31.1) reads:
$$E\{X\} = p\,\mu_1 + (1 - p)\,\mu_2, \tag{31.5}$$
and the second moment:
$$E\{X^2\} = p\,(\mu_1^2 + \sigma_1^2) + (1 - p)\,(\mu_2^2 + \sigma_2^2). \tag{31.6}$$
Hence, the variance reads:
$$V\{X\} = p(1 - p)(\mu_1 - \mu_2)^2 + \sigma_1^2\, p + \sigma_2^2\,(1 - p). \tag{31.7}$$

Proof. Given that $X_1$, $X_0$ and $B$ are independent, the expectation is computed as:
$$\begin{aligned}
E\{X\} &= E\{E\{X \mid B\}\} \\
&= E\{X \mid B = 1\}\, P(B = 1) + E\{X \mid B = 0\}\, P(B = 0) \\
&= p\, E\{X_1\} + (1 - p)\, E\{X_0\} \\
&= p\,\mu_1 + (1 - p)\,\mu_2.
\end{aligned}$$
The second moment is computed similarly to the first one, i.e.
$$\begin{aligned}
E\{X^2\} &= E\{E\{X^2 \mid B\}\} \\
&= E\{X^2 \mid B = 1\}\, P(B = 1) + E\{X^2 \mid B = 0\}\, P(B = 0) \\
&= E\{B\}\, E\{X_1^2\} + E\{1 - B\}\, E\{X_0^2\} \\
&= p\, E\{X_1^2\} + (1 - p)\, E\{X_0^2\} \\
&= p\,(\mu_1^2 + \sigma_1^2) + (1 - p)\,(\mu_2^2 + \sigma_2^2).
\end{aligned}$$
The variance, by definition, is given by:
$$V\{X\} = E\{X^2\} - E\{X\}^2,$$
where the squared first moment is
$$E\{X\}^2 = \left(p\,\mu_1 + (1 - p)\,\mu_2\right)^2 = p^2\mu_1^2 + (1 - p)^2\mu_2^2 + 2p(1 - p)\mu_1\mu_2. \tag{31.8}$$
Hence the variance,
$$\begin{aligned}
V\{X\} &= p\,(\mu_1^2 + \sigma_1^2) + (1 - p)\,(\mu_2^2 + \sigma_2^2) - p^2\mu_1^2 - (1 - p)^2\mu_2^2 - 2p(1 - p)\mu_1\mu_2 \\
&= p\,\mu_1^2 + p\,\sigma_1^2 + \mu_2^2 + \sigma_2^2 - p\,\mu_2^2 - p\,\sigma_2^2 - p^2\mu_1^2 - (1 - p)^2\mu_2^2 - 2p(1 - p)\mu_1\mu_2 \\
&= p(1 - p)\,\mu_1^2 + p\,\sigma_1^2 + (1 - p)\,\sigma_2^2 + p(1 - p)\,\mu_2^2 - 2p(1 - p)\,\mu_1\mu_2 \\
&= p(1 - p)\left(\mu_1^2 + \mu_2^2 - 2\mu_1\mu_2\right) + p\,\sigma_1^2 + (1 - p)\,\sigma_2^2 \\
&= p(1 - p)(\mu_1 - \mu_2)^2 + p\,\sigma_1^2 + (1 - p)\,\sigma_2^2.
\end{aligned}$$
Equivalently, with the law of total variance:
$$V\{X\} = V\{E\{X \mid B\}\} + E\{V\{X \mid B\}\},$$
where
$$E\{V\{X \mid B\}\} = E\{\sigma_1^2 B + \sigma_0^2 (1 - B)\} = \sigma_1^2\, p + \sigma_0^2\,(1 - p).$$
Then,
$$E\{X \mid B\} = \mu_1 B + \mu_0 (1 - B) = \mu_0 + (\mu_1 - \mu_0) B,$$
and therefore
$$V\{E\{X \mid B\}\} = V\{\mu_0 + (\mu_1 - \mu_0) B\} = (\mu_1 - \mu_0)^2\, V\{B\} = (\mu_1 - \mu_0)^2\, p(1 - p).$$
The total variance:
$$V\{X\} = V\{E\{X \mid B\}\} + E\{V\{X \mid B\}\} = (\mu_1 - \mu_0)^2\, p(1 - p) + \sigma_1^2\, p + \sigma_0^2\,(1 - p).$$
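As a quick check of Proposition 31.4 (a minimal sketch on simulated data), the theoretical mean and variance can be compared with their sample counterparts:

# Compare Equations (31.5) and (31.7) with sample moments
set.seed(1)
mu1 <- -2; mu2 <- 2; sd1 <- 1; sd2 <- 1; p <- 0.5
B <- rbinom(1e6, 1, p)
X <- B*rnorm(1e6, mu1, sd1) + (1 - B)*rnorm(1e6, mu2, sd2)
m_th <- p*mu1 + (1 - p)*mu2
v_th <- p*(1 - p)*(mu1 - mu2)^2 + p*sd1^2 + (1 - p)*sd2^2
c(mean_th = m_th, mean_emp = mean(X), var_th = v_th, var_emp = var(X))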

31.4.1 Special Cases

Proposition 31.5 If the random variable $X$ (31.1) is centered in zero, i.e. $E\{X\} = p\,\mu_1 + (1 - p)\,\mu_0 = 0$, then the following identity holds:
$$(\mu_1 - \mu_0)^2\, p(1 - p) = p\,\mu_1^2 + (1 - p)\,\mu_0^2.$$

Proof. Let's show that the two expressions
$$LHS = p(1 - p)(\mu_1 - \mu_0)^2, \qquad RHS = p\,\mu_1^2 + (1 - p)\,\mu_0^2$$
are equivalent under the constraint
$$E\{X\} = p\,\mu_1 + (1 - p)\,\mu_0 = 0.$$
First, note that if the mixture is centered the ratio between $\mu_1$ and $\mu_0$ is constant, i.e.
$$\mu_1 p + \mu_0 (1 - p) = 0 \iff \mu_1 = -\mu_0\, r, \qquad r = \frac{1 - p}{p}.$$
Let's now expand the LHS and substitute the relation between $\mu_1$ and $\mu_0$:
$$\begin{aligned}
LHS &= p(1 - p)(\mu_1 - \mu_0)^2 \\
&= p(1 - p)\,\mu_1^2 - 2p(1 - p)\,\mu_1\mu_0 + p(1 - p)\,\mu_0^2 \\
&= p(1 - p)\left(r^2\mu_0^2 + 2r\mu_0^2 + \mu_0^2\right) \\
&= p(1 - p)\,\mu_0^2\,(r^2 + 2r + 1) \\
&= p(1 - p)\,\mu_0^2\,(r + 1)^2.
\end{aligned}$$
Then, we note that
$$r + 1 = \frac{1 - p}{p} + 1 = \frac{1 - p + p}{p} = \frac{1}{p} \implies LHS = \mu_0^2\,\frac{1 - p}{p}.$$
Now, let's consider the RHS:
$$RHS = p\,\mu_1^2 + (1 - p)\,\mu_0^2 = p\, r^2\mu_0^2 + (1 - p)\,\mu_0^2 = \mu_0^2\left(p\, r^2 + 1 - p\right) = \mu_0^2\,\frac{1 - p}{p},$$
since
$$p\, r^2 + 1 - p = \frac{(1 - p)^2}{p} + (1 - p) = (1 - p)\,\frac{1 - p + p}{p} = \frac{1 - p}{p}.$$
Hence, the RHS and LHS are equal.
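A one-line numerical check of the identity (the parameter values are arbitrary; $\mu_1$ is set so that the mixture is centered):

# Centered mixture: mu1 chosen such that p*mu1 + (1-p)*mu0 = 0
p <- 0.3; mu0 <- 1.5; mu1 <- -mu0*(1 - p)/p
c(lhs = p*(1 - p)*(mu1 - mu0)^2, rhs = p*mu1^2 + (1 - p)*mu0^2)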

31.4.2 Central moments

Proposition 31.6 The second central moment of a Gaussian mixture reads:
$$\kappa_2\{X\} = (\delta_1^2 + \sigma_1^2)\, p + (\delta_0^2 + \sigma_0^2)\,(1 - p) = V\{X\},$$
where for $i \in \{0, 1\}$, $\delta_i = \mu_i - E\{X\}$.

Proof. Developing the squares:
$$\delta_1^2 = \mu_1^2 + E\{X\}^2 - 2\mu_1 E\{X\}, \qquad \delta_0^2 = \mu_0^2 + E\{X\}^2 - 2\mu_0 E\{X\},$$
and summing:
$$\begin{aligned}
\delta_1^2\, p + \delta_0^2\,(1 - p) &= \left(\mu_1^2 p + \mu_0^2 (1 - p)\right) + E\{X\}^2 - 2\left(\mu_1 p + \mu_0 (1 - p)\right) E\{X\} \\
&= \left(\mu_1^2 p + \mu_0^2 (1 - p)\right) - E\{X\}^2.
\end{aligned}$$
Thus, substituting this result into the initial expression one obtains
$$\kappa_2\{X\} = \mu_1^2 p + \mu_0^2 (1 - p) + \sigma_1^2 p + \sigma_0^2 (1 - p) - E\{X\}^2 = E\{X^2\} - E\{X\}^2 = V\{X\}.$$
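A quick numerical check of Proposition 31.6 against the variance formula (31.7), with arbitrary parameters:

# kappa_2 computed from the deltas equals the mixture variance
mu1 <- -2; mu0 <- 2; sd1 <- 1; sd0 <- 1.5; p <- 0.3
m  <- p*mu1 + (1 - p)*mu0
d1 <- mu1 - m; d0 <- mu0 - m
k2 <- (d1^2 + sd1^2)*p + (d0^2 + sd0^2)*(1 - p)
v  <- p*(1 - p)*(mu1 - mu0)^2 + p*sd1^2 + (1 - p)*sd0^2
all.equal(k2, v)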

31.5 Estimation

31.5.1 Maximum likelihood

Minimizing the negative log-likelihood gives an estimate of the parameters, i.e.
$$\underset{\mu_1, \mu_2, \sigma_1, \sigma_2, p}{\arg\min}\left\{-\sum_{i=1}^{t} \log f_X(x_i)\right\},$$
or, equivalently, maximizing the log-likelihood, i.e.
$$\underset{\mu_1, \mu_2, \sigma_1, \sigma_2, p}{\arg\max}\left\{\sum_{i=1}^{t} \log f_X(x_i)\right\}.$$

Maximum likelihood for Gaussian mixture
# Initialize parameters: perturb the true parameter vector 'par'
# (defined together with the simulated sample Xt used in Figure 31.1)
init_params <- par*runif(5, 0.3, 1.1)
# Log-likelihood function (to be maximized)
loss <- function(params, x){
  # Parameters
  mu1 = params[1]
  mu2 = params[2]
  sd1 = params[3]
  sd2 = params[4]
  p = params[5]
  # Ensure that the probability is in (0,1) and the std. deviations are positive
  if(p > 0.99 | p < 0.01 | sd1 < 0 | sd2 < 0){
    return(-1e10) # large penalty, since the function is maximized
  }
  # Mixture log-density (dmixnorm from extraDistr)
  loglik <- function(x) log(dmixnorm(x, c(mu1, mu2), c(sd1, sd2), c(p, 1-p)))
  # Log-likelihood
  sum(loglik(x), na.rm = TRUE)
}
# Optimal parameters
# fnscale = -1 to maximize (equivalently, minimize the negative log-likelihood)
ml_estimate <- optim(par = init_params, loss, 
                     x = Xt, control = list(maxit = 500000, fnscale = -1))
Table 31.1: Maximum likelihood estimates for a Gaussian Mixture.
Parameter True Estimate Log-lik Bias
μ1 -2.0 -1.9905803 -10332.56 0.0094197
μ2 2.0 2.0006381 -10332.56 0.0006381
σ1 1.0 1.0457653 -10332.56 0.0457653
σ2 1.0 1.0033373 -10332.56 0.0033373
p 0.5 0.4937459 -10332.56 -0.0062541

31.5.2 Moments matching

Let's fix the parameters of the first component, namely $\mu_1$ and $\sigma_1^2$, and a certain probability $p$. Then let's compute the sample estimate of the expectation of $X_t$, i.e.
$$\hat{E}\{X\} = \frac{1}{t}\sum_{i=1}^{t} x_i = \hat{\mu},$$
and the sample variance:
$$\hat{V}\{X\} = \frac{1}{t}\sum_{i=1}^{t} (x_i - \hat{\mu})^2 = \hat{\sigma}^2.$$
In order to obtain an estimate of the second component such that the Gaussian mixture moments match exactly the sample estimates, we solve the following system for $\mu_2$ and $\sigma_2^2$:
$$\begin{cases}
\hat{\mu} = p\,\mu_1 + (1 - p)\,\mu_2 \\
\hat{\sigma}^2 = p(1 - p)(\mu_1 - \mu_2)^2 + \sigma_1^2\, p + \sigma_2^2\,(1 - p)
\end{cases}$$
which leads to a unique solution, i.e.
$$\mu_2 = \frac{\hat{\mu} - p\,\mu_1}{1 - p}, \qquad \sigma_2^2 = \frac{\hat{\sigma}^2 - p\,\sigma_1^2}{1 - p} - p\,(\mu_1 - \mu_2)^2.$$
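A minimal sketch of this moment-matching step (the helper match_moments_gm is illustrative, not part of any package): given the fixed first component and probability, it returns the second-component parameters that reproduce the sample mean and variance.

# Moment matching for the second component, with (mu1, sd1, p) fixed
match_moments_gm <- function(x, mu1, sd1, p){
  mu_hat <- mean(x)
  s2_hat <- mean((x - mu_hat)^2)
  mu2  <- (mu_hat - p*mu1)/(1 - p)
  s2_2 <- (s2_hat - p*sd1^2)/(1 - p) - p*(mu1 - mu2)^2
  c(mu2 = mu2, sd2 = sqrt(s2_2))
}
# Example on a simulated mixture: estimates should be close to (2, 1)
set.seed(1)
B <- rbinom(1e5, 1, 0.5)
x <- B*rnorm(1e5, -2, 1) + (1 - B)*rnorm(1e5, 2, 1)
match_moments_gm(x, mu1 = -2, sd1 = 1, p = 0.5)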

31.5.3 EM

To classify an empirical series into two groups (Bernoulli $B_t = 0$ and $B_t = 1$), such that the empirical mean and variance of each group match the theoretical properties of the two underlying normal distributions, we can use an Expectation-Maximization (EM) algorithm. Table 31.2 summarizes the steps and formulas used in the EM routine.

Table 31.2: EM algorithm routine
Step Description
Initialization Initialize the responsibilities and the parameters.
1. E-step Calculate the responsibilities for each data point as:
$$\gamma_{i1} = \frac{p\, f(x_i \mid \mu_1, \sigma_1)}{p\, f(x_i \mid \mu_1, \sigma_1) + (1 - p)\, f(x_i \mid \mu_2, \sigma_2)}, \qquad \gamma_{i2} = \frac{(1 - p)\, f(x_i \mid \mu_2, \sigma_2)}{p\, f(x_i \mid \mu_1, \sigma_1) + (1 - p)\, f(x_i \mid \mu_2, \sigma_2)}.$$
Compute $n_1 = \sum_{i=1}^{n} \gamma_{i1}$ and $n_2 = \sum_{i=1}^{n} \gamma_{i2}$.
2. M-step Update the parameters using the calculated responsibilities:
Means: $\mu_1 = \frac{1}{n_1}\sum_{i=1}^{n} \gamma_{i1} x_i$, $\mu_2 = \frac{1}{n_2}\sum_{i=1}^{n} \gamma_{i2} x_i$.
Variances: $\sigma_1^2 = \frac{1}{n_1 - 1}\sum_{i=1}^{n} \gamma_{i1}(x_i - \mu_1)^2$, $\sigma_2^2 = \frac{1}{n_2 - 1}\sum_{i=1}^{n} \gamma_{i2}(x_i - \mu_2)^2$.
Bernoulli probability: $p = \frac{n_1}{n}$.
3. Log-likelihood Calculate the log-likelihood for the convergence check.
4. Check convergence Stop if the change in log-likelihood is below a threshold, otherwise go back to step 1.
Output Series of Bernoulli $B_t$ and the optimal parameters $\{\mu_1, \mu_2, \sigma_1, \sigma_2, p\}$.
Gaussian Mixture Estimation
# ============================== Inputs ==============================
n <- 5000
# Theoretical parameters
p <- 0.5  # Probability of Bernoulli
# Parameters for the normal distributions
mu1 <- 0  # Mean of the first normal distribution
sigma1 <- 1  # Standard deviation of the first normal distribution
mu2 <- 5  # Mean of the second normal distribution
sigma2 <- 2  # Standard deviation of the second normal distribution
true_params <- c(mu1, mu2, sigma1, sigma2, p)
# ======================================================================
# Generate a random sample 
set.seed(123)  
# Generate empirical data
B <- rbinom(n, size = 1, prob = p)
Z1 <- rnorm(n, mean = mu1, sd = sigma1)
Z2 <- rnorm(n, mean = mu2, sd = sigma2)
x <- B*Z1 + (1-B)*Z2
# ======================================================================
# Perturb the true parameters 
par <- init_params <- true_params*runif(length(true_params), min = 0.8, max = 1.2)
abstol <- 1e-6 # absolute threshold tolerance for convergence 
maxit <- 1000 # maximum iteration 
match_moments = FALSE # match empirical moments for second distribution
# ======================================================================
# Number of observations 
n_ <- length(x)
# Empirical expectation 
e_x_emp = mean(x)
# Empirical variance 
v_x_emp = var(x)
# Empirical std. deviation  
sd_x_emp = sqrt(v_x_emp)

# Initialization 
log_likelihood <- 0
previous_log_likelihood <- -Inf
responsibilities <- matrix(0, nrow = n, ncol = 2)
previous_params <- par
# EM Routine 
for (iteration in 1:maxit) {
  # 1. E-step: Calculate the responsibilities
  for (i in 1:n_) {
    responsibilities[i, 1] <- previous_params[5] * dnorm(x[i], previous_params[1], previous_params[3])
    responsibilities[i, 2] <- (1 - previous_params[5]) * dnorm(x[i], previous_params[2], previous_params[4])
  }
  # Normalize the probabilities by row
  responsibilities <- responsibilities/rowSums(responsibilities)
  
  # 2. M-step: Update the parameters
  n1 <- sum(responsibilities[, 1])
  n2 <- sum(responsibilities[, 2])
  # A) First component 
  ## Means 
  mu1 <- sum(responsibilities[, 1] * x) / n1 # means 
  ## Std. deviations 
  sigma1 <- sqrt(sum(responsibilities[, 1] * (x - mu1)^2)/(n1-1)) # std. deviation  
  # B) Second component  
  if (match_moments) {
    # Match moments approach for (mu2, sigma2)
    # Means 
    mu2 <- (e_x_emp - p * mu1) / (1 - p)
    # Std- deviations
    sigma2 <- sqrt((v_x_emp - p * sigma1^2) / (1 - p) - p * (mu1 - mu2)^2)
  } else {
    # Means
    mu2 <- sum(responsibilities[, 2] * x) / n2 
    # Std. deviations
    sigma2 <- sqrt(sum(responsibilities[, 2] * (x - mu2)^2) / (n2 - 1))  
  }
  # C) Bernoulli probability 
  p <- n1 / n_

  # 3. Calculate the log-likelihood of the mixture
  log_likelihood <- sum(log(p * dnorm(x, mu1, sigma1) + (1 - p) * dnorm(x, mu2, sigma2)))
  
  # 4. Check for convergence
  if (abs(log_likelihood - previous_log_likelihood) < abstol) {
    break
  }
  # Update log-likelihood
  previous_log_likelihood <- log_likelihood
  # Update parameters 
  previous_params <- c(mu1, mu2, sigma1, sigma2, p)
}
# ======================================================================
# Output 
# Mixture classification
B_hat <- ifelse(responsibilities[, 1] > responsibilities[, 2], 1, 0)
x1_hat <- x[B_hat == 1]
x2_hat <- x[B_hat == 0]
# Optimal parameters (from the last M-step update)
par <- c(mu1, mu2, sigma1, sigma2, p)
# Theoretical moments 
e_x_th <- par[1]*par[5] + par[2]*(1-par[5])
sd_x_th <- sqrt(par[5]*(1-par[5])*(par[1] - par[2])^2 + par[3]^2*par[5] + par[4]^2*(1-par[5]))
# Empirical parameters 
par_hat <- c(mean(x1_hat), mean(x2_hat), sd(x1_hat), sd(x2_hat), mean(B_hat))
# Empirical moments 
e_x_hat <- par_hat[1]*par_hat[5] + par_hat[2]*(1-par_hat[5])
sd_x_hat <- sqrt(par_hat[5]*(1-par_hat[5])*(par_hat[1] - par_hat[2])^2 + par_hat[3]^2*par_hat[5] + par_hat[4]^2*(1-par_hat[5]))
# ======================================================================
# Dataset containing classified series 
data = dplyr::tibble(B = B_hat, x = x, x1 = x*B, x2 = x*(1-B))
# Dataset containing initial, optimal, empirical and true parameters 
params = tibble(param = c("mu1","mu2","sd1","sd2","p"), 
                init = init_params,
                opt = par, 
                hat = par_hat,
                true = true_params)
# Dataset containing empirical and theoretical moments series 
moments = tibble(statistic = c("Mean", "Std. Dev"), 
                 emp = c(e_x_emp, sd_x_emp), 
                 opt = c(e_x_th, sd_x_th), 
                 hat = c(e_x_hat, sd_x_hat))
Table 31.3: Estimated Gaussian Mixture moments with EM
statistic emp opt hat
Mean 2.516087 2.516087 2.516087
Std. Dev 2.958034 2.957904 2.957879
Figure 31.3: Classified simulated series with EM

31.5.4 Matrix moments matching

Proposition 31.7 Any finite K-component Gaussian mixture with finite moments admits a more parsimonious moment-matching approximation with a two-component Gaussian mixture. We start with the parameters of the mixture at time $t+h$ and adjust them to match the first three moments of the multinomial mixture. The procedure ensures that the resulting distribution has
$$E\{U_{t+h} \mid \mathcal{F}_t\} = M(t,0,h), \qquad V\{U_{t+h} \mid \mathcal{F}_t\} = S(t,0,h), \qquad E\{U_{t+h}^3 \mid \mathcal{F}_t\} = \Omega(t,0,h),$$
where
$$\Omega(t,0,h) = E\{U_{t+h}^3 \mid \mathcal{F}_t\} = \sum_{j=0}^{h-1} E\{\psi_{t+h-j}^3 \mid \mathcal{F}_t\}\, E\{u_{t+h-j}^3\}.$$
In general, the variance and the skewness do not converge to a constant for all $t$: they depend on the skewness from the starting point at $t+1$ up to the ending point at $t+h$, and need to be recomputed each time. The means $\mu_{1,t+h}$ and $\mu_{0,t+h}$ of the resulting mixture are fixed as in Equation 31.11 below, while the variances are adjusted to:
$$\sigma_{1,t+h}^2 = \frac{3(1 - p_{t+h})\mu_{0,t+h}\left(\Sigma_{t+h} - p_{t+h}\mu_{1,t+h}^2 - (1 - p_{t+h})\mu_{0,t+h}^2\right) - (1 - p_{t+h})\left(\Omega_{t+h} - p_{t+h}\mu_{1,t+h}^3 - (1 - p_{t+h})\mu_{0,t+h}^3\right)}{3 p_{t+h}(1 - p_{t+h})\left(\mu_{0,t+h} - \mu_{1,t+h}\right)} \tag{31.9}$$
and
$$\sigma_{0,t+h}^2 = \frac{p_{t+h}\left(\Omega_{t+h} - p_{t+h}\mu_{1,t+h}^3 - (1 - p_{t+h})\mu_{0,t+h}^3\right) - 3 p_{t+h}\mu_{1,t+h}\left(\Sigma_{t+h} - p_{t+h}\mu_{1,t+h}^2 - (1 - p_{t+h})\mu_{0,t+h}^2\right)}{3 p_{t+h}(1 - p_{t+h})\left(\mu_{0,t+h} - \mu_{1,t+h}\right)}. \tag{31.10}$$

Proof. Let's consider the following approach. We start by fixing the mixture probabilities at time $t+h$, i.e. $p_{t+h}$. Then, we consider the means and variances of the mixture at time $t+h$ free to vary, and we adjust them to match the first three central moments of the true multinomial mixture with a two-component Gaussian mixture. More precisely, let's define:
$$\mu_{i,t+h} = \mu_{Y_i}(t,h), \qquad \sigma_{i,t+h}^2 = \sigma_{Y_i}^2(t,h). \tag{31.11}$$
Recalling the central moments as in ?eq-proof-ut-moments, with the parameters defined as in (31.11) the resulting expected value and variance already match the exact expectation and variance of the multinomial mixture. To improve the match between the two distributions, one can also make the third central moment explicit, i.e.
$$\omega_{t+h} = \left(3\sigma_{1,t+h}^2\mu_{1,t+h} + \mu_{1,t+h}^3\right) p_{t+h} + \left(3\sigma_{0,t+h}^2\mu_{0,t+h} + \mu_{0,t+h}^3\right)(1 - p_{t+h}).$$
In this way, one can set up a system to adjust the variances $\sigma_{i,t+h}^2$ such that the second and third central moments of the two-component Gaussian mixture match the ones of the multinomial mixture. More precisely, the third central moment of $y_{t+h}$ reads:
$$\kappa_3\{y_{t+h} \mid \mathcal{F}_t\} = \sum_{j=s}^{h-1} E\{\psi_{t+h-j}^3 \mid \mathcal{F}_t\}\, E\{u_{t+h-j}^3\}.$$
Let's denote the target moments as:
$$\Sigma_{t+h} = \kappa_2\{y_{t+h} \mid \mathcal{F}_t\}, \qquad \Omega_{t+h} = \kappa_3\{y_{t+h} \mid \mathcal{F}_t\},$$
and let's represent the system in matrix form:
$$\underbrace{\begin{pmatrix} p_{t+h} & 1 - p_{t+h} \\ 3 p_{t+h}\mu_{1,t+h} & 3(1 - p_{t+h})\mu_{0,t+h} \end{pmatrix}}_{D}
\underbrace{\begin{pmatrix} \sigma_{1,t+h}^2 \\ \sigma_{0,t+h}^2 \end{pmatrix}}_{\Sigma}
=
\underbrace{\begin{pmatrix} \Sigma_{t+h} - p_{t+h}\mu_{1,t+h}^2 - (1 - p_{t+h})\mu_{0,t+h}^2 \\ \Omega_{t+h} - p_{t+h}\mu_{1,t+h}^3 - (1 - p_{t+h})\mu_{0,t+h}^3 \end{pmatrix}}_{G}.$$
The solution of the system has the form $D\Sigma = G \iff \Sigma = D^{-1}G$. The determinant of the matrix $D$ is different from zero only if $\mu_{1,t+h} \neq \mu_{0,t+h}$, i.e.
$$\det(D) = 3 p_{t+h}(1 - p_{t+h})\left(\mu_{0,t+h} - \mu_{1,t+h}\right).$$
By applying Cramer's rule the system can be solved explicitly for $i \in \{0, 1\}$, i.e.
$$\sigma_{i,t+h}^2 = \frac{\det(D_i)}{\det(D)}, \tag{31.12}$$
where $D_1$ is obtained by replacing the first column of $D$ with $G$, i.e.
$$D_1 = \begin{pmatrix} \Sigma_{t+h} - p_{t+h}\mu_{1,t+h}^2 - (1 - p_{t+h})\mu_{0,t+h}^2 & 1 - p_{t+h} \\ \Omega_{t+h} - p_{t+h}\mu_{1,t+h}^3 - (1 - p_{t+h})\mu_{0,t+h}^3 & 3(1 - p_{t+h})\mu_{0,t+h} \end{pmatrix}.$$
Then:
$$\det(D_1) = 3(1 - p_{t+h})\mu_{0,t+h}\left(\Sigma_{t+h} - p_{t+h}\mu_{1,t+h}^2 - (1 - p_{t+h})\mu_{0,t+h}^2\right) - (1 - p_{t+h})\left(\Omega_{t+h} - p_{t+h}\mu_{1,t+h}^3 - (1 - p_{t+h})\mu_{0,t+h}^3\right).$$
Similarly, $D_0$ is obtained by replacing the second column of $D$ with $G$, i.e.
$$D_0 = \begin{pmatrix} p_{t+h} & \Sigma_{t+h} - p_{t+h}\mu_{1,t+h}^2 - (1 - p_{t+h})\mu_{0,t+h}^2 \\ 3 p_{t+h}\mu_{1,t+h} & \Omega_{t+h} - p_{t+h}\mu_{1,t+h}^3 - (1 - p_{t+h})\mu_{0,t+h}^3 \end{pmatrix}.$$
Then:
$$\det(D_0) = p_{t+h}\left(\Omega_{t+h} - p_{t+h}\mu_{1,t+h}^3 - (1 - p_{t+h})\mu_{0,t+h}^3\right) - 3 p_{t+h}\mu_{1,t+h}\left(\Sigma_{t+h} - p_{t+h}\mu_{1,t+h}^2 - (1 - p_{t+h})\mu_{0,t+h}^2\right).$$
Substituting into (31.12) and simplifying, one obtains the explicit solutions of the system.
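A minimal sketch of the variance-matching step in Equations (31.9) and (31.10) (the helper match_variances_gm is illustrative; for the check, the targets Sigma and Omega are taken as the raw second and third moments of a known two-component mixture, so the true variances are recovered exactly):

# Solve the 2x2 system D %*% c(s1sq, s0sq) = G by Cramer's rule
match_variances_gm <- function(Sigma, Omega, mu1, mu0, p){
  stopifnot(mu1 != mu0)  # det(D) = 0 when the component means coincide
  G1 <- Sigma - p*mu1^2 - (1 - p)*mu0^2
  G2 <- Omega - p*mu1^3 - (1 - p)*mu0^3
  detD <- 3*p*(1 - p)*(mu0 - mu1)
  s1sq <- (3*(1 - p)*mu0*G1 - (1 - p)*G2)/detD  # Equation (31.9)
  s0sq <- (p*G2 - 3*p*mu1*G1)/detD              # Equation (31.10)
  c(sigma1_sq = s1sq, sigma0_sq = s0sq)
}
# Target moments of a known two-component mixture
mu1 <- -2; mu0 <- 2; s1 <- 1; s0 <- 1.5; p <- 0.4
Sigma <- p*(mu1^2 + s1^2) + (1 - p)*(mu0^2 + s0^2)              # second moment
Omega <- p*(mu1^3 + 3*mu1*s1^2) + (1 - p)*(mu0^3 + 3*mu0*s0^2)  # third moment
match_variances_gm(Sigma, Omega, mu1, mu0, p)  # recovers (1, 2.25)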