10.1 Expectation

The expectation of a random variable X is its first moment, also called the statistical average. In general, it is denoted as $E\{X\}$. Let's consider a discrete random variable $X$ with distribution function $P(X = x_j) = p_j$. Then the expectation of $X$ is the weighted average of all the possible $m$ states that the random variable can assume, each weighted by its respective probability of occurrence, i.e.
$$E\{X\} = \sum_{j=1}^{m} x_j\, p_j.$$
In the continuous case, i.e. when $X$ takes values in $\mathbb{R}$ and admits a density function, the expectation is computed as an integral, i.e.
$$E\{X\} = \int x\, dF_X(x) = \int x\, f_X(x)\, dx.$$
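As a minimal numerical illustration (the states, probabilities and density below are illustrative assumptions, not taken from the text), the discrete expectation is a probability-weighted sum, while the continuous one can be approximated by numerical integration:

# Discrete case: weighted average of the states by their probabilities
x_j <- c(0, 1, 2, 3)                 # possible states (hypothetical)
p_j <- c(0.1, 0.2, 0.3, 0.4)         # probabilities summing to one
e_x_discrete <- sum(x_j * p_j)

# Continuous case: E{X} as the integral of x f(x) dx, here with a N(1, 2) density
e_x_continuous <- integrate(function(x) x * dnorm(x, mean = 1, sd = sqrt(2)),
                            lower = -Inf, upper = Inf)$value
c(e_x_discrete, e_x_continuous)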

10.1.1 Sample statistic

Let's consider a sample of IID observations, i.e. $X_n = (x_1, \ldots, x_i, \ldots, x_n)$. Then the sample expectation is computed as:
$$\hat{\mu}(X_n) = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

Population vs sample

In general, the notation $X_n$ refers to a finite sample, e.g. $\hat{\mu}(X_n)$ is the sample mean. Instead, the notation without $n$, i.e. $X$, stands for the random variable in the population, e.g. $E\{X\}$ is the population mean. A population can be finite or infinite. In the case of a finite population with $N$ elements it is useful to distinguish between:

  • Extraction with replacement of n elements for the sample gives $N^n$ possible samples.
  • Extraction without replacement of n elements for the sample gives $\binom{N}{n}$ possible combinations.
Table 10.1: Expectation in a discrete and continuous population and in a sample $X_n$.

  Population (continuous): $\int x\, f(x)\, dx$
  Population (discrete):   $\sum_{j=1}^{m} x_j\, p_j$
  Sample:                  $\frac{1}{n}\sum_{i=1}^{n} x_i$
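A small R sketch of the finite-population remark above (N and n are arbitrary illustrative values): it counts the possible samples with and without replacement and compares a sample mean with the population mean.

# Finite population (illustrative values)
N <- 10; n <- 4
x_pop <- 1:N
# Number of possible samples with replacement: N^n
N^n
# Number of possible combinations without replacement: choose(N, n)
choose(N, n)
# Population mean vs the sample mean of one extraction without replacement
mean(x_pop)
set.seed(1)
mean(sample(x_pop, size = n, replace = FALSE))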

10.1.2 Sample moments

Let's consider the moments of the sample mean of an IID sample. Since all the variables have the same expected value, i.e. $E\{x_i\} = E\{X\}$, the expected value of the sample mean is computed as:
$$E\{\hat{\mu}(X_n)\} = \frac{1}{n}\sum_{i=1}^{n} E\{x_i\} = E\{X\}. \tag{10.1}$$
The variance of the sample mean, using the independence of the observations, is computed as:
$$V\{\hat{\mu}(X_n)\} = \frac{1}{n^2} V\left\{\sum_{i=1}^{n} x_i\right\} = \frac{1}{n^2}\sum_{i=1}^{n} V\{x_i\} = \frac{V\{X\}}{n}. \tag{10.2}$$
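A quick Monte Carlo check of (10.1) and (10.2), a sketch assuming a normal population with the same illustrative moments used later in the chapter (E{X} = 1, V{X} = 2):

# Moments of the sample mean via simulation
e_x <- 1; v_x <- 2; n <- 100; n_sim <- 5000
set.seed(1)
sample_means <- replicate(n_sim, mean(rnorm(n, mean = e_x, sd = sqrt(v_x))))
# Empirical moments vs the theoretical values E{X} and V{X}/n
c(mean(sample_means), e_x)
c(var(sample_means), v_x / n)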

10.1.3 Sample distribution

Proposition 10.1 Let's consider a sample $X_n$ of $n$ IID random variables. If $n$ is sufficiently large, independently of the distribution of $X$, by the central limit theorem (CLT) the distribution of the sample expectation converges to the distribution of a normal random variable, i.e.
$$\hat{\mu}(X_n) \xrightarrow[n \to \infty]{d} N\!\left(E\{X\}, \frac{V\{X\}}{n}\right).$$

Proof. In order to prove it, it is useful to compute the expectation and the variance of the following random variable, i.e. $S_n = \sum_{i=1}^{n} x_i$. The expectation and the variance of $S_n$ can be easily obtained from (10.1) and (10.2) respectively and read:
$$E\{S_n\} = n\,E\{X\}, \qquad V\{S_n\} = n\,V\{X\}.$$
Applying the central limit theorem one obtains:
$$\frac{S_n - n\,E\{X\}}{\sqrt{n}\,Sd\{X\}} = \frac{\frac{S_n}{n} - E\{X\}}{\frac{Sd\{X\}}{\sqrt{n}}} \xrightarrow{d} N(0,1).$$
Hence the sample mean $\hat{\mu}(X_n) = \frac{S_n}{n}$ in large samples is distributed as a normal random variable, i.e.
$$\hat{\mu}(X_n) = \frac{S_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \xrightarrow[n \to \infty]{d} N\!\left(E\{X\}, \frac{V\{X\}}{n}\right).$$
Note that in small samples this result holds true if and only if $X$ is normally distributed also in the population. Under normality in the population we have that, independently of the sample size,
$$x_i \sim N(E\{X\}, V\{X\})\ \forall i \quad \Longrightarrow \quad \hat{\mu}(X_n) \sim N\!\left(E\{X\}, \frac{V\{X\}}{n}\right).$$

Distribution of sample mean
# True population moments  
true <- c(e_x = 1, v_x = 2)
# Number of elements for large samples
n <- 5000
# Number of elements for small samples
n_small <- trunc(n/30)
# Number of samples to simulate 
n_sample <- 2000

# Simulation of sample means
stat_sample_small <- c()
stat_sample_large <- c()
for(i in 1:n_sample){
  set.seed(i)
  # Large sample 
  x_n <- true[1] +  sqrt(true[2])*rnorm(n)
  # Statistic 
  stat_sample_large[i] <- mean(x_n)
  # Small sample 
  x_n <- x_n[1:n_small]
  # Statistic 
  stat_sample_small[i] <- mean(x_n)
}
Figure 10.1: Distribution of the sample mean. (a) Small sample (n = 166); (b) large sample (n = 5000).

10.2 Variance and covariance

In general, the variance of a random variable in the population is defined as:
$$V\{X\} = E\{(X - E\{X\})^2\}.$$
Let's consider a discrete random variable $X$ with distribution function $P(X = x_j) = p_j$. Then the variance of $X$ is the weighted average of all the possible $m$ centered and squared states that the random variable can assume, each weighted by its respective probability of occurrence, i.e.
$$V\{X\} = \sum_{j=1}^{m} (x_j - E\{X\})^2\, p_j.$$
In the continuous case, i.e. when $X$ admits a density function and takes values in $\mathbb{R}$, the variance is computed as:
$$V\{X\} = \int (x - E\{X\})^2 f_X(x)\, dx.$$
Let's consider two random variables $X$ and $Y$. Then, in general, their covariance is defined as:
$$Cv\{X,Y\} = E\{(X - E\{X\})(Y - E\{Y\})\}.$$
In the discrete case, where $X$ and $Y$ have a joint distribution $P(X = x_i, Y = y_j) = p_{ij}$, their covariance is defined as:
$$Cv\{X,Y\} = \sum_{i=1}^{m}\sum_{j=1}^{s} (x_i - E\{X\})(y_j - E\{Y\})\, p_{ij}.$$
In the continuous case, if the joint distribution of $X$ and $Y$ admits a density function, the covariance is computed as:
$$Cv\{X,Y\} = \iint (x - E\{X\})(y - E\{Y\})\, f_{X,Y}(x,y)\, dx\, dy.$$
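A minimal sketch of the discrete definitions above, using a small illustrative joint distribution (the states and probabilities are assumptions made only for the example):

# Joint distribution of (X, Y) on a 2 x 2 grid (hypothetical probabilities)
x_i <- c(0, 1); y_j <- c(0, 2)
p_ij <- matrix(c(0.3, 0.2,
                 0.1, 0.4), nrow = 2, byrow = TRUE)   # rows: X, columns: Y
# Marginal expectations
e_x <- sum(x_i * rowSums(p_ij))
e_y <- sum(y_j * colSums(p_ij))
# Variance of X and covariance between X and Y from the definitions
v_x   <- sum((x_i - e_x)^2 * rowSums(p_ij))
cv_xy <- sum(outer(x_i - e_x, y_j - e_y) * p_ij)
c(e_x = e_x, e_y = e_y, v_x = v_x, cv_xy = cv_xy)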

10.2.1 Properties

There are several useful properties connected to the variance and the covariance.

  1. The variance can be computed as:
$$V\{X\} = E\{X^2\} - E\{X\}^2. \tag{10.3}$$
  2. The variance is invariant with respect to the addition of a constant $a$, i.e.
$$V\{a + X\} = V\{X\}. \tag{10.4}$$
  3. The variance scales upon multiplication by a constant $a$, i.e.
$$V\{aX\} = a^2\, V\{X\}. \tag{10.5}$$
  4. The variance of a sum is computed as:
$$V\{X + Y\} = V\{X\} + V\{Y\} + 2\,Cv\{X,Y\}. \tag{10.6}$$
  5. The covariance can be expressed as:
$$Cv\{X,Y\} = E\{XY\} - E\{X\}E\{Y\}. \tag{10.7}$$
  6. The covariance scales upon multiplication by constants $a$ and $b$, i.e.
$$Cv\{aX, bY\} = ab\, Cv\{X,Y\}. \tag{10.8}$$

Proof. Property 1 (10.3) follows easily by developing the definition of variance, i.e.
$$V\{X\} = E\{(X - E\{X\})^2\} = E\{X^2\} + E\{X\}^2 - 2E\{X\}^2 = E\{X^2\} - E\{X\}^2.$$
Property 2 (10.4) follows from the definition, i.e.
$$V\{a + X\} = E\{(a + X - E\{a + X\})^2\} = E\{(X - E\{X\})^2\} = V\{X\}.$$
Property 3 (10.5) follows using the expression of the variance in (10.3), i.e.
$$V\{aX\} = E\{(aX)^2\} - E\{aX\}^2 = a^2 E\{X^2\} - a^2 E\{X\}^2 = a^2\left(E\{X^2\} - E\{X\}^2\right) = a^2\, V\{X\}.$$
Property 4 (10.6), i.e. the variance of the sum of two random variables, is:
$$\begin{aligned}
V\{X + Y\} &= E\{(X + Y - E\{X + Y\})^2\} \\
&= E\{([X - E\{X\}] + [Y - E\{Y\}])^2\} \\
&= E\{(X - E\{X\})^2\} + E\{(Y - E\{Y\})^2\} + 2E\{(X - E\{X\})(Y - E\{Y\})\} \\
&= V\{X\} + V\{Y\} + 2\,Cv\{X,Y\},
\end{aligned}$$
where, in the case in which there is no linear dependence between $X$ and $Y$, the covariance is zero, i.e. $Cv\{X,Y\} = 0$. Developing the computation of the covariance it is possible to prove property 5 (10.7), i.e.
$$\begin{aligned}
Cv\{X,Y\} &= E\{(X - E\{X\})(Y - E\{Y\})\} \\
&= E\{XY - X E\{Y\} - Y E\{X\} + E\{X\}E\{Y\}\} \\
&= E\{XY\} - 2E\{X\}E\{Y\} + E\{X\}E\{Y\} \\
&= E\{XY\} - E\{X\}E\{Y\}.
\end{aligned}$$
Finally, using the result in property 5 (10.7), the result in property 6 (10.8) follows easily:
$$Cv\{aX, bY\} = E\{aX\,bY\} - E\{aX\}E\{bY\} = ab\,E\{XY\} - ab\,E\{X\}E\{Y\} = ab\,Cv\{X,Y\}.$$
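A Monte Carlo sanity check of properties (10.3) to (10.8), a sketch with arbitrary illustrative constants and a simulated bivariate sample (simulation error makes each pair of numbers approximately, not exactly, equal):

# Simulated correlated pair (X, Y) and two constants
set.seed(1)
n <- 1e5; a <- 2; b <- -3
x <- rnorm(n, mean = 1, sd = sqrt(2))
y <- 0.5 * x + rnorm(n)
# (10.3) V{X} = E{X^2} - E{X}^2
c(var(x), mean(x^2) - mean(x)^2)
# (10.4) and (10.5): translation invariance and scaling
c(var(a + x), var(x), var(a * x), a^2 * var(x))
# (10.6) variance of the sum
c(var(x + y), var(x) + var(y) + 2 * cov(x, y))
# (10.7) covariance as E{XY} - E{X}E{Y}
c(cov(x, y), mean(x * y) - mean(x) * mean(y))
# (10.8) bilinear scaling of the covariance
c(cov(a * x, b * y), a * b * cov(x, y))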

10.2.2 Conditional variance

Proposition 10.2 (Conditional variance)
Let's consider two random variables $X$ and $Y$ with finite second moments. Then, the total variance can be decomposed as:
$$V\{X\} = E\{V\{X \mid Y\}\} + V\{E\{X \mid Y\}\}, \qquad V\{Y\} = E\{V\{Y \mid X\}\} + V\{E\{Y \mid X\}\}.$$

Proof. By definition, the variance of a random variable reads $V\{X\} = E\{X^2\} - E\{X\}^2$. Applying the tower property we can write
$$V\{X\} = E\{E\{X^2 \mid Y\}\} - E\{E\{X \mid Y\}\}^2.$$
Then, add and subtract $E\{E\{X \mid Y\}^2\}$:
$$V\{X\} = E\{E\{X^2 \mid Y\}\} - E\{E\{X \mid Y\}\}^2 + E\{E\{X \mid Y\}^2\} - E\{E\{X \mid Y\}^2\}.$$
Grouping the first and fourth terms and the second and third terms one obtains
$$V\{X\} = E\{E\{X^2 \mid Y\} - E\{X \mid Y\}^2\} + E\{E\{X \mid Y\}^2\} - E\{E\{X \mid Y\}\}^2 = E\{V\{X \mid Y\}\} + V\{E\{X \mid Y\}\}.$$
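A short numerical illustration of the variance decomposition, a sketch assuming a two-group mixture where Y is the group label (groups and moments are arbitrary for the example); the two quantities agree up to simulation error:

# Law of total variance on a two-group mixture: Y is the group label
set.seed(1)
n <- 1e5
y <- rbinom(n, size = 1, prob = 0.3)             # group indicator
x <- rnorm(n, mean = ifelse(y == 1, 2, 0),       # group-specific mean
           sd = ifelse(y == 1, 1, 2))            # group-specific sd
# Within-group variances and means (conditional on Y)
v_within <- tapply(x, y, var)
m_within <- tapply(x, y, mean)
p_y <- table(y) / n
# E{V{X|Y}} + V{E{X|Y}} vs the total variance V{X}
e_v <- sum(v_within * p_y)
v_e <- sum((m_within - sum(m_within * p_y))^2 * p_y)
c(total = var(x), decomposition = e_v + v_e)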

10.2.3 Sample statistic

The sample variance on $X_n = (x_1, \ldots, x_i, \ldots, x_n)$ is computed as:
$$V\{X_n\} = \hat{\sigma}^2(X_n) = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \hat{\mu}(X_n)\right)^2. \tag{10.9}$$
Equivalently, in terms of the first and second sample moments:
$$\hat{\sigma}^2(X_n) = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2. \tag{10.10}$$
In general, the variance computed as in (10.9) is not an unbiased estimator of the population value. Hence, to correct the estimator, let's define the corrected (unbiased) sample variance:
$$\hat{s}^2(X_n) = \frac{n}{n-1}\,\hat{\sigma}^2(X_n). \tag{10.11}$$
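In R the built-in var() already applies the n/(n-1) correction of (10.11); a minimal sketch with simulated data (illustrative moments) comparing the two estimators:

# Biased (1/n) and corrected (1/(n-1)) sample variances
set.seed(1)
x_n <- rnorm(20, mean = 1, sd = sqrt(2))
n <- length(x_n)
sigma2_hat <- mean((x_n - mean(x_n))^2)   # biased estimator, equation (10.9)
s2_hat <- n / (n - 1) * sigma2_hat        # corrected estimator, equation (10.11)
c(sigma2_hat = sigma2_hat, s2_hat = s2_hat, var = var(x_n))  # var() matches s2_hat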

10.2.4 Sample moments

Let's consider the moments of the sample variance on an IID sample. The expected value of the corrected sample variance is:
$$E\{\hat{s}^2(X_n)\} = \sigma^2. \tag{10.12}$$
The variance of the corrected sample variance is:
$$V\{\hat{s}^2(X_n)\} = \frac{\sigma^4}{n}\left(\left(\frac{\mu_4}{\sigma^4} - 3\right) + \frac{2n}{n-1}\right), \tag{10.13}$$
where $\frac{\mu_4}{\sigma^4}$ is the kurtosis of $X_n$. If the population is normal, $\frac{\mu_4}{\sigma^4} = 3$ and the variance simplifies to:
$$V\{\hat{s}^2(X_n)\} = \frac{2\sigma^4}{n-1}. \tag{10.14}$$
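A quick simulation check of (10.12) and (10.14) under a normal population, a sketch reusing the illustrative moments E{X} = 1 and σ² = 2:

# Moments of the corrected sample variance under normality
set.seed(1)
sigma2 <- 2; n <- 50; n_sim <- 5000
s2 <- replicate(n_sim, var(rnorm(n, mean = 1, sd = sqrt(sigma2))))
# Empirical mean vs sigma^2 and empirical variance vs 2*sigma^4/(n-1)
c(mean(s2), sigma2)
c(var(s2), 2 * sigma2^2 / (n - 1))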

10.2.5 Sample distribution

The distribution of the sample variance is available when the population is normal, since sums of squared IID standard normal random variables follow a $\chi^2$ distribution. Notably, from Cochran's theorem:
$$T_n = \frac{(n-1)\,\hat{s}^2(X_n)}{\sigma^2} \sim \chi^2(n-1). \tag{10.15}$$
In the limit as $\nu \to \infty$ a $\chi^2_{\nu}$ random variable converges to a standard normal random variable, i.e.
$$\frac{\chi^2(n) - n}{\sqrt{2n}} \xrightarrow[n \to \infty]{d} N(0,1);$$
therefore, in large samples the statistic $T_n$ converges to a normal random variable, i.e.
$$T_n \xrightarrow[n \to \infty]{d} N(n, 2n) \quad \Longleftrightarrow \quad \frac{T_n - n}{\sqrt{2n}} \approx N(0,1). \tag{10.16}$$

If the population is normal, then the distribution of $\hat{s}^2(X_n)$ is proportional to the distribution of a $\chi^2_{n-1}$. In fact, from (10.15) the expectation of $\hat{s}^2(X_n)$ is:
$$E\{T_n\} = \frac{(n-1)\,E\{\hat{s}^2(X_n)\}}{\sigma^2} \quad \Longrightarrow \quad E\{\hat{s}^2(X_n)\} = \frac{\sigma^2\, E\{T_n\}}{n-1} = \frac{\sigma^2 (n-1)}{n-1} = \sigma^2.$$
Similarly, computing the variance of (10.15) and knowing that $V\{T_n\} = 2(n-1)$, one obtains:
$$V\{T_n\} = \frac{(n-1)^2\, V\{\hat{s}^2(X_n)\}}{\sigma^4} \quad \Longrightarrow \quad V\{\hat{s}^2(X_n)\} = \frac{\sigma^4\, V\{T_n\}}{(n-1)^2} = \frac{\sigma^4\, 2(n-1)}{(n-1)^2} = \frac{2\sigma^4}{n-1}.$$

Distribution of Tn under normality
# True population moments  
true <- c(e_x = 1, v_x = 2)
# Number of elements for large samples
n <- 5000
# Number of elements for small samples
n_small <- trunc(n/30)
# Number of samples to simulate 
n_sample <- 2000

# Simulation of the statistic Tn
stat_sample_small <- c()
stat_sample_large <- c()
for(i in 1:n_sample){
  set.seed(i)
  # Large sample 
  x_n <- true[1] + sqrt(true[2])*rnorm(n)
  # Statistic: Tn = (n - 1) * s^2 / sigma^2, equation (10.15)
  stat_sample_large[i] <- sum((x_n - mean(x_n))^2)/true[2]
  # Small sample 
  x_n <- x_n[1:n_small]
  # Statistic: Tn on the small sample
  stat_sample_small[i] <- sum((x_n - mean(x_n))^2)/true[2]
}
Figure 10.2: Distribution of the statistic Tn under normality. (a) Small sample (n = 166); (b) large sample (n = 5000).

10.3 Skewness

The skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined. For a unimodal distribution, negative skew commonly indicates that the tail is on the left side of the distribution, and positive skew indicates that the tail is on the right.

Figure 10.3: Skewness of a random variable.

Following the same notation as in Ralph B. D'Agostino and Jr. (), let's define and denote the population skewness of a random variable $X$ as:
$$Sk\{X\} = \beta_1(X) = E\left\{\left(\frac{X - E\{X\}}{\sqrt{V\{X\}}}\right)^3\right\}.$$

10.3.1 Sample statistic

Let's consider an IID sample $X_n = (x_1, \ldots, x_i, \ldots, x_n)$; then the sample skewness is estimated as:
$$Sk\{X_n\} = b_1(X_n) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \hat{\mu}(X_n)}{\hat{\sigma}(X_n)}\right)^3. \tag{10.17}$$
The estimator in (10.17) is biased. Hence, let's define the corrected sample estimator of the skewness as:
$$g_1(X_n) = \frac{\sqrt{n(n-1)}}{n-2}\, b_1(X_n).$$
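A minimal R sketch of (10.17) and of its correction, written directly from the formulas above on simulated data (a skewed exponential sample is used purely as an illustration):

# Sample skewness b1 and its small-sample correction g1
set.seed(1)
x_n <- rexp(100, rate = 1)                      # skewed illustrative sample
n <- length(x_n)
sigma_hat <- sqrt(mean((x_n - mean(x_n))^2))    # 1/n standard deviation
b1 <- mean(((x_n - mean(x_n)) / sigma_hat)^3)
g1 <- sqrt(n * (n - 1)) / (n - 2) * b1
c(b1 = b1, g1 = g1)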

10.3.2 Sample moments

Under normality, the asymptotic moments of the sample skewness are:
$$E\{b_1(X_n)\} = 0, \qquad V\{b_1(X_n)\} = \frac{6}{n}.$$
In Urzúa () the exact moments of the estimator in (10.17) for small normal samples are also reported, i.e. the mean
$$E\{b_1(X_n)\} = 0,$$
and the variance
$$V\{b_1(X_n)\} = \frac{6(n-2)}{(n+1)(n+3)}. \tag{10.18}$$

10.3.3 Sample distribution

Under normality, the asymptotic distribution of the sample skewness is normal, i.e.
$$b_1(X_n) \xrightarrow[n \to \infty]{d} N\!\left(0, \frac{6}{n}\right). \tag{10.19}$$
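A small simulation, in the spirit of the earlier chunks, sketching how the sample skewness of normal samples matches the asymptotic variance 6/n and the exact small-sample variance (10.18):

# Sample skewness of simulated normal samples
set.seed(1)
n <- 166; n_sim <- 2000
b1 <- replicate(n_sim, {
  x <- rnorm(n, mean = 1, sd = sqrt(2))
  s <- sqrt(mean((x - mean(x))^2))
  mean(((x - mean(x)) / s)^3)
})
# Empirical variance vs asymptotic 6/n and exact 6(n-2)/((n+1)(n+3))
c(var(b1), 6 / n, 6 * (n - 2) / ((n + 1) * (n + 3)))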

10.4 Kurtosis

The kurtosis is a measure of the tailedness of the probability distribution of a real-valued random variable. The standard measure of a distribution's kurtosis, originating with Karl Pearson, is a scaled version of the fourth moment of the distribution. This number is related to the tails of the distribution. For this measure, higher kurtosis corresponds to greater extremity of deviations from the mean (or outliers). In general, it is common to compare the excess kurtosis of a distribution with respect to the normal distribution (which has kurtosis equal to 3). It is possible to distinguish 3 cases:

  1. Negative excess kurtosis, or platykurtic: distributions that produce fewer outliers than the normal distribution.
  2. Zero excess kurtosis, or mesokurtic: distributions that produce the same amount of outliers as the normal distribution.
  3. Positive excess kurtosis, or leptokurtic: distributions that produce more outliers than the normal distribution.
Figure 10.4: Kurtosis of different leptokurtic distributions.

Let's define and denote the population kurtosis of a random variable $X$ as:
$$Kt\{X\} = \beta_2(X) = E\left\{\left(\frac{X - E\{X\}}{\sqrt{V\{X\}}}\right)^4\right\},$$
or equivalently the excess kurtosis as $Kt\{X\} - 3$.

10.4.1 Sample statistic

Let's consider an IID sample $X_n = (x_1, \ldots, x_i, \ldots, x_n)$; then the sample kurtosis is denoted as:
$$Kt\{X_n\} = b_2(X_n) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \hat{\mu}(X_n)}{\hat{\sigma}(X_n)}\right)^4. \tag{10.20}$$
From Pearson (), we have a corrected version of $b_2(X_n)$ defined as:
$$g_2(X_n) = \left[b_2(X_n) - \frac{3(n-1)}{n+1}\right]\frac{(n+1)(n-1)}{(n-2)(n-3)}.$$
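A minimal R sketch of (10.20) and of the corrected estimator, written directly from the formulas above on an illustrative simulated sample:

# Sample kurtosis b2 and its small-sample correction g2
set.seed(1)
x_n <- rt(200, df = 6)                          # heavy-tailed illustrative sample
n <- length(x_n)
sigma_hat <- sqrt(mean((x_n - mean(x_n))^2))
b2 <- mean(((x_n - mean(x_n)) / sigma_hat)^4)
g2 <- (b2 - 3 * (n - 1) / (n + 1)) * (n + 1) * (n - 1) / ((n - 2) * (n - 3))
c(b2 = b2, excess_g2 = g2)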

10.4.2 Sample moments

Under normality, the asymptotic moments of the sample kurtosis are:
$$E\{b_2(X_n)\} = 3, \qquad V\{b_2(X_n)\} = \frac{24}{n}.$$
Notably, in Urzúa () the exact mean and variance for a small normal sample are also reported, i.e. the mean
$$E\{b_2(X_n)\} = \frac{3(n-1)}{n+1}, \tag{10.21}$$
and the variance:
$$V\{b_2(X_n)\} = \frac{24\,n(n-2)(n-3)}{(n+1)^2(n+3)(n+5)}. \tag{10.22}$$

10.4.3 Sample distribution

Under normality, the asymptotic distribution of the sample kurtosis is normal, i.e.
$$b_2(X_n) \xrightarrow[n \to \infty]{d} N\!\left(3, \frac{24}{n}\right). \tag{10.23}$$
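As with the skewness, a short simulation sketch comparing the sample kurtosis of normal samples with the asymptotic moments in (10.23) and the exact small-sample formulas (10.21) and (10.22):

# Sample kurtosis of simulated normal samples
set.seed(1)
n <- 166; n_sim <- 2000
b2 <- replicate(n_sim, {
  x <- rnorm(n, mean = 1, sd = sqrt(2))
  s <- sqrt(mean((x - mean(x))^2))
  mean(((x - mean(x)) / s)^4)
})
# Empirical mean vs 3 and 3(n-1)/(n+1); empirical variance vs 24/n and the exact value
c(mean(b2), 3, 3 * (n - 1) / (n + 1))
c(var(b2), 24 / n, 24 * n * (n - 2) * (n - 3) / ((n + 1)^2 * (n + 3) * (n + 5)))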