9  Introduction

9.1 Population and Sample

A population refers to the entire group of individuals or instances about which we hope to learn. It encompasses all possible subjects or observations that meet a given set of criteria. The population is the complete set of items of interest to the researcher, and it can be finite (e.g. the students in a particular school) or infinite (e.g. the outcomes of repeatedly rolling a die). The population size is the number of distinct elements it contains, and the population includes every individual or observation of interest.

A sample is a subset of the population that is used to represent it. Since studying an entire population is often impractical due to constraints such as time, cost, and accessibility, samples provide a manageable and efficient way to gather data and make inferences about the population. For these inferences to be valid, the sample must be representative of the population of interest. It is also important to distinguish between a random sample (e.g. a randomly selected group of 5th-year students from a school, used to make inferences about all 5th-year students of that school) and a convenience sample (e.g. a class of 5th-year students who are easily accessible to the researcher), which may not be representative of all the 5th-year students in the school.

Aspect          | Population                          | Sample
Definition      | Entire group of interest            | Subset of the population
Size            | Large, potentially infinite         | Small, manageable
Data collection | Often impractical to study directly | Practical and feasible
Purpose         | To understand the whole group       | To make inferences about the population
Figure 9.1: Population vs sample.

Any finite sample is a collection of realizations of a finite number of random variables. For example, consider a random vector $X_n$, namely a sequence of $n$ random variables $X_n = (X_1, \dots, X_n)$. Let's now consider a possible realization of this random sample, namely $x_n = (x_1, \dots, x_n)$. While $X_n$ denotes a random vector whose law is given by the joint distribution of $(X_1, \dots, X_n)$, $x_n$ represents just one of the possible realizations of $X_n$.

In general, we distinguish between finite and infinite populations. In the case of a finite population with $N$ elements, we distinguish between two sampling schemes (a quick numerical check follows the list):

  • Sampling with replacement of $n$ elements gives $N^n$ possible samples.
  • Sampling without replacement of $n$ elements gives $\binom{N}{n}$ possible samples.
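As a quick check of these counts, the snippet below (a minimal sketch using base R only) enumerates both cases for an illustrative population of N = 10 elements and samples of size n = 3; the values 10 and 3 are arbitrary choices, not taken from the text.

# Counting the possible samples from a finite population (illustrative values)
N <- 10   # population size (arbitrary choice)
n <- 3    # sample size (arbitrary choice)
# With replacement: N^n possible samples
N^n          # 1000
# Without replacement: choose(N, n) possible samples
choose(N, n) # 120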

9.2 Estimators

Let's consider a statistical model depending on some unknown parameter $\theta$ contained in the parameter space $\Theta$, i.e. $\{P_\theta : \theta \in \Theta\}$. Then, given an observed sample from the statistical model, $x_n = (x_1, \dots, x_n)$, an estimator is a function that maps the sample space into the set of possible estimates. Formally, since $X_n = (X_1, \dots, X_n)$ is a collection of random variables, any function of the sample, such as the estimator of $\theta$, is itself a random variable, i.e. $X_n \rightsquigarrow x_n \mapsto \hat\theta(x_n)$: the random variables in $X_n$ generate a sample $x_n$, which is the input of the estimator function $\hat\theta(\cdot)$, which in turn outputs an estimate $\hat\theta(x_n)$. When we condition on a particular value of the sample $x_n$, we obtain a point estimate of the true $\theta$, i.e. $\hat\theta(x_n) = \hat\theta$, which is a number (or a vector).
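To make the mapping $x_n \mapsto \hat\theta(x_n)$ concrete, here is a minimal sketch in R: the estimator is simply an R function of the realized sample, and evaluating it on one draw returns a point estimate. The normal parameters and the sample size below are arbitrary illustrative values.

set.seed(1)
# An estimator is a function of the sample: here, the sample mean as estimator of mu
theta_hat <- function(x_n) mean(x_n)
# One realization x_n of the random sample X_n (arbitrary illustrative parameters)
x_n <- rnorm(100, mean = 2, sd = 2)
# Conditioning on this realization gives a single point estimate of mu
theta_hat(x_n)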

Since the estimator is itself a random variable, we can define some metrics to compare different estimators of the same parameter. First, consider the bias, the distance between the expected value of the estimator and the parameter being estimated, i.e. $\text{Bias}\{\hat\theta(X_n)\} = E\{\hat\theta(X_n)\} - \theta$. We distinguish between two kinds of estimators: biased, when
$$\hat\theta(X_n) \text{ biased} \iff \text{Bias}\{\hat\theta(X_n)\} \neq 0, \tag{9.1}$$
and unbiased, when
$$\hat\theta(X_n) \text{ unbiased} \iff \text{Bias}\{\hat\theta(X_n)\} = 0. \tag{9.2}$$
The variance indicates how far the estimates are, on average, from their expected value, i.e. $V\{\hat\theta(X_n)\} = E\{(\hat\theta(X_n) - E\{\hat\theta(X_n)\})^2\}$. Finally, the Mean Squared Error (MSE) of an estimator combines both quantities, i.e.
$$\text{MSE}\{\hat\theta(X_n)\} = \text{Bias}\{\hat\theta(X_n)\}^2 + V\{\hat\theta(X_n)\},$$
so for an unbiased estimator the mean squared error equals the variance.
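The sketch below approximates the bias, variance, and MSE of two estimators of $\mu$ by Monte Carlo, reusing the normal setup of the example that follows (μ = 2, σ = 2); the sample size n = 50 and the number of replications are arbitrary choices. It only illustrates the definitions above, checking numerically that MSE ≈ Bias² + Variance and that, for the unbiased sample mean, the MSE is close to the variance.

set.seed(1)
mu <- 2; sigma <- 2; n <- 50; n_sim <- 10000
# Simulate many samples and apply two estimators of mu
est_A <- replicate(n_sim, mean(rnorm(n, mu, sigma)))          # unbiased: sum(x) / n
est_B <- replicate(n_sim, sum(rnorm(n, mu, sigma)) / (n + 1)) # biased:   sum(x) / (n + 1)
metrics <- function(est, true) {
  bias <- mean(est) - true
  variance <- var(est)
  c(bias = bias, variance = variance,
    mse = mean((est - true)^2),         # direct Monte Carlo MSE
    bias2_plus_var = bias^2 + variance) # decomposition check
}
rbind(A = metrics(est_A, mu), B = metrics(est_B, mu))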

9.2.1 Properties

Some desirable properties of an estimator are the following:

  1. Consistency: An estimator $\hat\theta(X_n)$ is said to be consistent for the parameter $\theta$ if it converges in probability to the true parameter $\theta$ as the sample size $n \to \infty$: $\hat\theta(X_n) \xrightarrow{P} \theta$. Intuitively, as the sample grows, the estimator becomes arbitrarily close to the true population parameter with high probability; formally, for every $\varepsilon > 0$, $\lim_{n \to \infty} P\{\omega \in \Omega : |\hat\theta(X_n) - \theta| > \varepsilon\} = 0$. Consistency is a minimal requirement in statistics: without it, the estimator does not converge to the true value even with infinite data. In fact, consistency ensures that the estimator is learning from the data. If we could repeat the estimation process many times with increasing $n$, the histogram of $\hat\theta(x_n)$ would peak more and more tightly around the true $\theta$. Let's explore the bias and consistency properties of three different estimators of the expected value of an IID sample.

Example 9.1 Let's consider an IID normally distributed sample, i.e. $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$, and let's consider different estimators for the mean $\mu$.

A. Sample mean: The natural estimator for $\mu$ under normality is the sample mean, i.e. $\hat\mu(x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i$, which, by the strong law of large numbers, converges almost surely, and therefore in probability, to the true population parameter, i.e. $\hat\mu(X_n) \xrightarrow{a.s.} \mu$. Thus $\hat\mu$ is a consistent estimator for the true expectation $\mu$. Hence, with more and more observations, the average will get closer to the true population mean. If we plot $\hat\mu$ against $n$, the fluctuations shrink around the true $\mu$.

B. Biased but consistent estimator: Suppose we estimate $\mu$ using an estimator of the form $\tilde\mu(x_n) = \frac{1}{n+1}\sum_{i=1}^{n} x_i$. This estimator is biased, since $E\{\tilde\mu(X_n)\} = \frac{n}{n+1}\mu \neq \mu$. However, as $n \to \infty$ the bias vanishes, since $\lim_{n \to \infty} \frac{n}{n+1} = 1$ and hence $\frac{n}{n+1}\mu \to \mu$, while the variance still shrinks like $1/n$.

Even if $\tilde\mu(X_n)$ is biased, it is consistent. Hence, bias does not necessarily prevent consistency, provided the bias disappears as $n$ grows. In fact, small-sample bias can coexist with asymptotic correctness.

C. Inconsistent estimator: Let's define as estimator for $\mu$ the value of the first observation, i.e. $\check\mu(x_n) = x_1$. Clearly, $\check\mu(X_n) \sim N(\mu, \sigma^2)$ for all $n$, so its distribution never concentrates around $\mu$. Formally, $P(|\check\mu(X_n) - \mu| > \varepsilon) = P(|X_1 - \mu| > \varepsilon) \not\to 0$. Thus, $\check\mu$ is unbiased, but not consistent. Using only one observation, regardless of sample size, ignores the information contained in the rest of the data, and the variance of the estimator does not shrink as $n$ grows.

Example: consistency of the estimators A, B and C of the mean
library(dplyr)
set.seed(1)
# =======================================
#                Inputs 
# =======================================
# True parameters
par <- c(mu = 2, sigma = 2)
# True moments (normal)
moments <- c(par[1], par[2])
# Observations for each sample
n <- c(5, 10, 100, 200, 300, 400, 500)
# Number of samples 
n.sample <- 1000
# Confidence interval 
alpha <- 0.005
# =======================================
# Consistency function 
example_consistency <- function(n = 100, n.sample = 1000, par, moments, alpha = 0.005){
  estimates <- list()
  for(i in 1:n.sample){
    x_n <- rnorm(n, par[1], par[2]) 
    # Case A
    mu_A <- sum(x_n) / n
    # Case B
    mu_B <- mu_A / (n+1) * n
    # Case C
    mu_C <- x_n[1]
    estimates[[i]] <- dplyr::tibble(i = i, n = n,
                                    A = mu_A, 
                                    B = mu_B, 
                                    C = mu_C, 
                                    dw = qnorm(alpha, moments[1], moments[2]/sqrt(n)),
                                    up = qnorm(1-alpha, moments[1], moments[2]/sqrt(n)))
  }
  dplyr::bind_rows(estimates) %>%
  group_by(n) %>%
  mutate(e_A = mean(A), q_dw_A = quantile(A, alpha), q_up_A = quantile(A, 1 - alpha),
         e_B = mean(B), q_dw_B = quantile(B, alpha), q_up_B = quantile(B, 1 - alpha),
         e_C = mean(C), q_dw_C = quantile(C, alpha), q_up_C = quantile(C, 1 - alpha))
}
# =======================================
# Generate data 
data <- purrr::map_df(n, ~example_consistency(.x, n.sample, par, moments, alpha)) 
library(ggplot2)
fig_A <- ggplot(data)+
  geom_point(aes(n, A))+
  geom_line(aes(n, moments[1], color = "theoric"))+
  geom_line(aes(n, e_A, color = "empiric"), linetype = "dashed")+
  geom_line(aes(n, dw, color = "theoric"))+
  geom_line(aes(n, up, color = "theoric"))+
  geom_line(aes(n, q_up_A, color = "empiric"), linetype = "dashed")+
  geom_line(aes(n, q_dw_A, color = "empiric"), linetype = "dashed")+
  theme_bw()+
  scale_color_manual(
    values = c(theoric = "red", empiric = "purple"),
    labels = c(theoric = "Theoric", empiric = "Empiric")
  )+
  theme(legend.position = "none")+
  labs(x = "", y = "", subtitle = "A", color = "")+
  scale_y_continuous(limits = range(c(data$A, data$B, data$C)))

fig_B <- ggplot(data)+
  geom_point(aes(n, B))+
  geom_line(aes(n, moments[1], color = "theoric"))+
  geom_line(aes(n, e_B, color = "empiric"), linetype = "dashed")+
  geom_line(aes(n, dw, color = "theoric"))+
  geom_line(aes(n, up, color = "theoric"))+
  geom_line(aes(n, q_up_B, color = "empiric"), linetype = "dashed")+
  geom_line(aes(n, q_dw_B, color = "empiric"), linetype = "dashed")+
  theme_bw()+
  scale_color_manual(
    values = c(theoric = "red", empiric = "purple"),
    labels = c(theoric = "Theoric", empiric = "Empiric")
  )+
  theme(legend.position = "none")+
  labs(x = "Sample size", y = "", subtitle = "B", color = "")+
  scale_y_continuous(limits = range(c(data$A, data$B, data$C)))

fig_C <- ggplot(data)+
  geom_point(aes(n, C))+
  geom_line(aes(n, moments[1], color = "theoric"))+
  geom_line(aes(n, e_C, color = "empiric"), linetype = "dashed")+
  geom_line(aes(n, dw, color = "theoric"))+
  geom_line(aes(n, up, color = "theoric"))+
  geom_line(aes(n, q_up_C, color = "empiric"), linetype = "dashed")+
  geom_line(aes(n, q_dw_C, color = "empiric"), linetype = "dashed")+
  theme_bw()+
  scale_color_manual(
    values = c(theoric = "red", empiric = "purple"),
    labels = c(theoric = "Theoric", empiric = "Empiric")
  )+
  theme(legend.position = "none")+
  labs(x = "", y = "", subtitle = "C", color = "")+
  scale_y_continuous(limits = range(c(data$A, data$B, data$C)))

gridExtra::grid.arrange(fig_A, fig_B, fig_C, nrow = 1)

Estimates of μ computed on 1000 samples with different numbers of observations, using estimators A, B and C. The theoretical confidence interval and true expectation of μ are shown in red, the empirical ones in purple.
  2. Efficiency: Among unbiased estimators, the one with minimal variance is called efficient; an estimator is asymptotically efficient if it attains the Cramér–Rao lower bound (CRLB) as $n \to \infty$ (see the sketch after this list).

  3. Asymptotic normality: for some normalizing sequence $a_n$ depending on $n$ (typically $a_n = \sqrt{n}$), the estimator converges in distribution to a normal random variable, $a_n(\hat\theta(X_n) - \theta) \xrightarrow{d} N(0, \sigma^2(\theta))$ as $n \to \infty$.
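Both properties can be checked by simulation for the sample mean in the normal model of Example 9.1; the sketch below is only an illustration, with μ = 2 and σ = 2 as in the simulation inputs above and an arbitrary choice of n and number of replications. The empirical variance of $\hat\mu$ should be close to the CRLB $\sigma^2/n$ for estimating a normal mean, and the standardized estimates $\sqrt{n}(\hat\mu - \mu)/\sigma$ should look approximately standard normal.

set.seed(1)
mu <- 2; sigma <- 2; n <- 200; n_sim <- 5000
mu_hat <- replicate(n_sim, mean(rnorm(n, mu, sigma)))
# Efficiency: empirical variance of the sample mean vs the CRLB sigma^2 / n
c(empirical = var(mu_hat), crlb = sigma^2 / n)
# Asymptotic normality: sqrt(n) * (mu_hat - mu) / sigma should be approximately N(0, 1)
z <- sqrt(n) * (mu_hat - mu) / sigma
c(mean = mean(z), sd = sd(z))  # close to 0 and 1
qqnorm(z); qqline(z)           # visual check of normality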

9.2.2 Sufficiency and Completeness

Theorem 9.1 (Factorization Theorem)
A statistic $T(X_n)$ is sufficient for the parameter $\theta$ if and only if the joint density (or probability mass function) can be factorized as $f(x_n \mid \theta) = g(T(x_n), \theta)\, h(x_n)$, for some non-negative functions $g$ and $h$, where $g$ depends on the data only through $T(x_n)$ and $\theta$, while $h$ does not depend on the parameter $\theta$.

In other words, a statistic is sufficient if it captures all the information in the sample about θ. After observing a sufficient statistic, the sample provides no further information about the parameter θ. Equivalently, the conditional distribution of the full data given the sufficient statistic does not depend on θ. A natural consequence of sufficiency is that two samples with the same value for the sufficient statistic should result in the same inference.

Example 9.2 If $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} \text{Bernoulli}(p)$, the likelihood given $p$ is $f(x_n \mid p) = p^{\sum_{i=1}^{n} x_i}(1-p)^{n - \sum_{i=1}^{n} x_i}$. Here $K(X_n) = \sum_{i=1}^{n} X_i$ is sufficient for $p$, since the likelihood factors as $g(K, p) = p^{K}(1-p)^{n-K}$ with $h(x_n) = 1$. Intuitively: for Bernoulli trials, only the total number of successes matters, not the specific order.
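As a quick numerical illustration (a minimal sketch, not part of the original example), the snippet below evaluates the Bernoulli likelihood for two samples of size n = 10 that contain the same number of successes in a different order: the likelihoods coincide for every p, exactly as sufficiency of $K(X_n)$ suggests.

# Bernoulli likelihood as a function of the sample and p
lik <- function(x, p) p^sum(x) * (1 - p)^(length(x) - sum(x))
# Two samples with the same number of successes, in a different order
x_a <- c(1, 1, 0, 0, 1, 0, 0, 0, 1, 0)
x_b <- c(0, 0, 0, 1, 1, 0, 1, 0, 0, 1)
p_grid <- seq(0.1, 0.9, by = 0.2)
# Identical likelihoods: only K = sum(x) matters, not the order
rbind(lik_x_a = lik(x_a, p_grid), lik_x_b = lik(x_b, p_grid))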

Theorem 9.2 (Rao–Blackwell)
If $T(X_n)$ is sufficient for $\theta$ and $\hat\theta(X_n)$ is any unbiased estimator of $\theta$, then the Rao–Blackwellized estimator
$$\hat\theta_{RB}(X_n) = E\{\hat\theta(X_n) \mid T(X_n)\}, \tag{9.3}$$
is unbiased for $\theta$, i.e. $E\{\hat\theta_{RB}(X_n)\} = \theta$, and has a lower (or equal) variance than $\hat\theta(X_n)$, i.e. $V\{\hat\theta_{RB}(X_n)\} \leq V\{\hat\theta(X_n)\}$.

In other words, conditioning on a sufficient statistic cannot increase variance. So Rao–Blackwell gives a systematic way to improve an estimator: start with any unbiased estimator, then condition it on a sufficient statistic to possibly reduce the variance.
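For a concrete (hedged) illustration in the Bernoulli setting of Example 9.2: $\hat p(X_n) = X_1$ is unbiased for $p$, and conditioning it on the sufficient statistic $K(X_n) = \sum_i X_i$ gives $E\{X_1 \mid K\} = K/n$, the sample mean. The sketch below compares the two estimators by Monte Carlo (p = 0.3, n = 20 and the number of replications are arbitrary choices) and shows the variance reduction predicted by Rao–Blackwell.

set.seed(1)
p <- 0.3; n <- 20; n_sim <- 10000
samples <- matrix(rbinom(n * n_sim, size = 1, prob = p), nrow = n_sim)
est_raw <- samples[, 1]      # unbiased but crude: first observation only
est_rb  <- rowMeans(samples) # Rao-Blackwellized: E{X_1 | K} = K / n
# Both are (approximately) unbiased, but conditioning reduces the variance
rbind(raw = c(mean = mean(est_raw), var = var(est_raw)),
      rb  = c(mean = mean(est_rb),  var = var(est_rb)))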

Definition 9.1 (Completeness)
A statistic $T(X_n)$ is said to be complete if, for any measurable function $g$ and for all $\theta \in \Theta$, $E_\theta\{g(T(X_n))\} = 0$ implies that $g(T(X_n)) = 0$ almost surely.

Intuition: if a function of the statistic averages out to zero no matter what the parameter is, then it must be the trivial zero function. Completeness rules out the existence of "hidden" non-trivial functions of the statistic that are unbiased estimators of zero, and it ensures uniqueness of unbiased estimators derived from the statistic.

Example 9.3 Continuing from Example 9.2, the statistic $K(X_n) = \sum_{i=1}^{n} X_i$ is not only sufficient but also complete for $p$: if a function $g$ satisfies $E\{g(K(X_n))\} = 0$ for all $p$, then necessarily $g \equiv 0$. Suppose there exists some function $g$ such that, for all $p \in (0,1)$,
$$E\{g(K)\} = \sum_{t=0}^{n} g(t)\binom{n}{t} p^{t}(1-p)^{n-t} = 0.$$
The left-hand side is a polynomial in $p$ (because of the binomial expansion). If this polynomial is identically zero for all $p$, then all its coefficients must be zero, and so $g(t) = 0$ for every $t$.
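The polynomial argument can also be seen numerically (a minimal sketch, not part of the original example): for a binomial statistic with n = 5, evaluating $E\{g(K)\}$ at n + 1 distinct values of p gives a linear system whose matrix has full rank, so the only vector $g$ satisfying $E\{g(K)\} = 0$ at all those p is $g = 0$.

n <- 5
p_grid <- seq(0.1, 0.9, length.out = n + 1)  # n + 1 distinct values of p
# Matrix M with M[j, t + 1] = P(K = t | p_j): rows index p, columns index t = 0, ..., n
M <- sapply(0:n, function(t) dbinom(t, size = n, prob = p_grid))
# Full rank: M %*% g = 0 only for g = 0, mirroring the completeness argument
qr(M)$rank == n + 1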

Theorem 9.3 (Lehmann–Scheffé Theorem)
If $T(X_n)$ is a complete and sufficient statistic for $\theta$, and if $\hat\theta_{RB}(X_n)$, as defined in Equation 9.3, is unbiased for $\theta$, then $\hat\theta_{RB}(X_n)$ is the unique Uniformly Minimum-Variance Unbiased (UMVU) estimator of $\theta$.

In practice, sufficiency ensures that $\hat\theta_{RB}(X_n)$ uses all the available information, and completeness ensures that no other unbiased estimator based on the statistic can have smaller variance. Thus, the Lehmann–Scheffé theorem identifies the best possible unbiased estimator.

Example 9.4 Continuing from Example 9.3, $K(X_n) = \sum_{i=1}^{n} X_i$ is complete and sufficient for $p$. Since the sample mean $K(X_n)/n$ is unbiased for $p$, by Lehmann–Scheffé it is the UMVU estimator of $p$.
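As a closing (hedged) check of the UMVU property, the simulation below compares the sample mean with two other unbiased estimators of p, namely the mean of the first half of the observations and the average of the first two observations (p = 0.3 and n = 20 are arbitrary choices): all three are approximately unbiased, but the sample mean has the smallest variance.

set.seed(1)
p <- 0.3; n <- 20; n_sim <- 10000
samples <- matrix(rbinom(n * n_sim, size = 1, prob = p), nrow = n_sim)
est_mean   <- rowMeans(samples)              # UMVU: K / n
est_half   <- rowMeans(samples[, 1:(n / 2)]) # unbiased, uses half the data
est_first2 <- rowMeans(samples[, 1:2])       # unbiased, uses two observations
round(rbind(sample_mean = c(mean = mean(est_mean),   var = var(est_mean)),
            half_sample = c(mean = mean(est_half),   var = var(est_half)),
            first_two   = c(mean = mean(est_first2), var = var(est_first2))), 4)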