23  Hypothesis tests

Setup
library(dplyr)
library(ggplot2)
# ================== Setups ==================
n <- 500 # sample size 
set.seed(1) # random seed 
mu_0 <- 2.4 # H0 mean 
mu_true <- 2 # true mean
alpha <- 0.05 # significance level
# ============================================
# Simulated random variable 
x <- rnorm(n, mean = mu_true, sd = 4)
# Grid of points for pdf
x_limits <- c(-4,4)
x_grid <- seq(x_limits[1], x_limits[2], by = 0.01)
x_breaks <- seq(x_limits[1], x_limits[2], by = 1)

A statistical hypothesis is a claim about the value of a parameter or population characteristic. In any hypothesis-testing problem, there are always two competing hypotheses under consideration:

  1. The null hypothesis $H_0$, representing the status quo.
  2. The alternative hypothesis $H_1$, representing the research claim.

The objective of hypothesis testing is to decide, based on sample information, whether the alternative hypothesis is actually supported by the data. One usually conducts new research to challenge existing beliefs.

Is there strong evidence for the alternative?

Suppose you want to establish that the null hypothesis $H_0$ is not supported by the data. One usually works under the assumption of $H_0$; if the sample does not strongly contradict $H_0$, we continue to believe in its plausibility. There are only two possible conclusions: reject $H_0$ or fail to reject $H_0$.

Definition 23.1 The test statistic $T(X_n)$ is a function of a sample $X_n$ and is used to decide whether the null hypothesis should be rejected or not. In theory, there is an infinite number of possible tests that could be devised. The choice of a particular test procedure must be based on the probability that the test will produce incorrect results. In general, two kinds of errors are associated with a test statistic:

  1. A type I error occurs when the null hypothesis is rejected although it is true.
  2. A type II error occurs when the null hypothesis is not rejected although it is false.

The p-value is in general related to the probability of a type I error: the smaller the p-value, the more evidence there is in the sample data against the null hypothesis and in favor of the alternative hypothesis.

In general, before performing a test one establishes a significance level $\alpha$ (the desired type I error probability), which defines the rejection region. The decision rule is then:
$$\text{Reject } H_0 \iff p\text{-value} \le \alpha, \qquad \text{Do not reject } H_0 \iff p\text{-value} > \alpha.$$
The p-value can be thought of as the smallest significance level at which $H_0$ can be rejected, and its calculation depends on whether the test is upper-, lower-, or two-tailed.
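
As a minimal sketch of this decision rule in R (the p-value below is a hypothetical placeholder, not computed from the data above):

P-value decision rule
# Hypothetical p-value returned by some test (placeholder value)
p_value <- 0.03
# Decision rule: reject H0 when the p-value does not exceed alpha
if (p_value <= alpha) "Reject H0" else "Fail to reject H0"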

For example, let’s consider a sample $X_n$ of data. A statistical test then consists of the following:

  1. an assumption about the distribution of the data, often expressed in terms of a statistical model $M$;
  2. a null hypothesis $H_0$ and an alternative hypothesis $H_1$, which make specific statements about the data;
  3. a test statistic $T(X_n)$, which is a function of the data and whose distribution under the null hypothesis is known;
  4. a significance level $\alpha$, which imposes an upper bound on the probability of rejecting $H_0$ given that $H_0$ is true.

The general procedure for a statistical hypothesis test can be summarized as follows:

  1. Inputs: consider a null hypothesis $H_0$ and the significance level $\alpha$.
  2. Critical value: compute the value $t_\alpha$ that partitions the set of possible values of $T(X_n)$ into rejection and non-rejection regions.
  3. Output: compare the observed test statistic $T(X_n)$, computed on the sample, with the critical value $t_\alpha$. If it falls in the rejection region, $H_0$ is rejected in favor of $H_1$; otherwise, the test fails to reject $H_0$.
| Step | Description |
|------|-------------|
| Inputs | $H_0$, $\alpha$ |
| Critical value | Critical value $t_\alpha$ |
| Output | Rejection or not, depending on $T(X_n)$ |
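
These steps can be wrapped in a small helper function; the following is an illustrative sketch (the function name and arguments are hypothetical), assuming a two-sided test whose statistic is Student-t under $H_0$:

General test procedure
# Illustrative helper: inputs (observed statistic, df, alpha), critical value, output
t_test_decision <- function(t_obs, df, alpha = 0.05) {
  t_crit <- qt(1 - alpha/2, df = df) # critical value t_{alpha/2}
  if (abs(t_obs) > t_crit) "Reject H0" else "Fail to reject H0"
}
t_test_decision(t_obs = 2.1, df = 499) # hypothetical observed statistic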

In general, two kinds of tests are available: one-tailed (left or right) and two-tailed tests.

23.1 Left- and right-tailed tests

For example, let’s simulate a sample $X_n$ of $n = 500$ observations from a normal distribution (i.e. $X_n \sim N(2, 4^2)$) and consider the following set of hypotheses:
$$H_0: \mu(X) = 2.4 \qquad H_1: \mu(X) \neq 2.4.$$
The test statistic is defined as
$$T(X_n) = \sqrt{500}\,\frac{\mu(X_n) - 2.4}{\sigma(X_n)} \overset{H_0}{\sim} t(499).$$
Since it is a two-tailed test, the critical value for a significance level $\alpha$, denoted $t_{\alpha/2}$, is such that
$$\alpha = P\big(\{T(X_n) < -t_{\alpha/2}\} \cup \{T(X_n) > t_{\alpha/2}\}\big) \quad \Longrightarrow \quad t_{\alpha/2} = P^{-1}(1 - \alpha/2),$$
where $P^{-1}$ and $P$ are respectively the quantile and distribution functions of a Student-t. If $|T(X_n)| > t_{\alpha/2}$, we reject $H_0$ and conclude that the mean of the sample is significantly different from 2.4. More precisely, with $\alpha = 0.05$, the critical value of a Student-t with 499 degrees of freedom is $t_{\alpha/2} = 1.9647$.

Two-tailed test
# Statistic T
z <- sqrt(n)*(mean(x) - mu_0)/sd(x)
# Student-t density
pdf <- dt(x_grid, df = n-1)
# Critical value left 
z_left <- c(qt(alpha/2, df = n-1), dt(qt(alpha/2, df = n-1), df = n-1))
# Critical value right 
z_right <- c(qt(1-alpha/2, df = n-1), dt(qt(1-alpha/2, df = n-1), df = n-1))
Plot t-test
# Area left tail 
x_left <- x_grid[x_grid < z_left[1]]
y_left <- dt(x_left, df = n-1)
# Area right tail 
x_right <- x_grid[x_grid > z_right[1]]
y_right <- dt(x_right, df = n-1)
# Central area
x_centre <- x_grid[x_grid > z_left[1] & x_grid < z_right[1]]
y_centre <- dt(x_centre, df = n-1)
ggplot()+
  geom_segment(aes(x = z_left[1], xend = z_left[1], y = 0, yend = z_left[2]), color = "red")+
  geom_segment(aes(x = z_right[1], xend = z_right[1], y = 0, yend = z_right[2]), color = "red")+
  geom_ribbon(aes(x = x_centre, ymin = 0, ymax = y_centre, fill = "norej"), alpha = 0.3)+
  geom_ribbon(aes(x = x_left, ymin = 0, ymax = y_left, fill = "rej"), alpha = 0.3)+
  geom_ribbon(aes(x = x_right, ymin = 0, ymax = y_right, fill = "rej"), alpha = 0.3)+
  geom_line(aes(x_grid, pdf))+
  geom_point(aes(z, 0), color = "black")+
  scale_fill_manual(values = c(rej = "red", norej = "green"), 
                    labels = c(rej = "Rejection", norej = "No rejection")) + 
  scale_x_continuous(breaks = x_breaks) +
  labs(y = "", x = "x", fill = NULL)+
  theme_bw()+
  theme(
    legend.position = c(.95, .95),
    legend.justification = c("right", "top"),
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6),
    panel.grid = element_blank())
Figure 23.1: Two-tailed test on the mean.
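
Equivalently, the decision can be based on the two-tailed p-value computed from the objects defined above; if it does not exceed $\alpha$, $H_0$ is rejected:

Two-tailed p-value
# Probability under H0 of a statistic at least as extreme as |z|
p_value <- 2 * pt(-abs(z), df = n - 1)
p_value <= alpha # if TRUE, reject H0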

Let’s consider another kind of hypothesis:
$$H_0: \mu(X) \ge 2.4 \qquad H_1: \mu(X) < 2.4.$$
The test statistic $T(X_n)$ does not change; however, the alternative hypothesis implies a left-tailed test. Hence, the critical value $t_\alpha$ is such that $P(T(X_n) < t_\alpha) = \alpha$. Applying the quantile function $P^{-1}$ of a Student-t we obtain:
$$t_\alpha = P^{-1}(\alpha),$$
where $P^{-1}$ and $P$ are respectively the quantile and distribution functions of a Student-t. In this case, with $\alpha = 0.05$, the critical value of a Student-t with 499 degrees of freedom is $t_\alpha = -1.6479$. Therefore, if $T(X_n) > -1.6479$ we do not reject the null hypothesis, i.e. there is no evidence that $\mu(X_n)$ is lower than $\mu_0$; otherwise we reject it and conclude that $\mu(X_n)$ is significantly lower than $\mu_0$.

Left-tailed test
# Critical value left 
z_left <- c(qt(alpha, df = n-1), dt(qt(alpha, df = n-1), df = n-1))
# Rejection area (left tail)
x_left <- x_grid[x_grid < z_left[1]]
y_left <- dt(x_left, df = n-1)
# Non-rejection area (right of the critical value)
x_right <- x_grid[x_grid > z_left[1]]
y_right <- dt(x_right, df = n-1)
ggplot()+
  geom_segment(aes(x = z_left[1], xend = z_left[1], y = 0, yend = z_left[2]), color = "red")+
  geom_ribbon(aes(x = x_left, ymin = 0, ymax = y_left, fill = "rej"), alpha = 0.3)+
  geom_ribbon(aes(x = x_right, ymin = 0, ymax = y_right, fill = "norej"), alpha = 0.3)+
  geom_line(aes(x_grid, pdf))+
  geom_point(aes(z, 0), color = "black")+
  scale_fill_manual(values = c(rej = "red", norej = "green"), 
                    labels = c(rej = "Rejection", norej = "No rejection")) + 
  scale_x_continuous(breaks = x_breaks) +
  labs(y = "", x = "x", fill = NULL)+
  theme_bw()+
  theme(
    legend.position = c(.95, .95),
    legend.justification = c("right", "top"),
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6),
    panel.grid = element_blank())
Figure 23.2: Left-tailed test on the mean.

In this case we reject the null hypothesis, hence $\mu(X_n)$ is significantly lower than $\mu_0$.
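
The same conclusion follows from the left-tailed p-value:

Left-tailed p-value
# Probability under H0 of a statistic lower than the observed z
p_value <- pt(z, df = n - 1)
p_value <= alpha # TRUE here: reject H0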

Lastly, let’s consider the right-tailed case:
$$H_0: \mu(X) \le 2.4 \qquad H_1: \mu(X) > 2.4.$$
It is again a one-sided test, but in this case right-tailed. Hence, the critical value $t_\alpha$ is such that $P(T(X_n) < t_\alpha) = 1 - \alpha$, i.e.
$$t_\alpha = P^{-1}(1 - \alpha),$$
where $P^{-1}$ and $P$ are respectively the quantile and distribution functions of a Student-t. In this case, with $\alpha = 0.05$, the critical value of a Student-t with 499 degrees of freedom is $t_\alpha = 1.6479$. Therefore, if $T(X_n) < 1.6479$ we do not reject the null hypothesis, i.e. there is no evidence that $\mu(X_n)$ is greater than $\mu_0$; otherwise we reject it and conclude that $\mu(X_n)$ is greater than $\mu_0$. Coherently with the left-tailed test above, the right-tailed test is not rejected here, consistent with $\mu(X_n)$ being lower than $\mu_0 = 2.4$.

Right-tailed test
# Critical value right 
z_right <- c(qt(1-alpha, df = n-1), dt(qt(1-alpha, df = n-1), df = n-1))
# Rejection area (right tail)
x_rej <- x_grid[x_grid > z_right[1]]
y_rej <- dt(x_rej, df = n-1)
# Non-rejection area (left of the critical value)
x_norej <- x_grid[x_grid < z_right[1]]
y_norej <- dt(x_norej, df = n-1)
ggplot()+
  geom_segment(aes(x = z_right[1], xend = z_right[1], y = 0, yend = z_right[2]), color = "red")+
  geom_ribbon(aes(x = x_rej, ymin = 0, ymax = y_rej, fill = "rej"), alpha = 0.3)+
  geom_ribbon(aes(x = x_norej, ymin = 0, ymax = y_norej, fill = "norej"), alpha = 0.3)+
  geom_line(aes(x_grid, pdf))+
  geom_point(aes(z, 0), color = "black")+
  scale_fill_manual(values = c(rej = "red", norej = "green"), 
                    labels = c(rej = "Rejection", norej = "No rejection")) + 
  scale_x_continuous(breaks = x_breaks) +
  labs(y = "", x = "x", fill = NULL)+
  theme_bw()+
  theme(
    legend.position = c(.95, .95),
    legend.justification = c("right", "top"),
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6),
    panel.grid = element_blank())
Figure 23.3: Right-tailed test on the mean.
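
Again, the p-value leads to the same conclusion; for a right-tailed test it is the upper-tail probability of the observed statistic:

Right-tailed p-value
# Probability under H0 of a statistic greater than the observed z
p_value <- pt(z, df = n - 1, lower.tail = FALSE)
p_value <= alpha # FALSE here: fail to reject H0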

23.2 Tests for the means

Proposition 23.1 Let’s consider the t-test for the mean of a sample of independent, identically and normally distributed random variables $X_n = (x_1, \dots, x_i, \dots, x_n)$. Then the test statistic $T(X_n)$ under $H_0: \mu(X) = \mu_0$ is Student-t distributed with $n-1$ degrees of freedom, i.e.
$$T(X_n) = \sqrt{n}\,\frac{\hat{\mu}(X_n) - \mu_0}{\hat{s}(X_n)} \overset{H_0}{\sim} t_{n-1},$$
where $\hat{\mu}(X_n)$ is the sample mean and $\hat{s}^2(X_n)$ the corrected sample variance. Moreover, for $n \to \infty$:
$$T(X_n) \xrightarrow[n \to \infty]{H_0} N(0,1).$$
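
The statistic in Proposition 23.1 is the one computed manually above; as a sanity check, it can be compared with R’s built-in t.test() on the simulated sample x:

Built-in t-test
# One-sample t-test of H0: mu = mu_0 against a two-sided alternative
test <- t.test(x, mu = mu_0, alternative = "two.sided")
test$statistic # equal to sqrt(n)*(mean(x) - mu_0)/sd(x)
test$p.value   # two-tailed p-value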

Proof. If the sample is normally distributed, the sample mean is also normally distributed, i.e.
$$M = \sqrt{n}\,\frac{\hat{\mu}(X_n) - \mu_0}{\sigma} \sim N(0,1).$$
Under normality the sample variance, which is a sum of squares of independent and normally distributed random variables, follows a $\chi^2$ distribution with $n-1$ degrees of freedom, i.e.
$$V = (n-1)\,\frac{\hat{s}^2(X_n)}{\sigma^2} \sim \chi^2_{n-1}.$$
Notably, a standard normal divided by the square root of an independent $\chi^2$ random variable over its degrees of freedom is exactly the definition of a Student-t random variable. Hence, the ratio of the statistics $M$ and $\sqrt{V/(n-1)}$ reads
$$\frac{M}{\sqrt{V/(n-1)}} = \sqrt{n}\,\frac{\hat{\mu}(X_n) - \mu_0}{\sigma}\sqrt{\frac{\sigma^2}{\hat{s}^2(X_n)}} = \sqrt{n}\,\frac{\hat{\mu}(X_n) - \mu_0}{\hat{s}(X_n)} \sim t_{n-1}.$$
The test statistic under $H_0$ therefore follows a Student-t distribution with $n-1$ degrees of freedom. Notably, for large IID samples the statistic converges to a normal random variable independently of the distribution of $X$.
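
The distributional result can also be illustrated by simulation; a minimal sketch (the sample size and number of replications are arbitrary choices) comparing empirical and theoretical quantiles of the statistic under $H_0$:

Monte Carlo check
# Simulate the statistic under H0 and compare with the t(m-1) quantiles
set.seed(2)
m <- 30 # hypothetical sample size
t_stats <- replicate(10000, {
  s <- rnorm(m, mean = mu_0, sd = 4) # sample generated under H0
  sqrt(m) * (mean(s) - mu_0) / sd(s)
})
quantile(t_stats, 0.95) # empirical 95% quantile
qt(0.95, df = m - 1)    # theoretical 95% quantile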

23.2.1 Test for two means and equal variances

Let’s consider two independent Gaussian populations with equal variance, i.e.
$$X_1 \sim N(\mu_1, \sigma^2), \qquad X_2 \sim N(\mu_2, \sigma^2).$$
Then, let’s consider two samples of possibly unequal sizes $n_1$ and $n_2$, with unknown means $\mu_1$ and $\mu_2$ and an equal unknown variance $\sigma^2$. Given the null hypothesis $H_0: \mu_1 - \mu_2 = \mu_\Delta$, the test statistic
$$T(X_{n_1}, X_{n_2}) = \frac{\mu(X_{n_1}) - \mu(X_{n_2}) - \mu_\Delta}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1+n_2-2}$$
is Student-t distributed with $n_1+n_2-2$ degrees of freedom, where the pooled standard deviation is
$$s_p = \sqrt{\frac{(n_1-1)\,\hat{s}^2(X_{n_1}) + (n_2-1)\,\hat{s}^2(X_{n_2})}{n_1+n_2-2}},$$
and $\hat{s}^2(X_{n_1})$ and $\hat{s}^2(X_{n_2})$ are the corrected sample variances of the two samples.
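
A minimal sketch with two simulated samples (sample sizes and parameters are illustrative choices, not from the text), computing the pooled statistic by hand and via t.test():

Pooled two-sample t-test
set.seed(3)
n1 <- 40; n2 <- 60
x1 <- rnorm(n1, mean = 2.0, sd = 4)
x2 <- rnorm(n2, mean = 2.4, sd = 4) # equal variances by construction
# Pooled standard deviation
s_p <- sqrt(((n1 - 1)*var(x1) + (n2 - 1)*var(x2)) / (n1 + n2 - 2))
# Statistic under H0: mu1 - mu2 = 0
(mean(x1) - mean(x2)) / (s_p * sqrt(1/n1 + 1/n2))
# Equivalent built-in call
t.test(x1, x2, var.equal = TRUE)$statistic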

23.2.2 Test for two means and unequal variances

Let’s consider two independent Gaussian populations with different variances, i.e.
$$X_1 \sim N(\mu_1, \sigma_1^2), \qquad X_2 \sim N(\mu_2, \sigma_2^2).$$
Then, let’s consider two samples of possibly unequal sizes $n_1$ and $n_2$, with unknown means $\mu_1$ and $\mu_2$ and unequal unknown variances $\sigma_1^2$ and $\sigma_2^2$. Given the null hypothesis $H_0: \mu_1 - \mu_2 = \mu_\Delta$, Welch proposed the test statistic
$$T(X_{n_1}, X_{n_2}) = \frac{\mu(X_{n_1}) - \mu(X_{n_2}) - \mu_\Delta}{\sqrt{\frac{\hat{s}^2(X_{n_1})}{n_1} + \frac{\hat{s}^2(X_{n_2})}{n_2}}} \sim t_\nu,$$
which under the null hypothesis follows approximately a Student-t distribution, but with fractional degrees of freedom computed using the Welch–Satterthwaite approximation. This is a weighted combination of the degrees of freedom of each group, reflecting the uncertainty due to unequal variances, i.e.
$$\nu = \frac{\left(\frac{\hat{s}^2(X_{n_1})}{n_1} + \frac{\hat{s}^2(X_{n_2})}{n_2}\right)^2}{\frac{\left(\hat{s}^2(X_{n_1})\right)^2}{n_1^2(n_1-1)} + \frac{\left(\hat{s}^2(X_{n_2})\right)^2}{n_2^2(n_2-1)}},$$
where $\nu$ is not necessarily an integer.
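
Similarly for the Welch test; the sketch below (again with illustrative samples) computes the statistic and the Welch–Satterthwaite degrees of freedom by hand, then checks them against t.test(), whose default is var.equal = FALSE:

Welch two-sample t-test
set.seed(4)
y1 <- rnorm(40, mean = 2.0, sd = 2)
y2 <- rnorm(60, mean = 2.4, sd = 5) # unequal variances by construction
se <- sqrt(var(y1)/40 + var(y2)/60)
# Statistic under H0: mu1 - mu2 = 0
(mean(y1) - mean(y2)) / se
# Welch-Satterthwaite degrees of freedom
se^4 / ((var(y1)/40)^2/(40 - 1) + (var(y2)/60)^2/(60 - 1))
# Equivalent built-in call
t.test(y1, y2, var.equal = FALSE)[c("statistic", "parameter")]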

23.3 Tests for the variances

23.3.1 F-test for two variances

Consider two independent normal samples, i.e.
$$X_{n_1} \sim N(\mu_1, \sigma_1^2), \qquad X_{n_2} \sim N(\mu_2, \sigma_2^2),$$
where $n_1$ and $n_2$ are the numbers of observations in each sample, and the null hypothesis $H_0: \sigma_1^2 = \sigma_2^2 = \sigma^2$. Knowing that the sample variance is $\chi^2$-distributed, let’s define the variables:
$$T_1 = (n_1-1)\,\frac{\hat{s}_1^2}{\sigma_1^2} \sim \chi^2_{n_1-1}, \qquad T_2 = (n_2-1)\,\frac{\hat{s}_2^2}{\sigma_2^2} \sim \chi^2_{n_2-1}.$$
Then, since the ratio of two independent $\chi^2$ random variables, each divided by its respective degrees of freedom, is F-distributed, the statistic is defined as:
$$T(X_{n_1}, X_{n_2}) = \frac{T_1/(n_1-1)}{T_2/(n_2-1)} = \frac{\hat{s}_1^2/\sigma_1^2}{\hat{s}_2^2/\sigma_2^2} = \frac{\hat{s}_1^2\,\sigma_2^2}{\hat{s}_2^2\,\sigma_1^2} \sim F_{n_1-1,\,n_2-1}.$$
Under $H_0$ the two variances are assumed to be equal, i.e. $\sigma_1^2 = \sigma_2^2 = \sigma^2$, thus the statistic simplifies to:
$$T(X_{n_1}, X_{n_2}) \overset{H_0}{=} \frac{\hat{s}_1^2}{\hat{s}_2^2} \sim F_{n_1-1,\,n_2-1}.$$
This means that the null hypothesis of equal variances can be rejected when the observed statistic is as extreme as or more extreme than the critical value obtained from the F-distribution with $n_1-1$ and $n_2-1$ degrees of freedom at significance level $\alpha$.
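
A minimal sketch of the F-test (with illustrative samples); the built-in var.test() performs the same comparison:

F-test for two variances
set.seed(5)
v1 <- rnorm(40, sd = 3)
v2 <- rnorm(60, sd = 3) # equal variances by construction
# Observed F statistic
var(v1) / var(v2)
# Two-tailed critical values at significance level alpha
qf(c(alpha/2, 1 - alpha/2), df1 = 40 - 1, df2 = 60 - 1)
# Equivalent built-in call
var.test(v1, v2)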