23  Hypothesis tests

Setup
library(dplyr)
library(ggplot2)
# ================== Setups ==================
n <- 500 # sample size 
set.seed(1) # random seed 
mu_0 <- 2.4 # H0 mean 
mu_true <- 2 # true mean
alpha <- 0.05 # significance level
# ============================================
# Simulated random variable 
x <- rnorm(n, mean = mu_true, sd = 4)
# Grid of points for pdf
x_limits <- c(-4,4)
x_grid <- seq(x_limits[1], x_limits[2], by = 0.01)
x_breaks <- seq(x_limits[1], x_limits[2], by = 1)

A statistical hypothesis is a claim about the value of a parameter or population characteristic. In any hypothesis-testing problem, there are always two competing hypotheses under consideration:

  1. The null hypothesis $H_0$, representing the status quo.
  2. The alternative hypothesis $H_1$, representing the research claim.

The objective of hypothesis testing is to decide, based on sample information, whether the alternative hypothesis is actually supported by the data. One usually conducts new research to challenge existing beliefs.

Is there strong evidence for the alternative?

Suppose you want to establish that the null hypothesis $H_0$ is not supported by the data. One usually works under the assumption of $H_0$; if the sample does not strongly contradict $H_0$, we continue to believe in its plausibility. There are only two possible conclusions: reject $H_0$ or fail to reject $H_0$.

Definition 23.1 The test statistic $T(X_n)$ is a function of a sample $X_n$ and is used to decide whether the null hypothesis should be rejected or not. In theory, there is an infinite number of possible tests that could be devised. The choice of a particular test procedure must be based on the probability that the test will produce incorrect results. In general, two kinds of errors are associated with a test statistic:

  1. A type I error occurs when the null hypothesis is rejected although it is true.
  2. A type II error occurs when the null hypothesis is not rejected although it is false.

The p-value is in general related to the probability of a type I error: the smaller the p-value, the more evidence there is in the sample data against the null hypothesis and in favor of the alternative hypothesis.

In general, before performing a test one establishes a significance level $\alpha$ (the desired type I error probability), which defines the rejection region. The decision rule is then:
$$\text{Reject } H_0 \iff p\text{-value} \le \alpha, \qquad \text{Do not reject } H_0 \iff p\text{-value} > \alpha.$$
The p-value can be thought of as the smallest significance level at which $H_0$ can be rejected, and its calculation depends on whether the test is upper-, lower-, or two-tailed.
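
As a minimal sketch of this decision rule in R (the p-value below is a hypothetical placeholder, not computed from the data above):

P-value decision rule
# Hypothetical p-value returned by some test (placeholder value)
p_value <- 0.03
# Decision rule: reject H0 when the p-value does not exceed alpha
if (p_value <= alpha) "Reject H0" else "Fail to reject H0"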

For example, let’s consider a sample $X_n$ of data. A statistical test then consists of the following:

  1. an assumption about the distribution of the data, often expressed in terms of a statistical model $M$;
  2. a null hypothesis $H_0$ and an alternative hypothesis $H_1$, which make specific statements about the data;
  3. a test statistic $T(X_n)$, which is a function of the data and whose distribution under the null hypothesis is known;
  4. a significance level $\alpha$, which imposes an upper bound on the probability of rejecting $H_0$ given that $H_0$ is true.

The general procedure for a statistical hypothesis test can be summarized as follows:

  1. Inputs: consider a null hypothesis $H_0$ and the significance level $\alpha$.
  2. Critical value: compute the value $t_\alpha$ that partitions the set of possible values of $T(X_n)$ into rejection and non-rejection regions.
  3. Output: compare the observed test statistic $T(X_n)$, computed on the sample, with the critical value $t_\alpha$. If it falls in the rejection region, $H_0$ is rejected in favor of $H_1$; otherwise, the test fails to reject $H_0$.
| Step | Description |
|------|-------------|
| Inputs | $H_0$, $\alpha$ |
| Critical value | Critical value $t_\alpha$ |
| Output | Rejection or not, depending on $T(X_n)$ |
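
These steps can be wrapped in a small helper function; the following is an illustrative sketch (the function name and arguments are hypothetical), assuming a two-sided test whose statistic is Student-t under $H_0$:

General test procedure
# Illustrative helper: inputs (observed statistic, df, alpha), critical value, output
t_test_decision <- function(t_obs, df, alpha = 0.05) {
  t_crit <- qt(1 - alpha/2, df = df) # critical value t_{alpha/2}
  if (abs(t_obs) > t_crit) "Reject H0" else "Fail to reject H0"
}
t_test_decision(t_obs = 2.1, df = 499) # hypothetical observed statistic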

In general, two kinds of tests are available: one-tailed (left or right) and two-tailed tests.

23.1 Left- and right-tailed tests

For example, let’s simulate a sample $X_n$ of $n = 500$ observations from a normal distribution (i.e. $X_n \sim N(2, 4^2)$) and consider the following set of hypotheses:
$$H_0: \mu(X) = 2.4 \qquad H_1: \mu(X) \neq 2.4.$$
The test statistic is defined as
$$T(X_n) = \sqrt{500}\,\frac{\mu(X_n) - 2.4}{\sigma(X_n)} \overset{H_0}{\sim} t(499).$$
Since it is a two-tailed test, the critical value for a significance level $\alpha$, denoted $t_{\alpha/2}$, is such that
$$\alpha = P\big(\{T(X_n) < -t_{\alpha/2}\} \cup \{T(X_n) > t_{\alpha/2}\}\big) \quad \Longrightarrow \quad t_{\alpha/2} = P^{-1}(1 - \alpha/2),$$
where $P^{-1}$ and $P$ are respectively the quantile and distribution functions of a Student-t. If $|T(X_n)| > t_{\alpha/2}$, we reject $H_0$ and conclude that the mean of the sample is significantly different from 2.4. More precisely, with $\alpha = 0.05$, the critical value of a Student-t with 499 degrees of freedom is $t_{\alpha/2} = 1.9647$.

Two-tailed test
# Statistic T
z <- sqrt(n)*(mean(x) - mu_0)/sd(x)
# Student-t density
pdf <- dt(x_grid, df = n-1)
# Critical value left 
z_left <- c(qt(alpha/2, df = n-1), dt(qt(alpha/2, df = n-1), df = n-1))
# Critical value right 
z_right <- c(qt(1-alpha/2, df = n-1), dt(qt(1-alpha/2, df = n-1), df = n-1))
Plot t-test
# Area left tail 
x_left <- x_grid[x_grid < z_left[1]]
y_left <- dt(x_left, df = n-1)
# Area right tail 
x_right <- x_grid[x_grid > z_right[1]]
y_right <- dt(x_right, df = n-1)
# Central area
x_centre <- x_grid[x_grid > z_left[1] & x_grid < z_right[1]]
y_centre <- dt(x_centre, df = n-1)
ggplot()+
  geom_segment(aes(x = z_left[1], xend = z_left[1], y = 0, yend = z_left[2]), color = "red")+
  geom_segment(aes(x = z_right[1], xend = z_right[1], y = 0, yend = z_right[2]), color = "red")+
  geom_ribbon(aes(x = x_centre, ymin = 0, ymax = y_centre, fill = "norej"), alpha = 0.3)+
  geom_ribbon(aes(x = x_left, ymin = 0, ymax = y_left, fill = "rej"), alpha = 0.3)+
  geom_ribbon(aes(x = x_right, ymin = 0, ymax = y_right, fill = "rej"), alpha = 0.3)+
  geom_line(aes(x_grid, pdf))+
  geom_point(aes(z, 0), color = "black")+
  scale_fill_manual(values = c(rej = "red", norej = "green"), 
                    labels = c(rej = "Rejection", norej = "No rejection")) + 
  scale_x_continuous(breaks = x_breaks) +
  labs(y = "", x = "x", fill = NULL)+
  theme_bw()+
  theme(
    legend.position = c(.95, .95),
    legend.justification = c("right", "top"),
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6),
    panel.grid = element_blank())
Figure 23.1: Two-tailed test on the mean.
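
Equivalently, the decision can be based on the two-tailed p-value computed from the objects defined above; if it does not exceed $\alpha$, $H_0$ is rejected:

Two-tailed p-value
# Probability under H0 of a statistic at least as extreme as |z|
p_value <- 2 * pt(-abs(z), df = n - 1)
p_value <= alpha # if TRUE, reject H0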

Let’s consider another kind of hypothesis:
$$H_0: \mu(X) \ge 2.4 \qquad H_1: \mu(X) < 2.4.$$
The test statistic $T(X_n)$ does not change; however, the alternative hypothesis implies a left-tailed test. Hence, the critical value $t_\alpha$ is such that $P(T(X_n) < t_\alpha) = \alpha$. Applying the quantile function $P^{-1}$ of a Student-t we obtain:
$$t_\alpha = P^{-1}(\alpha),$$
where $P^{-1}$ and $P$ are respectively the quantile and distribution functions of a Student-t. In this case, with $\alpha = 0.05$, the critical value of a Student-t with 499 degrees of freedom is $t_\alpha = -1.6479$. Therefore, if $T(X_n) > -1.6479$ we do not reject the null hypothesis, i.e. there is no evidence that $\mu(X_n)$ is lower than $\mu_0$; otherwise we reject it and conclude that $\mu(X_n)$ is significantly lower than $\mu_0$.

Left-tailed test
# Critical value left 
z_left <- c(qt(alpha, df = n-1), dt(qt(alpha, df = n-1), df = n-1))
# Rejection area (left tail)
x_left <- x_grid[x_grid < z_left[1]]
y_left <- dt(x_left, df = n-1)
# Non-rejection area (right of the critical value)
x_right <- x_grid[x_grid > z_left[1]]
y_right <- dt(x_right, df = n-1)
ggplot()+
  geom_segment(aes(x = z_left[1], xend = z_left[1], y = 0, yend = z_left[2]), color = "red")+
  geom_ribbon(aes(x = x_left, ymin = 0, ymax = y_left, fill = "rej"), alpha = 0.3)+
  geom_ribbon(aes(x = x_right, ymin = 0, ymax = y_right, fill = "norej"), alpha = 0.3)+
  geom_line(aes(x_grid, pdf))+
  geom_point(aes(z, 0), color = "black")+
  scale_fill_manual(values = c(rej = "red", norej = "green"), 
                    labels = c(rej = "Rejection", norej = "No rejection")) + 
  scale_x_continuous(breaks = x_breaks) +
  labs(y = "", x = "x", fill = NULL)+
  theme_bw()+
  theme(
    legend.position = c(.95, .95),
    legend.justification = c("right", "top"),
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6),
    panel.grid = element_blank())
Figure 23.2: Left-tailed test on the mean.

In this case we reject the null hypothesis, hence $\mu(X_n)$ is significantly lower than $\mu_0$.
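
The same conclusion follows from the left-tailed p-value:

Left-tailed p-value
# Probability under H0 of a statistic lower than the observed z
p_value <- pt(z, df = n - 1)
p_value <= alpha # TRUE here: reject H0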

Lastly, let’s consider the right-tailed case:
$$H_0: \mu(X) \le 2.4 \qquad H_1: \mu(X) > 2.4.$$
It is again a one-sided test, but in this case right-tailed. Hence, the critical value $t_\alpha$ is such that $P(T(X_n) < t_\alpha) = 1 - \alpha$, i.e.
$$t_\alpha = P^{-1}(1 - \alpha),$$
where $P^{-1}$ and $P$ are respectively the quantile and distribution functions of a Student-t. In this case, with $\alpha = 0.05$, the critical value of a Student-t with 499 degrees of freedom is $t_\alpha = 1.6479$. Therefore, if $T(X_n) < 1.6479$ we do not reject the null hypothesis, i.e. there is no evidence that $\mu(X_n)$ is greater than $\mu_0$; otherwise we reject it and conclude that $\mu(X_n)$ is greater than $\mu_0$. Coherently with the left-tailed test above, the right-tailed test is not rejected here, consistent with $\mu(X_n)$ being lower than $\mu_0 = 2.4$.

Right-tailed test
# Critical value right 
z_right <- c(qt(1-alpha, df = n-1), dt(qt(1-alpha, df = n-1), df = n-1))
# Rejection area (right tail)
x_rej <- x_grid[x_grid > z_right[1]]
y_rej <- dt(x_rej, df = n-1)
# Non-rejection area (left of the critical value)
x_norej <- x_grid[x_grid < z_right[1]]
y_norej <- dt(x_norej, df = n-1)
ggplot()+
  geom_segment(aes(x = z_right[1], xend = z_right[1], y = 0, yend = z_right[2]), color = "red")+
  geom_ribbon(aes(x = x_rej, ymin = 0, ymax = y_rej, fill = "rej"), alpha = 0.3)+
  geom_ribbon(aes(x = x_norej, ymin = 0, ymax = y_norej, fill = "norej"), alpha = 0.3)+
  geom_line(aes(x_grid, pdf))+
  geom_point(aes(z, 0), color = "black")+
  scale_fill_manual(values = c(rej = "red", norej = "green"), 
                    labels = c(rej = "Rejection", norej = "No rejection")) + 
  scale_x_continuous(breaks = x_breaks) +
  labs(y = "", x = "x", fill = NULL)+
  theme_bw()+
  theme(
    legend.position = c(.95, .95),
    legend.justification = c("right", "top"),
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6),
    panel.grid = element_blank())
Figure 23.3: Right-tailed test on the mean.
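
Again, the p-value leads to the same conclusion; for a right-tailed test it is the upper-tail probability of the observed statistic:

Right-tailed p-value
# Probability under H0 of a statistic greater than the observed z
p_value <- pt(z, df = n - 1, lower.tail = FALSE)
p_value <= alpha # FALSE here: fail to reject H0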

23.2 Tests for the means

Proposition 23.1 Let’s consider the t-test for the mean of a sample of independent, identically and normally distributed random variables $X_n = (x_1, \dots, x_i, \dots, x_n)$. Then the test statistic $T(X_n)$ under $H_0: \mu(X) = \mu_0$ is Student-t distributed with $n-1$ degrees of freedom, i.e.
$$T(X_n) = \sqrt{n}\,\frac{\hat{\mu}(X_n) - \mu_0}{\hat{s}(X_n)} \overset{H_0}{\sim} t_{n-1},$$
where $\hat{\mu}(X_n)$ is the sample mean and $\hat{s}^2(X_n)$ the corrected sample variance. Moreover, for $n \to \infty$:
$$T(X_n) \xrightarrow[n \to \infty]{H_0} N(0,1).$$
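
The statistic in Proposition 23.1 is the one computed manually above; as a sanity check, it can be compared with R’s built-in t.test() on the simulated sample x:

Built-in t-test
# One-sample t-test of H0: mu = mu_0 against a two-sided alternative
test <- t.test(x, mu = mu_0, alternative = "two.sided")
test$statistic # equal to sqrt(n)*(mean(x) - mu_0)/sd(x)
test$p.value   # two-tailed p-value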

Proof. If the sample is normally distributed, the sample mean is also normally distributed, i.e.
$$M = \sqrt{n}\,\frac{\hat{\mu}(X_n) - \mu_0}{\sigma} \sim N(0,1).$$
Under normality the sample variance, which is a sum of squares of independent and normally distributed random variables, follows a $\chi^2$ distribution with $n-1$ degrees of freedom, i.e.
$$V = (n-1)\,\frac{\hat{s}^2(X_n)}{\sigma^2} \sim \chi^2_{n-1}.$$
Notably, a standard normal divided by the square root of an independent $\chi^2$ random variable over its degrees of freedom is exactly the definition of a Student-t random variable. Hence, the ratio of the statistics $M$ and $\sqrt{V/(n-1)}$ reads
$$\frac{M}{\sqrt{V/(n-1)}} = \sqrt{n}\,\frac{\hat{\mu}(X_n) - \mu_0}{\sigma}\sqrt{\frac{\sigma^2}{\hat{s}^2(X_n)}} = \sqrt{n}\,\frac{\hat{\mu}(X_n) - \mu_0}{\hat{s}(X_n)} \sim t_{n-1}.$$
The test statistic under $H_0$ therefore follows a Student-t distribution with $n-1$ degrees of freedom. Notably, for large IID samples the statistic converges to a normal random variable independently of the distribution of $X$.
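
The distributional result can also be illustrated by simulation; a minimal sketch (the sample size and number of replications are arbitrary choices) comparing empirical and theoretical quantiles of the statistic under $H_0$:

Monte Carlo check
# Simulate the statistic under H0 and compare with the t(m-1) quantiles
set.seed(2)
m <- 30 # hypothetical sample size
t_stats <- replicate(10000, {
  s <- rnorm(m, mean = mu_0, sd = 4) # sample generated under H0
  sqrt(m) * (mean(s) - mu_0) / sd(s)
})
quantile(t_stats, 0.95) # empirical 95% quantile
qt(0.95, df = m - 1)    # theoretical 95% quantile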

23.2.1 Test for two means and equal variances

Let’s consider two independent Gaussian populations with equal variance, i.e.
$$X_1 \sim N(\mu_1, \sigma^2), \qquad X_2 \sim N(\mu_2, \sigma^2).$$
Then, let’s consider two samples of possibly unequal sizes $n_1$ and $n_2$, with unknown means $\mu_1$ and $\mu_2$ and an equal unknown variance $\sigma^2$. Given the null hypothesis $H_0: \mu_1 - \mu_2 = \mu_\Delta$, the test statistic
$$T(X_{n_1}, X_{n_2}) = \frac{\mu(X_{n_1}) - \mu(X_{n_2}) - \mu_\Delta}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1+n_2-2}$$
is Student-t distributed with $n_1+n_2-2$ degrees of freedom, where the pooled standard deviation is
$$s_p = \sqrt{\frac{(n_1-1)\,\hat{s}^2(X_{n_1}) + (n_2-1)\,\hat{s}^2(X_{n_2})}{n_1+n_2-2}},$$
and $\hat{s}^2(X_{n_1})$ and $\hat{s}^2(X_{n_2})$ are the corrected sample variances of the two samples.
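
A minimal sketch with two simulated samples (sample sizes and parameters are illustrative choices, not from the text), computing the pooled statistic by hand and via t.test():

Pooled two-sample t-test
set.seed(3)
n1 <- 40; n2 <- 60
x1 <- rnorm(n1, mean = 2.0, sd = 4)
x2 <- rnorm(n2, mean = 2.4, sd = 4) # equal variances by construction
# Pooled standard deviation
s_p <- sqrt(((n1 - 1)*var(x1) + (n2 - 1)*var(x2)) / (n1 + n2 - 2))
# Statistic under H0: mu1 - mu2 = 0
(mean(x1) - mean(x2)) / (s_p * sqrt(1/n1 + 1/n2))
# Equivalent built-in call
t.test(x1, x2, var.equal = TRUE)$statistic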

23.2.2 Test for two means and unequal variances

Let’s consider two independent Gaussian populations with different variances, i.e.
$$X_1 \sim N(\mu_1, \sigma_1^2), \qquad X_2 \sim N(\mu_2, \sigma_2^2).$$
Then, let’s consider two samples of possibly unequal sizes $n_1$ and $n_2$, with unknown means $\mu_1$ and $\mu_2$ and unequal unknown variances $\sigma_1^2$ and $\sigma_2^2$. Given the null hypothesis $H_0: \mu_1 - \mu_2 = \mu_\Delta$, Welch proposed the test statistic
$$T(X_{n_1}, X_{n_2}) = \frac{\mu(X_{n_1}) - \mu(X_{n_2}) - \mu_\Delta}{\sqrt{\frac{\hat{s}^2(X_{n_1})}{n_1} + \frac{\hat{s}^2(X_{n_2})}{n_2}}} \sim t_\nu,$$
which under the null hypothesis follows approximately a Student-t distribution, but with fractional degrees of freedom computed using the Welch–Satterthwaite approximation. This is a weighted combination of the degrees of freedom of each group, reflecting the uncertainty due to unequal variances, i.e.
$$\nu = \frac{\left(\frac{\hat{s}^2(X_{n_1})}{n_1} + \frac{\hat{s}^2(X_{n_2})}{n_2}\right)^2}{\frac{\left(\hat{s}^2(X_{n_1})\right)^2}{n_1^2(n_1-1)} + \frac{\left(\hat{s}^2(X_{n_2})\right)^2}{n_2^2(n_2-1)}},$$
where $\nu$ is not necessarily an integer.
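
Similarly for the Welch test; the sketch below (again with illustrative samples) computes the statistic and the Welch–Satterthwaite degrees of freedom by hand, then checks them against t.test(), whose default is var.equal = FALSE:

Welch two-sample t-test
set.seed(4)
y1 <- rnorm(40, mean = 2.0, sd = 2)
y2 <- rnorm(60, mean = 2.4, sd = 5) # unequal variances by construction
se <- sqrt(var(y1)/40 + var(y2)/60)
# Statistic under H0: mu1 - mu2 = 0
(mean(y1) - mean(y2)) / se
# Welch-Satterthwaite degrees of freedom
se^4 / ((var(y1)/40)^2/(40 - 1) + (var(y2)/60)^2/(60 - 1))
# Equivalent built-in call
t.test(y1, y2, var.equal = FALSE)[c("statistic", "parameter")]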

23.3 Tests for the variances

23.3.1 F-test for two variances

Consider two independent normal samples, i.e.
$$X_{n_1} \sim N(\mu_1, \sigma_1^2), \qquad X_{n_2} \sim N(\mu_2, \sigma_2^2),$$
where $n_1$ and $n_2$ are the numbers of observations in each sample, and the null hypothesis $H_0: \sigma_1^2 = \sigma_2^2 = \sigma^2$. Knowing that the sample variance is $\chi^2$-distributed, let’s define the variables:
$$T_1 = (n_1-1)\,\frac{\hat{s}_1^2}{\sigma_1^2} \sim \chi^2_{n_1-1}, \qquad T_2 = (n_2-1)\,\frac{\hat{s}_2^2}{\sigma_2^2} \sim \chi^2_{n_2-1}.$$
Then, since the ratio of two independent $\chi^2$ random variables, each divided by its respective degrees of freedom, is F-distributed, the statistic is defined as:
$$T(X_{n_1}, X_{n_2}) = \frac{T_1/(n_1-1)}{T_2/(n_2-1)} = \frac{\hat{s}_1^2/\sigma_1^2}{\hat{s}_2^2/\sigma_2^2} = \frac{\hat{s}_1^2\,\sigma_2^2}{\hat{s}_2^2\,\sigma_1^2} \sim F_{n_1-1,\,n_2-1}.$$
Under $H_0$ the two variances are assumed to be equal, i.e. $\sigma_1^2 = \sigma_2^2 = \sigma^2$, thus the statistic simplifies to:
$$T(X_{n_1}, X_{n_2}) \overset{H_0}{=} \frac{\hat{s}_1^2}{\hat{s}_2^2} \sim F_{n_1-1,\,n_2-1}.$$
This means that the null hypothesis of equal variances can be rejected when the observed statistic is as extreme as or more extreme than the critical value obtained from the F-distribution with $n_1-1$ and $n_2-1$ degrees of freedom at significance level $\alpha$.
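
A minimal sketch of the F-test (with illustrative samples); the built-in var.test() performs the same comparison:

F-test for two variances
set.seed(5)
v1 <- rnorm(40, sd = 3)
v2 <- rnorm(60, sd = 3) # equal variances by construction
# Observed F statistic
var(v1) / var(v2)
# Two-tailed critical values at significance level alpha
qf(c(alpha/2, 1 - alpha/2), df1 = 40 - 1, df2 = 60 - 1)
# Equivalent built-in call
var.test(v1, v2)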