6  Conditional expectation

Reference: Pietro Rigo

Theorem 6.1 (Radon–Nikodym)
Consider a measurable space $(\Omega,\mathcal{B})$ and two measures $\mu$, $\nu$ on it such that $\mu$ and $\nu$ are $\sigma$-finite and $\mu \ll \nu$, i.e. $\mu$ is absolutely continuous with respect to $\nu$. Then there exists a measurable function $X:\Omega \to [0,+\infty)$ such that:
$$\mu(A) = \int_A X \, d\nu \qquad \forall A \in \mathcal{B}.$$
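As a concrete finite illustration (a sketch under assumed toy measures, not part of the original notes): on a finite $\Omega$ with $\nu$ the counting measure, the Radon–Nikodym derivative $X = d\mu/d\nu$ is simply the pointwise ratio of the masses, and the defining identity can be verified directly.

```python
from fractions import Fraction

# Finite Ω = {1,...,5}; ν is the counting measure, μ assigns weights to the points.
omega = [1, 2, 3, 4, 5]
nu = {w: Fraction(1) for w in omega}          # dominating measure
mu = {w: Fraction(w, 15) for w in omega}      # μ ≪ ν, total mass 1 (toy choice)

# Radon–Nikodym derivative X = dμ/dν: pointwise ratio of the masses.
X = {w: mu[w] / nu[w] for w in omega}

# Check μ(A) = ∫_A X dν = Σ_{ω ∈ A} X(ω) ν({ω}) on some set A.
A = {2, 4, 5}
print(sum(mu[w] for w in A) == sum(X[w] * nu[w] for w in A))   # True
```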

Definition 6.1 (Conditional expectation)
Given a probability space $(\Omega,\mathcal{B},P)$, let’s consider a sub $\sigma$-field of $\mathcal{B}$, i.e. $\mathcal{G} \subseteq \mathcal{B}$, and a random variable $X:\Omega \to \mathbb{R}$ with finite expectation, $\mathbb{E}\{|X|\} < +\infty$. Then, the conditional expectation of $X$ given $\mathcal{G}$ is any random variable $Z = \mathbb{E}\{X \mid \mathcal{G}\}$ such that:

  1. $Z$ has finite expectation, i.e. $\mathbb{E}\{|Z|\} < +\infty$.
  2. $Z$ is $\mathcal{G}$-measurable.
  3. $\mathbb{E}\{\mathbb{1}_A Z\} = \mathbb{E}\{\mathbb{1}_A X\}$ for every $A \in \mathcal{G}$, namely if $X$ and $Z$ are restricted to any $A \in \mathcal{G}$, then their expectations coincide.

A $\sigma$-field can be used to describe our state of information: for every $A \in \mathcal{G}$ we already know whether the event $A$ has occurred or not. Therefore, by collecting in $\mathcal{G}$ the events whose occurrence is already known and requiring $Z$ to be $\mathcal{G}$-measurable, we are saying that the value of $Z$ is no longer stochastic once we know the information contained in $\mathcal{G}$. In this context, one can see the random variable $Z = \mathbb{E}\{X \mid \mathcal{G}\}$ as the prediction of $X$, given the information contained in the sub $\sigma$-field $\mathcal{G}$.
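To make the interpretation concrete, here is a minimal simulation sketch (the setup and variable names are illustrative assumptions, not part of the original notes): the sub $\sigma$-field $\mathcal{G}$ is generated by a coin flip `C`, and $\mathbb{E}\{X \mid \mathcal{G}\}$ is approximated by averaging $X$ within each atom of $\mathcal{G}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# G is generated by a coin flip C; X depends on C plus independent noise.
n = 100_000
C = rng.integers(0, 2, size=n)          # the information available in G
X = 2.0 * C + rng.normal(size=n)        # the target random variable

# E{X | G} is constant on each atom {C=0}, {C=1}: approximate it by group means.
Z = np.where(C == 1, X[C == 1].mean(), X[C == 0].mean())

# Defining property: E{1_A Z} = E{1_A X} for every A in G, e.g. A = {C=1}.
A = (C == 1)
print(np.mean(A * Z), np.mean(A * X))   # both close to E{1_A X} ≈ 1.0
```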

Definition 6.2 (Predictor)
Consider any $\mathcal{G}$-measurable random variable $Z$. Then $Z$ can be interpreted as a predictor of another random variable $X$ under the information contained in the $\sigma$-field $\mathcal{G}$. When we substitute $X$ with its prediction $Z$, we make an error given by the difference $X - Z$. In the special case in which $\mathbb{E}\{|X|^2\} < \infty$ and using the mean squared error as error function, i.e. $\mathbb{E}\{\text{error}^2\} = \mathbb{E}\{(X - Z)^2\}$, it is possible to prove that the conditional expectation $\mathbb{E}\{X \mid \mathcal{G}\}$ represents the best predictor of $X$, in the sense that it minimizes the mean squared error:
$$\mathbb{E}\{(X - \mathbb{E}\{X \mid \mathcal{G}\})^2\} = \min_{Z \in \mathcal{Z}} \mathbb{E}\{(X - Z)^2\}.$$
Hence, $\mathbb{E}\{X \mid \mathcal{G}\}$ is the best predictor in the sense that it minimizes the mean squared error over the class $\mathcal{Z}$ of $\mathcal{G}$-measurable random variables with finite second moment, i.e.
$$\mathbb{E}\{X \mid \mathcal{G}\} = \arg\min_{Z \in \mathcal{Z}} \mathbb{E}\{(X - Z)^2\}, \qquad \mathcal{Z} = \{Z : Z \text{ is } \mathcal{G}\text{-measurable and } \mathbb{E}\{|Z|^2\} < \infty\}.$$
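The minimization property can be checked numerically. Below is a minimal sketch (the setup, variable names and numbers are illustrative assumptions): $\mathcal{G}$ is generated by a die roll `D`, $X = D^2$ plus noise, and the group-mean estimate of $\mathbb{E}\{X \mid \mathcal{G}\}$ is compared with two other $\mathcal{G}$-measurable predictors.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 200_000
D = rng.integers(1, 7, size=n)            # the information in G: a die roll
X = D**2 + rng.normal(size=n)             # the variable to be predicted

# E{X | G} is approximated by the sample mean of X within each value of D.
group_means = {v: X[D == v].mean() for v in range(1, 7)}
best = np.vectorize(group_means.get)(D)

candidates = {
    "E{X|G}": best,
    "constant E{X}": np.full(n, X.mean()),   # a (trivially) G-measurable predictor
    "linear 7*D - 10": 7.0 * D - 10.0,       # another function of D
}
for name, Z in candidates.items():
    print(name, np.mean((X - Z) ** 2))       # E{X|G} attains the smallest MSE
```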

6.1 Properties of conditional expectation

Here we state some useful properties of conditional expectation:

  1. Linearity: the conditional expectation is linear, i.e. for all constants $a, b \in \mathbb{R}$, $\mathbb{E}\{aX + bY \mid \mathcal{G}\} = a\,\mathbb{E}\{X \mid \mathcal{G}\} + b\,\mathbb{E}\{Y \mid \mathcal{G}\}$.
  2. Positivity: $X \ge 0$ implies that $\mathbb{E}\{X \mid \mathcal{G}\} \ge 0$.
  3. Measurability: if $Y$ is $\mathcal{G}$-measurable, then $\mathbb{E}\{XY \mid \mathcal{G}\} = Y\,\mathbb{E}\{X \mid \mathcal{G}\}$. In particular, if $X$ is $\mathcal{G}$-measurable then $\mathbb{E}\{X \mid \mathcal{G}\} = X$, i.e. $X$ is not stochastic given the information in $\mathcal{G}$.
  4. Constant: the conditional expectation of a constant is the constant itself, i.e. $\mathbb{E}\{a \mid \mathcal{G}\} = a$ for all $a \in \mathbb{R}$.
  5. Independence: if $X$ is independent of the $\sigma$-field $\mathcal{G}$, then $\mathbb{E}\{X \mid \mathcal{G}\} = \mathbb{E}\{X\}$.
  6. Chain rule: if one considers two sub $\sigma$-fields of $\mathcal{B}$ such that $\mathcal{G}_1 \subseteq \mathcal{G}_2$, then we can write:
     $$\mathbb{E}\{X \mid \mathcal{G}_1\} = \mathbb{E}\{\mathbb{E}\{X \mid \mathcal{G}_2\} \mid \mathcal{G}_1\}, \qquad \mathcal{G}_1 \subseteq \mathcal{G}_2. \tag{6.1}$$
     Remember that, when using the chain rule, the conditional expectation must be taken first with respect to the larger $\sigma$-field, i.e. the one containing more information (here $\mathcal{G}_2$), and then with respect to the smaller one (here $\mathcal{G}_1$); see the numerical sketch after this list.
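As a rough numerical check of the chain rule (a sketch with made-up variables, not part of the original notes): let $\mathcal{G}_2$ be generated by a die roll `D` and $\mathcal{G}_1 \subseteq \mathcal{G}_2$ by the coarser indicator `C = 1{D ≥ 4}`; conditional expectations are approximated by group means.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 300_000
D = rng.integers(1, 7, size=n)            # a die roll: generates G2
C = (D >= 4).astype(int)                  # coarser "high/low" indicator: generates G1 ⊆ G2
X = D + rng.normal(size=n)

def cond_mean(X, labels):
    """Approximate E{X | sigma(labels)} by the sample mean of X within each label."""
    out = np.empty_like(X)
    for v in np.unique(labels):
        out[labels == v] = X[labels == v].mean()
    return out

lhs = cond_mean(X, C)                     # E{X | G1}
rhs = cond_mean(cond_mean(X, D), C)       # E{ E{X | G2} | G1 }
print(np.max(np.abs(lhs - rhs)))          # ≈ 0: nested group means satisfy the identity
```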

6.2 Conditional probability

Proposition 6.1 (Conditional probability)
Given a probability space $(\Omega,\mathcal{F},P)$, consider $\mathcal{G}$ a sub $\sigma$-field of $\mathcal{F}$, i.e. $\mathcal{G} \subseteq \mathcal{F}$. Then the general definition of the conditional probability of an event $A$, given $\mathcal{G}$, is
$$P(A \mid \mathcal{G}) = \mathbb{E}\{\mathbb{1}_A \mid \mathcal{G}\}. \tag{6.2}$$
The elementary definition, instead, does not condition on a $\sigma$-field but on a single event $B$. In practice, take an event $B \in \mathcal{F}$ such that $0 < P(B) < 1$; then for every $A \in \mathcal{F}$ the conditional probability of $A$ given $B$ is defined as:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(A \mid B^c) = \frac{P(A \cap B^c)}{P(B^c)}. \tag{6.3}$$

Proof. The elementary (6.3) and the general (6.2) definitions are equivalent. In fact, consider a sub $\sigma$-field $\mathcal{G}$ which provides only the information concerning whether $\omega$ is in $B$ or not. A $\sigma$-field of this kind has the form $\mathcal{G}_B = \{\Omega, \emptyset, B, B^c\}$. Any $\mathcal{G}_B$-measurable function $f:\Omega \to \mathbb{R}$ must be of the form
$$f(\omega) = \begin{cases} \alpha & \omega \in B \\ \beta & \omega \in B^c \end{cases}$$
It remains to find $\alpha$ and $\beta$ in the expression
$$P(A \mid \mathcal{G}_B) = \mathbb{E}\{\mathbb{1}_A \mid \mathcal{G}_B\} = \alpha\, \mathbb{1}_B + \beta\, \mathbb{1}_{B^c}.$$
Note that the joint probability of $A$ and $B$ can be obtained as:
$$\begin{aligned}
P(A \cap B) = \mathbb{E}\{\mathbb{1}_A \mathbb{1}_B\} &= \mathbb{E}\{\mathbb{E}\{\mathbb{1}_A \mathbb{1}_B \mid \mathcal{G}_B\}\} \\
&= \mathbb{E}\{\mathbb{E}\{\mathbb{1}_A \mid \mathcal{G}_B\}\, \mathbb{1}_B\} \\
&= \mathbb{E}\{P(A \mid \mathcal{G}_B)\, \mathbb{1}_B\} \\
&= \mathbb{E}\{(\alpha\, \mathbb{1}_B + \beta\, \mathbb{1}_{B^c})\, \mathbb{1}_B\} \\
&= \alpha\, \mathbb{E}\{\mathbb{1}_B\} + \beta\, \mathbb{E}\{\mathbb{1}_{B^c} \mathbb{1}_B\} \\
&= \alpha\, P(B),
\end{aligned}$$
where the second equality uses the fact that $\mathbb{1}_B$ is $\mathcal{G}_B$-measurable and the last one uses $\mathbb{1}_{B^c}\mathbb{1}_B = 0$. Hence we obtain:
$$P(A \cap B) = \alpha\, P(B) \;\Longrightarrow\; \alpha = \frac{P(A \cap B)}{P(B)}.$$
Working equivalently with $P(A \cap B^c)$ it is possible to prove that:
$$P(A \cap B^c) = \beta\, P(B^c) \;\Longrightarrow\; \beta = \frac{P(A \cap B^c)}{P(B^c)}.$$
Finally, thanks to this result, the conditional probability in the general definition (6.2) can be expressed as a linear combination of conditional probabilities defined according to the elementary definition (6.3), i.e.
$$P(A \mid \mathcal{G}_B) = P(A \mid B)\, \mathbb{1}_B + P(A \mid B^c)\, \mathbb{1}_{B^c}.$$
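The equivalence can be checked on a toy example (a sketch with an assumed fair die, not part of the original notes): take $A$ = “the roll is even” and $B$ = “the roll is at most 3”, build $P(A \mid \mathcal{G}_B)$ from the elementary formula, and verify the defining property of the conditional expectation.

```python
from fractions import Fraction

# One fair die: Ω = {1,...,6}, A = "even", B = "at most 3".
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}
A = {2, 4, 6}
B = {1, 2, 3}
Bc = set(omega) - B

def prob(E):
    return sum(P[w] for w in E)

# Elementary definition (6.3).
P_A_given_B  = prob(A & B)  / prob(B)    # 1/3
P_A_given_Bc = prob(A & Bc) / prob(Bc)   # 2/3

# General definition (6.2): P(A | G_B)(ω) = α 1_B(ω) + β 1_{B^c}(ω).
def P_A_given_GB(w):
    return P_A_given_B if w in B else P_A_given_Bc

# Defining property of conditional expectation: E{1_A 1_B} = E{P(A|G_B) 1_B}.
lhs = prob(A & B)
rhs = sum(P[w] * P_A_given_GB(w) for w in B)
print(P_A_given_B, P_A_given_Bc, lhs == rhs)   # 1/3 2/3 True
```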

Exercise 6.1 Let’s continue from the card-counting exercise and say that we observe $X(\omega) = +1$ on the first extraction. We then ask ourselves: what is the probability that the next extraction gives $X(\omega) = 0$?

Solution 6.1. With a full deck of 52 cards, the probability of obtaining $X(\omega) = 0$ is $\frac{12}{52} \approx 23.08\%$, while that of $X(\omega) = +1$ is $\frac{20}{52} \approx 38.46\%$.

Hence, if we have extracted a card that yields $X(\omega) = +1$, only 19 cards giving $+1$ remain in the deck, while the total number of cards drops to 51. Thus, the conditional probability of another $+1$ decreases, $P(X(\omega_2) = +1 \mid X(\omega_1) = +1) = \frac{19}{51} \approx 37.25\%$, while the conditional probability of a $0$ increases, $P(X(\omega_2) = 0 \mid X(\omega_1) = +1) = \frac{12}{51} \approx 23.53\%$.
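The same numbers can be reproduced with exact fractions (a sketch assuming the deck composition used in the solution: 20 cards counting $+1$, 12 counting $0$, 20 counting $-1$).

```python
from fractions import Fraction

# Deck composition assumed from the solution: 20 cards count +1, 12 count 0, 20 count -1.
deck = {+1: 20, 0: 12, -1: 20}
total = sum(deck.values())               # 52

# Unconditional probabilities on the first extraction.
print({v: float(Fraction(n, total)) for v, n in deck.items()})

# Condition on having drawn a +1: one +1 card and one card overall leave the deck.
p_plus_given_plus = Fraction(deck[+1] - 1, total - 1)   # 19/51 ≈ 37.25%
p_zero_given_plus = Fraction(deck[0], total - 1)        # 12/51 ≈ 23.53%
print(float(p_plus_given_plus), float(p_zero_given_plus))
```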

Proposition 6.2 (Conditional probability and independent events)
If $A$ and $B$ are two events with $P(B) > 0$, then $A \perp B \iff P(A \mid B) = P(A)$.

Proof. To prove the result in Proposition 6.2, we consider both directions of the equivalence.

  1. (LHS $\Rightarrow$ RHS) Let’s assume that $A$ and $B$ are independent, i.e. $A \perp B$. Then, by the definition of independence, $P(A \cap B) = P(A)\,P(B)$, and hence by the elementary definition (6.3):
     $$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)\,P(B)}{P(B)} = P(A).$$

  2. (RHS $\Rightarrow$ LHS) Let’s assume that $P(A \mid B) = P(A)$ with $P(B) > 0$. Then, by the definition of conditional probability (6.3), the joint probability is
     $$P(A \cap B) = P(A \mid B)\,P(B) = P(A)\,P(B),$$
     which is exactly the definition of independence; a quick numerical check is sketched below.
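A quick sanity check of the equivalence with two assumed fair dice (an illustrative sketch): $A$ depends only on the first die and $B$ only on the second, so they are independent and $P(A \mid B) = P(A)$.

```python
from fractions import Fraction
from itertools import product

# Two fair dice: A = "first die is even", B = "second die is at most 2".
omega = list(product(range(1, 7), repeat=2))
P = Fraction(1, len(omega))

A = {w for w in omega if w[0] % 2 == 0}
B = {w for w in omega if w[1] <= 2}

p = lambda E: len(E) * P
print(p(A & B) == p(A) * p(B))      # True: A and B are independent
print(p(A & B) / p(B) == p(A))      # True: equivalently, P(A|B) = P(A)
```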

Theorem 6.2 (Bayes’ Theorem)
Let’s consider a partition of $\Omega$ into disjoint events $\{A_1, \dots, A_n\}$, each $A_i \subseteq \Omega$ with $P(A_i) > 0$ and such that $\bigcup_{i=1}^n A_i = \Omega$. Then, given any event $B \subseteq \Omega$ with probability greater than zero, $P(B) > 0$, for any $j \in \{1, 2, \dots, n\}$ the conditional probability of the event $A_j$ given $B$ is:
$$P(A_j \mid B) = \frac{P(B \mid A_j)\, P(A_j)}{\sum_{i=1}^n P(B \mid A_i)\, P(A_i)}.$$
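The formula translates directly into code. The following is a minimal sketch (the function name and the numbers are illustrative assumptions): the posterior over a finite partition is the normalized product of priors and likelihoods.

```python
def bayes_posterior(prior, likelihood):
    """Posterior P(A_j | B) over a finite partition, given priors P(A_j)
    and likelihoods P(B | A_j), as in Theorem 6.2."""
    evidence = sum(p * l for p, l in zip(prior, likelihood))   # P(B)
    return [p * l / evidence for p, l in zip(prior, likelihood)]

# Hypothetical three-event partition.
prior = [0.5, 0.3, 0.2]          # P(A_1), P(A_2), P(A_3)
likelihood = [0.1, 0.4, 0.8]     # P(B | A_1), P(B | A_2), P(B | A_3)
print(bayes_posterior(prior, likelihood))   # posterior probabilities, summing to 1
```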

Example 6.1 Let’s consider two random variables $X(\omega)$ and $Y(\omega)$ taking values in $\{0,1\}$, with marginal probabilities $P(X=0)=0.6$ and $P(Y=0)=0.29$. Let’s consider the matrix of joint events and the corresponding joint probabilities, i.e.
$$\begin{pmatrix} [X=0]\cap[Y=0] & [X=0]\cap[Y=1] \\ [X=1]\cap[Y=0] & [X=1]\cap[Y=1] \end{pmatrix} \;\longrightarrow\; P = \begin{pmatrix} 0.17 & 0.43 \\ 0.12 & 0.28 \end{pmatrix}.$$
Then, by definition (6.3), the conditional probabilities are:
$$P(X=0 \mid Y=0) = \frac{P(X=0 \cap Y=0)}{P(Y=0)} = \frac{0.17}{0.29} \approx 58.62\%,$$
and
$$P(X=0 \mid Y=1) = \frac{P(X=0 \cap Y=1)}{P(Y=1)} = \frac{0.43}{1-0.29} \approx 60.56\%.$$
Considering $Y$ instead:
$$P(Y=0 \mid X=0) = \frac{P(Y=0 \cap X=0)}{P(X=0)} = \frac{0.17}{0.6} \approx 28.33\%,$$
and
$$P(Y=0 \mid X=1) = \frac{P(Y=0 \cap X=1)}{P(X=1)} = \frac{0.12}{1-0.6} = 30\%.$$
Then, it is possible to recover the marginal probability of $X$ as:
$$\begin{aligned}
P(X=0) = \mathbb{E}\{P(X=0 \mid Y)\} &= P(X=0 \mid Y=0)\,P(Y=0) + P(X=0 \mid Y=1)\,P(Y=1) \\
&\approx 0.5862 \cdot 0.29 + 0.6056 \cdot (1-0.29) \approx 60\%,
\end{aligned}$$
and similarly for $Y$:
$$\begin{aligned}
P(Y=0) = \mathbb{E}\{P(Y=0 \mid X)\} &= P(Y=0 \mid X=0)\,P(X=0) + P(Y=0 \mid X=1)\,P(X=1) \\
&\approx 0.2833 \cdot 0.6 + 0.30 \cdot (1-0.6) = 29\%.
\end{aligned}$$
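The computations of the example can be reproduced from the joint matrix alone (a short numerical sketch; the array names are illustrative).

```python
import numpy as np

# Joint probability matrix from Example 6.1: rows index X ∈ {0,1}, columns Y ∈ {0,1}.
joint = np.array([[0.17, 0.43],
                  [0.12, 0.28]])

p_x = joint.sum(axis=1)                  # marginals of X: [0.6, 0.4]
p_y = joint.sum(axis=0)                  # marginals of Y: [0.29, 0.71]

cond_x_given_y = joint / p_y             # P(X=x | Y=y): each column sums to 1
cond_y_given_x = joint / p_x[:, None]    # P(Y=y | X=x): each row sums to 1

print(cond_x_given_y[0])                 # ≈ [0.5862, 0.6056]
print(cond_y_given_x[:, 0])              # ≈ [0.2833, 0.30]
print(cond_x_given_y[0] @ p_y)           # law of total probability: ≈ 0.6 = P(X=0)
```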

Exercise 6.2 (Monty Hall Problem)
You are on a game show where there are three closed doors: behind one is a car (the prize) and behind the other two are goats.

The rules are simple:

  1. You choose one door (say Door 1).
  2. The host, who knows where the car is, opens one of the other doors, always revealing a goat.
  3. You are then offered the chance to stay with your original choice or switch to the other unopened door.

Question: is it in your interest to switch doors? (See 21 Blackjack.)

Solution 6.2. Before any door is opened, the probability that the car is behind each door is
$$P(\text{car behind door 1}) = P(\text{car behind door 2}) = P(\text{car behind door 3}) = \tfrac{1}{3} = 33.\overline{3}\%.$$
Suppose you picked Door 1 and the host (Monty) opens, say, Door 3, revealing a goat. Now the conditional probabilities are:

  • If the car is behind Door 1: Monty could open either Door 2 or Door 3 with equal probability.

  • If the car is behind Door 2: Monty is forced to open Door 3.

  • If the car is behind Door 3: Monty is forced to open Door 2 (so this case is impossible if Monty opens Door 3).

Applying Bayes’ rule:
$$P(\text{car behind door 1} \mid \text{Monty opens door 3}) = \frac{\frac{1}{3}\cdot\frac{1}{2}}{\frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot 1} = \frac{1/6}{1/6 + 1/3} = \frac{1}{3} = 33.\overline{3}\%.$$
On the other hand, the other unopened door has a probability of winning of
$$P(\text{car behind door 2} \mid \text{Monty opens door 3}) = \frac{\frac{1}{3}\cdot 1}{\frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot 1} = \frac{1/3}{1/6 + 1/3} = \frac{2}{3} = 66.\overline{6}\%.$$
After Monty opens a goat door, the probability that the car is behind your original choice is still $\frac{1}{3}$, while the probability that it is behind the other unopened door is $\frac{2}{3}$. Therefore, switching doubles the chances of winning.
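The Bayes computation agrees with a direct Monte Carlo sketch of the game (illustrative code, assuming the host always reveals a goat and chooses at random between the two goat doors when the car is behind your pick).

```python
import random

random.seed(0)

def monty_hall(switch, trials=100_000):
    """Monte Carlo estimate of the winning probability with or without switching."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # The host opens a door that is neither the pick nor the car.
        opened = next(d for d in random.sample(range(3), 3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(monty_hall(switch=False))   # ≈ 1/3
print(monty_hall(switch=True))    # ≈ 2/3
```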

6.2.1 Conditional variance

Proposition 6.3 (Conditional variance)
Let’s consider two random variables $X$ and $Y$ with finite second moment. Then the total variance can be decomposed as:
$$\mathbb{V}\{X\} = \mathbb{E}\{\mathbb{V}\{X \mid Y\}\} + \mathbb{V}\{\mathbb{E}\{X \mid Y\}\}, \qquad \mathbb{V}\{Y\} = \mathbb{E}\{\mathbb{V}\{Y \mid X\}\} + \mathbb{V}\{\mathbb{E}\{Y \mid X\}\}.$$

Proof. By definition, the variance of a random variable reads $\mathbb{V}\{X\} = \mathbb{E}\{X^2\} - \mathbb{E}\{X\}^2$. Applying the chain rule (6.1) one can write
$$\mathbb{V}\{X\} = \mathbb{E}\{\mathbb{E}\{X^2 \mid Y\}\} - \mathbb{E}\{\mathbb{E}\{X \mid Y\}\}^2.$$
Then, adding and subtracting $\mathbb{E}\{\mathbb{E}\{X \mid Y\}^2\}$ gives
$$\mathbb{V}\{X\} = \mathbb{E}\{\mathbb{E}\{X^2 \mid Y\}\} - \mathbb{E}\{\mathbb{E}\{X \mid Y\}\}^2 + \mathbb{E}\{\mathbb{E}\{X \mid Y\}^2\} - \mathbb{E}\{\mathbb{E}\{X \mid Y\}^2\}.$$
Grouping the first and fourth terms and the second and third, one obtains
$$\mathbb{V}\{X\} = \mathbb{E}\{\mathbb{E}\{X^2 \mid Y\} - \mathbb{E}\{X \mid Y\}^2\} + \mathbb{E}\{\mathbb{E}\{X \mid Y\}^2\} - \mathbb{E}\{\mathbb{E}\{X \mid Y\}\}^2,$$
which simplifies to $\mathbb{V}\{X\} = \mathbb{E}\{\mathbb{V}\{X \mid Y\}\} + \mathbb{V}\{\mathbb{E}\{X \mid Y\}\}$.
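A numerical sketch of the decomposition (illustrative setup: $Y$ is a die roll and $X = Y$ plus Gaussian noise; conditional moments are approximated by group statistics):

```python
import numpy as np

rng = np.random.default_rng(3)

# Y is a die roll, X = Y + noise: check V{X} = E{V{X|Y}} + V{E{X|Y}} numerically.
n = 500_000
Y = rng.integers(1, 7, size=n)
X = Y + rng.normal(scale=2.0, size=n)

cond_mean = np.array([X[Y == v].mean() for v in range(1, 7)])   # E{X | Y=v}
cond_var  = np.array([X[Y == v].var()  for v in range(1, 7)])   # V{X | Y=v}
p = np.array([np.mean(Y == v) for v in range(1, 7)])            # P(Y=v)

total = X.var()
decomposed = p @ cond_var + p @ (cond_mean - p @ cond_mean) ** 2
print(total, decomposed)      # both ≈ Var(noise) + Var(die) = 4 + 35/12 ≈ 6.92
```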