6  Conditional expectation

Reference: Pietro Rigo (2023)

Theorem 6.1 (\(\color{magenta}{\textbf{Radon–Nikodym}}\))
Consider a measurable space \((\Omega, \mathcal{B})\) and two \(\sigma\)-finite measures (Definition 30.4) \(\mu\) and \(\nu\) such that \(\mu \ll \nu\) (Definition 30.1). Then there exists a measurable function \(X: \Omega \to \mathbb{R}\) such that: \[ \mu(B) = \int_B X \, d\nu \quad \forall B \in \mathcal{B} \]
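On a finite sample space the Radon–Nikodym derivative reduces to a ratio of point masses, which makes the theorem easy to check by hand. The following is a minimal sketch with illustrative measures \(\mu\) and \(\nu\):

```python
from fractions import Fraction
from itertools import chain, combinations

# Finite sample space: here the Radon-Nikodym derivative is just a ratio of point masses.
omega = ["a", "b", "c"]
nu = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}  # reference measure
mu = {"a": Fraction(1, 4), "b": Fraction(1, 2), "c": Fraction(1, 4)}  # mu << nu since nu charges every point

# Density X = d(mu)/d(nu), defined pointwise as mu({w}) / nu({w})
X = {w: mu[w] / nu[w] for w in omega}

# Check mu(B) = int_B X d(nu) for every B in the sigma-field (all subsets here)
subsets = chain.from_iterable(combinations(omega, r) for r in range(len(omega) + 1))
for B in subsets:
    assert sum(mu[w] for w in B) == sum(X[w] * nu[w] for w in B)
```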

Definition 6.1 (\(\color{magenta}{\textbf{Conditional expectation}}\))
Given a probability space \((\Omega, \mathcal{B}, \mathbb{P})\), let’s consider a sub \(\sigma\)-field of \(\mathcal{B}\), i.e. \(\mathcal{G} \subset \mathcal{B}\), and a random variable \(X: \Omega \rightarrow \mathbb{R}\) with finite expectation, i.e. \(\mathbb{E}\{|X|\} < + \infty\). Then, the conditional expectation of \(X\) given \(\mathcal{G}\) is any random variable \[ Z = \mathbb{E}\{X \mid \mathcal{G}\} \text{,} \] such that:

  1. \(Z\) has finite expectation, i.e. \(\mathbb{E}\{|Z|\} < + \infty\).
  2. \(Z\) is \(\mathcal{G}\)-measurable.
  3. \(\mathbb{E}\{\mathbb{1}_{A} Z \} = \mathbb{E}\{\mathbb{1}_{A}X\}\), \(\forall A \in \mathcal{G}\), namely if \(X\) and \(Z\) are restricted to \(A \in \mathcal{G}\), then their expectation coincides.

A \(\sigma\)-field can be used to describe our state of information: \(\forall A \in \mathcal{G}\) we already know whether the event \(A\) has occurred or not. Therefore, when we collect in \(\mathcal{G}\) the events whose occurrence is known, saying that the random variable \(Z\) is \(\mathcal{G}\)-measurable means that the value of \(Z\) is no longer stochastic once we know the information contained in \(\mathcal{G}\). In this context, one can see the random variable \(Z = \mathbb{E}\{X \mid \mathcal{G}\}\) as the prediction of \(X\), given the information contained in the sub \(\sigma\)-field \(\mathcal{G}\).
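Let's make Definition 6.1 concrete on a finite example. Take a fair die roll with \(X(\omega) = \omega\) and let \(\mathcal{G}\) be generated by the partition \(\{\text{odd}, \text{even}\}\); the die and partition are illustrative choices, not from the text. Then \(Z = \mathbb{E}\{X \mid \mathcal{G}\}\) is the within-cell average, and requirement 3 can be checked directly:

```python
from fractions import Fraction

# Omega = a fair die roll, X(w) = w; we only get to observe the parity of the roll.
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}
X = {w: Fraction(w) for w in omega}
partition = [(1, 3, 5), (2, 4, 6)]  # generates G: "odd or even" is all we know

# Z = E{X | G} is constant on each cell: the P-weighted average of X over the cell.
Z = {}
for cell in partition:
    avg = sum(X[w] * P[w] for w in cell) / sum(P[w] for w in cell)
    for w in cell:
        Z[w] = avg

# Requirement 3: E{1_A Z} = E{1_A X} for every A in G (unions of cells).
for A in [(), (1, 3, 5), (2, 4, 6), tuple(omega)]:
    assert sum(Z[w] * P[w] for w in A) == sum(X[w] * P[w] for w in A)
```

Here \(Z\) equals 3 on odd outcomes and 4 on even ones: the best guess of the roll given only its parity.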

Definition 6.2 (\(\color{magenta}{\textbf{Predictor}}\))
Consider any \(\mathcal{G}\)-measurable random variable \(Z\). Then \(Z\) can be interpreted as a predictor of another random variable \(X\) under the information contained in the \(\sigma\)-field \(\mathcal{G}\). When we substitute \(X\) with its prediction \(Z\), we make an error given by the difference \(X - Z\). In the special case in which \(\mathbb{E}\{|Z|^2\} < \infty\), and using as error function the mean squared error, i.e.  \[ \mathbb{E}\{\text{error}^2\} = \mathbb{E}\{(X - Z)^2\} \text{,} \] it is possible to prove that the conditional expectation \(\mathbb{E}\{X \mid \mathcal{G}\}\) represents the best predictor of \(X\), in the sense that it minimizes the mean squared error, i.e.  \[ \mathbb{E}\{(X - \mathbb{E}\{X \mid \mathcal{G}\})^2\} = \underset{\small Z \in \mathcal{Z}}{\min}\left[\mathbb{E}\{(X - Z)^2\}\right] \text{.} \] Hence, \(\mathbb{E}\{X \mid \mathcal{G}\}\) is the predictor that minimizes the mean squared error over the class \(\mathcal{Z}\) of \(\mathcal{G}\)-measurable random variables with finite second moment, i.e. \[ \mathbb{E}\{X \mid \mathcal{G}\} = \underset{\small Z \in \mathcal{Z}}{\text{argmin}}\left[\mathbb{E}\{(X - Z)^2\}\right] \text{,} \] where \(\mathcal{Z} = \{Z : Z \text{ is } \mathcal{G}\text{-measurable} \; \text{and} \; \mathbb{E}\{|Z|^2\} < \infty \}\).
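The minimization can be seen numerically. In the die-roll example (an illustrative setup, with \(\mathcal{G}\) generated by parity), every \(\mathcal{G}\)-measurable predictor is a pair \((\alpha, \beta)\): \(\alpha\) on odd outcomes, \(\beta\) on even ones. Comparing the conditional-mean predictor against a grid of alternatives:

```python
from fractions import Fraction

# Die example: X(w) = w, G generated by the parity partition {odd, even}.
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}
X = {w: Fraction(w) for w in omega}

def mse(alpha, beta):
    """MSE of the G-measurable predictor Z = alpha on odd outcomes, beta on even ones."""
    return sum((X[w] - (alpha if w % 2 else beta)) ** 2 * P[w] for w in omega)

# Conditional means: E{X | odd} = 3, E{X | even} = 4.
best = mse(Fraction(3), Fraction(4))

# No other G-measurable predictor on a grid of candidates does better.
grid = [Fraction(n, 4) for n in range(29)]  # 0, 1/4, ..., 7
assert all(mse(a, b) >= best for a in grid for b in grid)
```

The minimum `best` equals \(8/3\), the average within-cell variance of \(X\).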

6.1 Properties of conditional expectation

Here we state some useful properties of conditional expectation:

  1. Linearity: The conditional expectation is linear for all constants \(a, b \in \mathbb{R}\), i.e.  \[ \mathbb{E}\{ a X + b Y \mid \mathcal{G} \} = a \mathbb{E}\{X \mid \mathcal{G} \} + b \mathbb{E}\{Y \mid \mathcal{G}\} \text{.} \]
  2. Positivity: \(X \ge 0\) implies that \(\mathbb{E}\{X \mid \mathcal{G}\} \ge 0\).
  3. Measurability: If \(Y\) is \(\mathcal{G}\)-measurable, then \(\mathbb{E}\{XY \mid \mathcal{G}\} = Y \mathbb{E}\{X \mid \mathcal{G}\}\). In particular, if \(X\) is \(\mathcal{G}\)-measurable then \(\mathbb{E}\{X \mid \mathcal{G}\} = X\), i.e. \(X\) is not stochastic given the information in \(\mathcal{G}\).
  4. Constant: The conditional expectation of a constant is a constant, i.e.  \[ \mathbb{E}\{a \mid \mathcal{G}\} = a \; \; \forall a \in \mathbb{R} \text{.} \]
  5. Independence: If \(X\) is independent from the \(\sigma\)-field \(\mathcal{G}\), then \(\mathbb{E}\{X \mid \mathcal{G}\} = \mathbb{E}\{X\}\).
  6. Chain rule (tower property): If one considers two sub \(\sigma\)-fields of \(\mathcal{B}\) such that \(\mathcal{G_1} \subset \mathcal{G_2}\), then we can write: \[ \mathbb{E}\{X \mid \mathcal{G_1}\} = \mathbb{E}\{\mathbb{E}\{X \mid \mathcal{G_2}\} \mid \mathcal{G_1}\} \quad \text{whenever} \quad \mathcal{G_1} \subset \mathcal{G_2} \text{.} \tag{6.1}\] Remember that, when using the chain rule, it is mandatory to take the conditional expectation first with respect to the largest \(\sigma\)-field, i.e. the one that contains more information (in this case \(\mathcal{G_2}\)), and then with respect to the smallest one (in this case \(\mathcal{G_1}\)).
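The chain rule can be verified numerically on the die-roll setup (an illustrative example): take \(\mathcal{G}_1\) generated by parity and \(\mathcal{G}_2\) by a finer partition, so \(\mathcal{G}_1 \subset \mathcal{G}_2\):

```python
from fractions import Fraction

omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}
X = {w: Fraction(w) for w in omega}

def cond_exp(f, partition):
    """E{f | G} for G generated by a finite partition: average f within each cell."""
    Z = {}
    for cell in partition:
        avg = sum(f[w] * P[w] for w in cell) / sum(P[w] for w in cell)
        for w in cell:
            Z[w] = avg
    return Z

G1 = [(1, 3, 5), (2, 4, 6)]        # coarse information: parity only
G2 = [(1,), (3, 5), (2,), (4, 6)]  # finer partition, so G1 is a sub-sigma-field of G2

# Tower property (Equation 6.1): conditioning first on the larger sigma-field changes nothing.
assert cond_exp(cond_exp(X, G2), G1) == cond_exp(X, G1)
```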

6.2 Conditional probability

Proposition 6.1 (\(\color{magenta}{\textbf{Conditional probability}}\))
Given a probability space \((\Omega, \mathcal{F}, \mathbb{P})\), consider \(\mathcal{G}\) as a sub \(\sigma\)-field of \(\mathcal{F}\), i.e. \(\mathcal{G} \subset \mathcal{F}\). Then the general definition of the conditional probability of an event \(A\), given \(\mathcal{G}\), is \[ \mathbb{P}(A \mid \mathcal{G}) = \mathbb{E}(\mathbb{1}_A \mid \mathcal{G}) \text{.} \tag{6.2}\] The elementary definition, instead, does not condition with respect to a \(\sigma\)-field, but with respect to a single event \(B\). In practice, take an event \(B \in \mathcal{F}\) such that \(0 < \mathbb{P}(B) < 1\); then \(\forall A \in \mathcal{F}\) the conditional probability of \(A\) given \(B\) is defined as: \[ \mathbb{P}(A \mid B) = \frac{\mathbb{P}(A {\color{blue}{\cap}} B)}{\mathbb{P}(B)} \text{,} \quad \mathbb{P}(A \mid B^c) = \frac{\mathbb{P}(A {\color{blue}{\cap}} B^c)}{\mathbb{P}(B^c)} \text{.} \tag{6.3}\]

Proof. The elementary (Equation 6.3) and the general (Equation 6.2) definitions are equivalent. In fact, consider a sub \(\sigma\)-field \(\mathcal{G}\) which provides only the information concerning whether \(\omega\) is in \(B\) or not. A \(\sigma\)-field of this kind has the form \(\mathcal{G}_{B} = \{\Omega, \emptyset, B, B^c\}\). Then, consider a \(\mathcal{G}_{B}\)-measurable function, \(f: \Omega \to \mathbb{R}\), such that: \[ f(\omega) = \begin{cases} \alpha \quad \omega \in B \\ \beta \quad \omega \in B^c \end{cases} \] It remains to find \(\alpha\) and \(\beta\) in the following expression: \[ \mathbb{P}(A \mid \mathcal{G}_B) = \mathbb{E}\{\mathbb{1}_A \mid \mathcal{G}_B\} = \alpha \mathbb{1}_{B} + \beta \mathbb{1}_{B^c} \text{.} \] Note that the joint probability of \(A\) and \(B\) can be obtained as: \[ \begin{aligned} \mathbb{P}(A {\color{blue}{\cap}} B){} & = \mathbb{E}\{\mathbb{1}_A \mathbb{1}_B \} = \mathbb{E}\{\mathbb{E}\{\mathbb{1}_A \mathbb{1}_B \mid \mathcal{G}_B\}\} = \\ & = \mathbb{E}\{\mathbb{E}\{\mathbb{1}_A \mid \mathcal{G}_B\}\mathbb{1}_B \} = \\ & = \mathbb{E}\{\mathbb{P}(A \mid \mathcal{G}_B)\mathbb{1}_B \} = \\ & = \mathbb{E}\{(\alpha \mathbb{1}_{B} + \beta \mathbb{1}_{B^c})\mathbb{1}_B\} = \\ & = \alpha \mathbb{E}\{\mathbb{1}_{B}\} + \beta \mathbb{E}\{\mathbb{1}_{B^c}\mathbb{1}_B \} = \\ & = \alpha \mathbb{P}(B) \end{aligned} \] where the second equality uses the chain rule (Equation 6.1), the third uses the \(\mathcal{G}_B\)-measurability of \(\mathbb{1}_B\), and the last one uses \(\mathbb{1}_{B^c}\mathbb{1}_B = 0\). Hence, we obtain: \[ \mathbb{P}(A {\color{blue}{\cap}} B) = \alpha \ \mathbb{P}(B) \implies \alpha = \frac{\mathbb{P}(A {\color{blue}{\cap}} B)}{\mathbb{P}(B)} \text{.} \] Analogously, for \(\mathbb{P}(A {\color{blue}{\cap}} B^c)\) it is possible to prove that: \[ \mathbb{P}(A {\color{blue}{\cap}} B^c) = \beta \ \mathbb{P}(B^c) \implies \beta = \frac{\mathbb{P}(A {\color{blue}{\cap}} B^c)}{\mathbb{P}(B^c)} \text{.} \] Finally, thanks to this result it is possible to express the conditional probability in the general definition (Equation 6.2) as a linear combination of conditional probabilities defined according to the elementary definition (Equation 6.3), i.e. \[ \mathbb{P}(A \mid \mathcal{G}_{B}) = \mathbb{P}(A \mid B) \mathbb{1}_B + \mathbb{P}(A \mid B^c) \mathbb{1}_{B^c} \text{.} \]
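The proof above can be checked on a small example. With a fair die, take the illustrative events \(B = \{\text{even}\}\) and \(A = \{\omega \ge 4\}\), and compute \(\alpha = \mathbb{P}(A \mid B)\) and \(\beta = \mathbb{P}(A \mid B^c)\):

```python
from fractions import Fraction

# Die roll: B = "even outcome", A = "outcome at least 4" (illustrative choices).
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}
A, B = {4, 5, 6}, {2, 4, 6}
Bc = set(omega) - B

def prob(E):
    return sum(P[w] for w in E)

# Elementary definition (Equation 6.3): the alpha and beta of the proof.
alpha = prob(A & B) / prob(B)   # P(A | B)
beta = prob(A & Bc) / prob(Bc)  # P(A | B^c)

# General definition (Equation 6.2): P(A | G_B) = alpha 1_B + beta 1_{B^c}.
p_A_given_GB = {w: (alpha if w in B else beta) for w in omega}

# Averaging the conditional probability recovers P(A) (law of total probability).
assert sum(p_A_given_GB[w] * P[w] for w in omega) == prob(A)
```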

Exercise 6.1 Let’s continue from Exercise 3.1 and suppose we observe \(X(\omega) = \{+1\}\). We then ask ourselves: what is the probability that the next extraction gives \(X(\omega) = \{0\}\)?

Solution 6.1. With 52 cards, the probability of obtaining \(X(\omega) = \{0\}\) is \(\frac{12}{52} \approx 23.08 \%\) (see Solution 3.2), while that of \(X(\omega) = \{+1\}\) is \(\frac{20}{52} \approx 38.46 \%\).

Hence, if we have extracted a card that originates an \(X(\omega) = \{+1\}\), then only 19 cards giving \(\{+1\}\) remain in the deck, while the total number of cards reduces to 51. Thus, the conditional probability of another \(+1\) decreases \[ \mathbb{P}(X(\omega_2) = \{+1\} \mid X(\omega_1) = \{+1\}) = \frac{19}{51} \approx 37.25\% \text{.} \] On the other hand, the conditional probability of a \(0\) increases \[ \mathbb{P}(X(\omega_2) = \{0\} \mid X(\omega_1) = \{+1\}) = \frac{12}{51} \approx 23.53 \% \text{.} \]
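The arithmetic, assuming the counts from Exercise 3.1 (20 cards worth \(+1\) and 12 worth \(0\) in a 52-card deck), is quickly checked with exact fractions:

```python
from fractions import Fraction

# After drawing a +1 card, 51 cards remain, of which 19 still give +1
# and all twelve 0-cards are still in the deck.
p_plus1 = Fraction(19, 51)  # next card gives +1
p_zero = Fraction(12, 51)   # next card gives 0

print(round(float(p_plus1) * 100, 2), round(float(p_zero) * 100, 2))  # 37.25 23.53
```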

Proposition 6.2 (\(\color{magenta}{\textbf{Conditional probability and independent events}}\))
Let \(A\) and \(B\) be two events with \(\mathbb{P}(B) > 0\). Then \[ A \perp B \iff \mathbb{P}(A \mid B) = \mathbb{P}(A) \text{.} \]

Proof. To prove the result in Proposition 6.2 we consider both sides of the expression.

  1. (LHS \(\implies\) RHS) Let’s assume that \(A\) and \(B\) are independent, i.e. \(A \perp B\). Then, by definition (Definition 4.1) under independence holds the decomposition in Equation 4.1, i.e. \[ \mathbb{P}(A {\color{blue}{\cap}} B) = \mathbb{P}(A) \mathbb{P}(B) \implies \mathbb{P}(A \mid B) = \frac{\mathbb{P}(A {\color{blue}{\cap}} B)}{\mathbb{P}(B)} = \mathbb{P}(A) \text{.} \]

  2. (RHS \(\implies\) LHS) Let’s assume that \(\mathbb{P}(A \mid B) = \mathbb{P}(A)\), with \(\mathbb{P}(B) > 0\). Then, by the definition of conditional probability (Equation 6.3), the joint probability is: \[ \mathbb{P}(A {\color{blue}{\cap}} B) = \mathbb{P}(A \mid B) \mathbb{P}(B) = \mathbb{P}(A) \mathbb{P}(B) \text{,} \] which is exactly the definition of independence (Definition 4.1).

Theorem 6.2 (\(\color{magenta}{\textbf{Bayes' Theorem}}\))
Let’s consider a partition \(\{A_1, \dots, A_n\}\) of disjoint events, each \(A_i \subset \Omega\), such that \({\color{red}\sqcup}_{i=1}^n A_i = \Omega\). Given any event \(B \subset \Omega\) with \(\mathbb{P}(B) > 0\), for any \(j \in \{1,2,\dots,n\}\) the conditional probability of the event \(A_j\) given \(B\) is defined as: \[ \mathbb{P}(A_j \mid B) = \frac{\mathbb{P}(B \mid A_j) \mathbb{P}(A_j)}{\sum_{i = 1}^{n} \mathbb{P}(B \mid A_i) \mathbb{P}(A_i)} \text{.} \]
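Bayes' Theorem translates directly into code. As a sketch, the function below computes the posterior over a finite partition; the diagnostic-test numbers are illustrative assumptions, not from the text:

```python
from fractions import Fraction

def bayes(prior, likelihood, j):
    """Posterior P(A_j | B) from priors P(A_i) and likelihoods P(B | A_i) on a partition."""
    evidence = sum(l * p for l, p in zip(likelihood, prior))
    return likelihood[j] * prior[j] / evidence

# Illustrative numbers (assumptions): a diagnostic test with 99% sensitivity,
# a 5% false-positive rate, and a disease prevalence of 1%.
prior = [Fraction(1, 100), Fraction(99, 100)]       # A_1 = sick, A_2 = healthy
likelihood = [Fraction(99, 100), Fraction(5, 100)]  # P(positive test | A_i)

posterior_sick = bayes(prior, likelihood, 0)
print(posterior_sick)  # 1/6
```

Despite the high sensitivity, the posterior probability of being sick given a positive test is only \(1/6\), because the prior \(\mathbb{P}(A_1)\) is small.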

Example 6.1 Let’s consider two random variables \(X(\omega)\) and \(Y(\omega)\) taking values in \(\{0,1\}\). The marginal probabilities are \(\mathbb{P}(X = 0) = 0.6\) and \(\mathbb{P}(Y = 0) = 0.29\). Let’s consider the matrix of joint events and probabilities, i.e.  \[ \begin{pmatrix} [X = 0] {\color{blue}{\cap}} [Y = 0] & [X = 0] {\color{blue}{\cap}} [Y = 1] \\ [X = 1] {\color{blue}{\cap}} [Y = 0] & [X = 1] {\color{blue}{\cap}} [Y = 1] \end{pmatrix} \overset{\mathbb{P}}{\longrightarrow} \begin{pmatrix} 0.17 & 0.43 \\ 0.12 & 0.28 \end{pmatrix} \text{.} \] Then, by definition the conditional probabilities are defined as: \[ \mathbb{P}(X = 0 \mid Y = 0) = \frac{\mathbb{P}(X = 0 {\color{blue}{\cap}} Y = 0)}{\mathbb{P}(Y = 0)} = \frac{0.17}{0.29} \approx 58.62 \% \text{,} \] and \[ \mathbb{P}(X = 0 \mid Y = 1) = \frac{\mathbb{P}(X = 0 {\color{blue}{\cap}} Y = 1)}{\mathbb{P}(Y = 1)} = \frac{0.43}{1-0.29} \approx 60.56 \% \text{.} \] Considering \(Y\) instead: \[ \mathbb{P}(Y = 0 \mid X = 0) = \frac{\mathbb{P}(Y = 0 {\color{blue}{\cap}} X = 0)}{\mathbb{P}(X = 0)} = \frac{0.17}{0.6} \approx 28.33 \% \text{,} \] and \[ \mathbb{P}(Y = 0 \mid X = 1) = \frac{\mathbb{P}(Y = 0 {\color{blue}{\cap}} X = 1)}{\mathbb{P}(X = 1)} = \frac{0.12}{1-0.6} \approx 30 \% \text{.} \] Then, it is possible to express the marginal probability of \(X\) as: \[ \begin{aligned} \mathbb{P}(X = 0) & {} = \mathbb{E}\{\mathbb{P}(X = 0 \mid Y)\} = \\ & = \mathbb{P}(X = 0 \mid Y = 0) \mathbb{P}(Y = 0) + \mathbb{P}(X = 0 \mid Y = 1) \mathbb{P}(Y = 1) = \\ & = 0.5862 \cdot 0.29 + 0.6056 \cdot (1 - 0.29) \approx 60 \% \text{.} \end{aligned} \] Similarly, for \(Y\): \[ \begin{aligned} \mathbb{P}(Y = 0) & {} = \mathbb{E}\{\mathbb{P}(Y = 0 \mid X)\} = \\ & = \mathbb{P}(Y = 0 \mid X = 0) \mathbb{P}(X = 0) + \mathbb{P}(Y = 0 \mid X = 1) \mathbb{P}(X = 1) = \\ & = 0.2833 \cdot 0.6 + 0.30 \cdot (1 - 0.6) \approx 29 \% \text{.} \end{aligned} \]
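The computations of Example 6.1 can be reproduced exactly with fractions, which also confirms that averaging the conditionals recovers the marginals:

```python
from fractions import Fraction

# Joint probabilities of Example 6.1 as exact fractions.
joint = {(0, 0): Fraction(17, 100), (0, 1): Fraction(43, 100),
         (1, 0): Fraction(12, 100), (1, 1): Fraction(28, 100)}

# Marginals recovered by summing over the other variable.
pX = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
pY = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}
assert pX[0] == Fraction(3, 5) and pY[0] == Fraction(29, 100)

# Conditional probabilities P(X = 0 | Y = y).
pX0_given_Y = {y: joint[(0, y)] / pY[y] for y in (0, 1)}

# Law of total probability: averaging the conditionals gives back the marginal.
assert sum(pX0_given_Y[y] * pY[y] for y in (0, 1)) == pX[0]
print(round(float(pX0_given_Y[0]) * 100, 2), round(float(pX0_given_Y[1]) * 100, 2))  # 58.62 60.56
```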

Exercise 6.2 (\(\color{magenta}{\textbf{Monty Hall Problem}}\))
You are on a game show where there are three closed doors: behind one is a car (the prize) and behind the other two are goats.

The rules are simple:

  1. You choose one door (say Door 1).
  2. The host, who knows where the car is, opens one of the other doors, always revealing a goat.
  3. You are then offered the chance to stay with your original choice or switch to the other unopened door.

Question: Is it in your interest to switch doors? (See 21 Blackjack)

Solution 6.2. Before any door is opened, the probability that the car is behind each door is \[ \mathbb{P}(\text{car behind door 1}) = \mathbb{P}(\text{car behind door 2}) = \mathbb{P}(\text{car behind door 3}) = \frac{1}{3} = 33.\bar{3} \% \text{.} \] Suppose you picked Door 1. The host, Monty, opens (say) Door 3, revealing a goat. Now consider the likelihoods of this observation:

  • If the car is behind Door 1: Monty could open either Door 2 or Door 3 with equal probability.

  • If the car is behind Door 2: Monty is forced to open Door 3.

  • If the car is behind Door 3: Monty is forced to open Door 2 (so this case is impossible if Monty opens Door 3).

Apply Bayes’ Rule: \[ \mathbb{P}(\text{car behind door 1} \mid \text{Monty opens door 3}) = \frac{\tfrac{1}{3}\cdot \tfrac{1}{2}}{\tfrac{1}{3}\cdot\tfrac{1}{2}+\tfrac{1}{3}\cdot 1} = \frac{1/6}{1/6+1/3} = \tfrac{1}{3} = 33.\bar{3} \% \text{.} \] On the other hand, the other door has a probability of winning of \[ \mathbb{P}(\text{car behind door 2} \mid \text{Monty opens door 3}) = \frac{\tfrac{1}{3}\cdot 1}{\tfrac{1}{3}\cdot\tfrac{1}{2}+\tfrac{1}{3}\cdot 1} = \frac{1/3}{1/6+1/3} = \tfrac{2}{3} = 66.\bar{6} \% \text{.} \] After Monty opens a goat door, the probability that the car is behind your original choice is still \(\tfrac{1}{3} \approx 33\%\), while the probability that it is behind the other unopened door is \(\tfrac{2}{3} \approx 67\%\). Therefore, switching doubles the chances of winning.
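A quick Monte Carlo simulation, assuming the standard rules stated above (the host always opens a goat door different from the pick), confirms the Bayes computation:

```python
import random

def play(switch, rng):
    """One round of Monty Hall; returns True if the player wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(42)
n = 100_000
stay = sum(play(False, rng) for _ in range(n)) / n
switch = sum(play(True, rng) for _ in range(n)) / n
print(stay, switch)  # stay ~ 1/3, switch ~ 2/3
```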

6.2.1 Conditional variance

Proposition 6.3 (\(\color{magenta}{\textbf{Conditional variance}}\))
Let’s consider two random variables \(X\) and \(Y\) with finite second moment. Then, the total variance can be decomposed as: \[ \mathbb{V}\{X\} = \mathbb{E}\{\mathbb{V}\{X \mid Y\}\} + \mathbb{V}\{\mathbb{E}\{X \mid Y\}\} \text{,} \] and, symmetrically, \[ \mathbb{V}\{Y\} = \mathbb{E}\{\mathbb{V}\{Y \mid X\}\} + \mathbb{V}\{\mathbb{E}\{Y \mid X\}\} \text{.} \]

Proof. By definition, the variance of a random variable reads: \[ \mathbb{V}\{X\} = \mathbb{E}\{X^2\} - \mathbb{E}\{X\}^2 \text{.} \] Applying the chain rule (Equation 6.1), one can write \[ \mathbb{V}\{X\} = \mathbb{E}\{\mathbb{E}\{X^2 \mid Y\}\} - \mathbb{E}\{\mathbb{E}\{X\mid Y\}\}^2 \text{.} \] Then, adding and subtracting \(\mathbb{E}\{\mathbb{E}\{X \mid Y\}^2\}\) gives \[ \mathbb{V}\{X\} = \mathbb{E}\{\mathbb{E}\{X^2 \mid Y\}\} - \mathbb{E}\{\mathbb{E}\{X\mid Y\}\}^2 {\color{green}{+ \mathbb{E}\{\mathbb{E}\{X \mid Y\}^2\}}} {\color{red}{- \mathbb{E}\{\mathbb{E}\{X \mid Y\}^2\}}} \text{.} \] Grouping the first and fourth terms, and the second and third, yields \[ \mathbb{V}\{X\} = \mathbb{E}\{\mathbb{E}\{X^2 \mid Y\} - {\color{red}{\mathbb{E}\{X \mid Y\}^2}}\} + {\color{green}{\mathbb{E}\{\mathbb{E}\{X \mid Y\}^2\}}} - \mathbb{E}\{\mathbb{E}\{X\mid Y\}\}^2 \text{,} \] which simplifies to \[ \mathbb{V}\{X\} = \mathbb{E}\{\mathbb{V}\{X \mid Y\}\} + \mathbb{V}\{\mathbb{E}\{X \mid Y\}\} \text{.} \]
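The decomposition can be verified exactly on a small discrete joint distribution (the pmf below is an illustrative choice):

```python
from fractions import Fraction

# A small joint pmf for (X, Y) on {0,1} x {0,1} with illustrative numbers.
joint = {(0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
         (1, 0): Fraction(3, 8), (1, 1): Fraction(1, 8)}
pY = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

def e_given_y(g, y):
    """E{g(X) | Y = y} for the discrete joint pmf above."""
    return sum(g(x) * p for (x, yi), p in joint.items() if yi == y) / pY[y]

# Total variance of X.
EX = sum(x * p for (x, _), p in joint.items())
EX2 = sum(x * x * p for (x, _), p in joint.items())
var_X = EX2 - EX ** 2

# The two pieces of the decomposition.
m = {y: e_given_y(lambda x: x, y) for y in (0, 1)}                  # conditional means
v = {y: e_given_y(lambda x: x * x, y) - m[y] ** 2 for y in (0, 1)}  # conditional variances
e_cond_var = sum(v[y] * pY[y] for y in (0, 1))
var_cond_mean = sum(m[y] ** 2 * pY[y] for y in (0, 1)) - sum(m[y] * pY[y] for y in (0, 1)) ** 2

assert var_X == e_cond_var + var_cond_mean  # V{X} = E{V{X|Y}} + V{E{X|Y}}
```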