5  Expectation

Reference: Chapter 5. Resnick ().

The expectation represents a central value of a random variable and has a measure-theoretic counterpart as the Lebesgue-Stieltjes integral of $X$ with respect to a (probability) measure $P$. This kind of integration is defined in steps: integration is first defined for simple functions and then extended to more general random variables.

Let’s define a probability space $(\Omega,\mathcal{B},P)$ and a generic random variable $X$ such that $X:(\Omega,\mathcal{B})\to(\bar{\mathbb{R}},\mathcal{B}(\bar{\mathbb{R}}))$, where $\bar{\mathbb{R}}=[-\infty,\infty]$. Then, the expectation of $X$ is denoted as $$E\{X\}=\int_{\Omega}X\,dP=\int_{\Omega}X(\omega)\,P(d\omega),$$ i.e. the Lebesgue-Stieltjes integral of $X$ with respect to the (probability) measure $P$.

5.1 Simple functions

Let’s start from the definition of the expectation for a restricted class of random variables called simple functions. Generally speaking, a random variable $X(\omega)$ is called simple if it has a finite range.

Formally, let’s consider a probability space $(\Omega,\mathcal{B},P)$ and a $\mathcal{B}/\mathcal{B}(\mathbb{R})$-measurable simple function $X:\Omega\to\mathbb{R}$ defined as follows: $$X(\omega)=\sum_{i=1}^{n}a_i\,\mathbf{1}_{A_i}(\omega), \tag{5.1}$$ where $a_i\in\mathbb{R}$ and the sets $A_i\in\mathcal{B}$ form a disjoint partition of the sample space, i.e. $\bigcup_{i=1}^{n}A_i=\Omega$.

Let’s denote the set of all simple functions on $\Omega$ as $\mathcal{E}$. In this setting, $\mathcal{E}$ is a vector space that satisfies three main properties.

  1. Constant: given a simple function $X\in\mathcal{E}$ and a constant $\alpha\in\mathbb{R}$, then $\alpha X\in\mathcal{E}$. In fact: $$\alpha X=\sum_{i=1}^{n}\alpha a_i\,\mathbf{1}_{A_i}=\sum_{i=1}^{n}a_i'\,\mathbf{1}_{A_i}\in\mathcal{E}, \tag{5.2}$$ where $a_i'=\alpha a_i$.

  2. Linearity: given two simple functions $X,Y\in\mathcal{E}$, then $X+Y\in\mathcal{E}$. In fact: $$X+Y=\sum_{i=1}^{n}a_i\mathbf{1}_{A_i}+\sum_{j=1}^{m}b_j\mathbf{1}_{B_j}=\sum_{i=1}^{n}\sum_{j=1}^{m}(a_i+b_j)\,\mathbf{1}_{A_i}\mathbf{1}_{B_j}=\sum_{i=1}^{n}\sum_{j=1}^{m}(a_i+b_j)\,\mathbf{1}_{A_i\cap B_j}, \tag{5.3}$$ where the sets $\{A_i\cap B_j : 1\le i\le n,\ 1\le j\le m\}$ form a disjoint partition of $\Omega$.

  3. Product: given two simple functions $X,Y\in\mathcal{E}$, then $XY\in\mathcal{E}$. In fact: $$XY=\Big(\sum_{i=1}^{n}a_i\mathbf{1}_{A_i}\Big)\Big(\sum_{j=1}^{m}b_j\mathbf{1}_{B_j}\Big)=\sum_{i=1}^{n}\sum_{j=1}^{m}a_ib_j\,\mathbf{1}_{A_i}\mathbf{1}_{B_j}=\sum_{i=1}^{n}\sum_{j=1}^{m}a_ib_j\,\mathbf{1}_{A_i\cap B_j}. \tag{5.4}$$
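As a numerical sanity check (a sketch in Python; the finite sample space, the two partitions and the helper `eval_simple` are invented for the illustration), the refined-partition identities (5.3) and (5.4) can be verified pointwise:

```python
# A simple function on a finite sample space Omega, stored as
# a list of (value, event) pairs over a disjoint partition.
Omega = {0, 1, 2, 3, 4, 5}

# X = 1*1_{0,1} + 2*1_{2,3,4,5},  Y = 10*1_{0,2,4} + 20*1_{1,3,5}
X = [(1.0, {0, 1}), (2.0, {2, 3, 4, 5})]
Y = [(10.0, {0, 2, 4}), (20.0, {1, 3, 5})]

def eval_simple(F, w):
    """Evaluate a simple function sum_i a_i 1_{A_i} at the outcome w."""
    return sum(a for a, A in F if w in A)

# Closure under sum and product via the refined partition A_i ∩ B_j
X_plus_Y = [(a + b, A & B) for a, A in X for b, B in Y if A & B]
X_times_Y = [(a * b, A & B) for a, A in X for b, B in Y if A & B]

assert all(eval_simple(X_plus_Y, w) == eval_simple(X, w) + eval_simple(Y, w) for w in Omega)
assert all(eval_simple(X_times_Y, w) == eval_simple(X, w) * eval_simple(Y, w) for w in Omega)
```

The intersections $A_i\cap B_j$ are again disjoint, so the sum and the product are themselves simple functions, exactly as claimed.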

5.1.1 Expectation of simple functions

The expectation of a simple function $X$ is defined as: $$E\{X\}=\int_{\Omega}X\,dP=\sum_{i=1}^{n}a_i\,P(A_i), \tag{5.5}$$ where $|a_i|<\infty$.

  1. Expectation of an indicator function: we have that $E\{\mathbf{1}_A(\omega)\}=P(\omega\in A)=P(A)$.

  2. Non-negativity: if $X\ge 0$ and $X\in\mathcal{E}$, then $E\{X\}\ge 0$.

  3. Linearity: the expectation of simple functions is linear, i.e. $E\{\alpha X+\beta Y\}=\alpha E\{X\}+\beta E\{Y\}$.

  4. Monotonicity: the expectation of simple functions is monotone on $\mathcal{E}$, in the sense that if two random variables $X,Y\in\mathcal{E}$ are such that $X\le Y$, then $E\{X\}\le E\{Y\}$.

  5. Continuity: the expectation of simple functions is continuous on $\mathcal{E}$, in the sense that if $X_n,X\in\mathcal{E}$ and either $X_n\uparrow X$ or $X_n\downarrow X$, then $E\{X_n\}\uparrow E\{X\}$ or $E\{X_n\}\downarrow E\{X\}$, respectively.

Proof. Let’s consider two simple functions, i.e. $$X(\omega)=\sum_{i=1}^{n}a_i\mathbf{1}_{A_i}(\omega)\quad\text{and}\quad Y(\omega)=\sum_{j=1}^{m}b_j\mathbf{1}_{B_j}(\omega),$$ and let’s fix $\alpha,\beta\in\mathbb{R}$. Then, by the second property of the vector space $\mathcal{E}$ (5.3), it is possible to write: $$\alpha X+\beta Y=\sum_{i=1}^{n}\sum_{j=1}^{m}(\alpha a_i+\beta b_j)\,\mathbf{1}_{A_i\cap B_j}.$$ Then, taking the expectation on both sides:
$$\begin{aligned}E\{\alpha X+\beta Y\}&=\sum_{i=1}^{n}\sum_{j=1}^{m}(\alpha a_i+\beta b_j)\,P(A_i\cap B_j)\\&=\sum_{i=1}^{n}\alpha a_i\sum_{j=1}^{m}P(A_i\cap B_j)+\sum_{j=1}^{m}\beta b_j\sum_{i=1}^{n}P(A_i\cap B_j).\end{aligned}$$ Fixing $i$, the events $A_i\cap B_j$ for $j=1,\dots,m$ are disjoint, since by definition the $B_j$ are disjoint. Hence, applying $\sigma$-additivity it is possible to write: $$\sum_{j=1}^{m}P(A_i\cap B_j)=P\Big(\bigcup_{j=1}^{m}(A_i\cap B_j)\Big)=P\Big(A_i\cap\Big(\bigcup_{j=1}^{m}B_j\Big)\Big)=P(A_i\cap\Omega)=P(A_i).$$ Therefore, the expectation simplifies to: $$E\{\alpha X+\beta Y\}=\sum_{i=1}^{n}\alpha a_i\,P(A_i)+\sum_{j=1}^{m}\beta b_j\,P(B_j)=\alpha E\{X\}+\beta E\{Y\}.$$
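The linearity property just proved can be checked exactly on a small finite probability space (a sketch; the space, the masses and the two simple functions below are invented for the illustration, and exact rational arithmetic avoids rounding):

```python
from fractions import Fraction

# A finite probability space: P assigns mass to each outcome (invented example).
P = {0: Fraction(1, 6), 1: Fraction(1, 6), 2: Fraction(1, 3), 3: Fraction(1, 3)}

# Simple functions stored as {outcome: value} maps.
X = {0: 1, 1: 1, 2: 5, 3: 5}   # X = 1*1_{0,1} + 5*1_{2,3}
Y = {0: 2, 1: 4, 2: 2, 3: 4}   # Y = 2*1_{0,2} + 4*1_{1,3}

def E(Z):
    """Expectation of a simple function: sum of value * probability, as in (5.5)."""
    return sum(Z[w] * P[w] for w in P)

alpha, beta = 3, -2
lhs = E({w: alpha * X[w] + beta * Y[w] for w in P})
rhs = alpha * E(X) + beta * E(Y)
assert lhs == rhs  # linearity: E{aX + bY} = aE{X} + bE{Y}
```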

5.2 Extension of the definition

Simple functions are the building blocks in the definition of the expectation in terms of the Lebesgue-Stieltjes integral. In fact, a known result, the Measurability theorem, shows that any measurable function can be approximated by a sequence of simple functions.

Theorem 5.1 (Measurability theorem)
Suppose that $X(\omega)\ge 0$ for all $\omega\in\Omega$. Then, $X$ is $\mathcal{B}/\mathcal{B}(\mathbb{R})$-measurable if and only if there exist simple functions $X_n\in\mathcal{E}$ such that $0\le X_n\uparrow X$, i.e. $X=\lim_{n\to\infty}X_n$.
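The standard construction behind this theorem caps $X$ at level $n$ and rounds it down to the dyadic grid $k/2^n$; a minimal sketch (the function name and the evaluation point are invented for the example):

```python
def dyadic_approx(x, n):
    """n-th simple-function approximation of a value x = X(w) >= 0:
    X_n = n on {X >= n}, and k/2^n on {k/2^n <= X < (k+1)/2^n}."""
    if x >= n:
        return float(n)
    k = int(x * 2 ** n)  # floor to the dyadic grid
    return k / 2 ** n

x = 2.718  # an arbitrary value of X(w)
approx = [dyadic_approx(x, n) for n in range(1, 12)]

# The sequence is non-decreasing and converges to x from below.
assert all(a <= b for a, b in zip(approx, approx[1:]))
assert all(a <= x for a in approx)
assert abs(approx[-1] - x) < 2 ** -10
```

Each $X_n$ takes finitely many values, so it is simple, and refining the grid while raising the cap gives the monotone convergence $X_n\uparrow X$.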

5.2.1 Non-negative random variables

We now extend the definition of the expectation to a broader class of random variables. Let’s define $\mathcal{E}^+$ as the set of non-negative simple functions, and define $$\bar{\mathcal{E}}^+=\{X\ge 0:\ X:(\Omega,\mathcal{B})\to(\bar{\mathbb{R}},\mathcal{B}(\bar{\mathbb{R}}))\}$$ to be the set of non-negative and measurable functions with domain $\Omega$.

Given $X\in\bar{\mathcal{E}}^+$, by Theorem 5.1 we can find $X_n\in\mathcal{E}^+$ such that $0\le X_n\uparrow X$. Since the expectation operator preserves monotonicity, the sequence $E\{X_n\}$ is also non-decreasing. Thus, since the limit of a monotone sequence always exists (possibly equal to $+\infty$), we can define $$E\{X\}=\lim_{n\to\infty}E\{X_n\},$$ thus extending the definition of the expectation from $\mathcal{E}$ to $\bar{\mathcal{E}}^+$. Note that $E\{X\}=\infty$ is allowed; in particular, if $P(X=\infty)>0$ then $E\{X\}=\infty$.

5.2.2 Integrable random variables

Finally, we extend the definition of the expectation to all random variables such that $E\{|X|\}<\infty$. For any random variable $X$, let’s define $$X^+=\max(X,0)\quad\text{and}\quad X^-=\max(-X,0).$$ Therefore, $$X^+=X \text{ if } X\ge 0,\qquad X^-=-X \text{ if } X\le 0,$$ and in any case $X^+\ge 0$, $X^-\ge 0$ and $X=X^+-X^-$. Then, we define a new random variable $|X|=X^++X^-$, which is $\mathcal{B}/\mathcal{B}(\mathbb{R})$-measurable if both $X^+$ and $X^-$ are measurable. If at least one among $E\{X^+\}$ and $E\{X^-\}$ is finite, then we define $$E\{X\}=E\{X^+\}-E\{X^-\}$$ and we call $X$ quasi-integrable. Instead, if both $E\{X^+\}<\infty$ and $E\{X^-\}<\infty$, then $E\{|X|\}<\infty$ and we call $X$ integrable. In this case, we write $X\in L^1$, where $L^1$ stands for the set of integrable random variables with finite first moment, i.e. $$L^1=\{X:\Omega\to\mathbb{R}:\ X \text{ is a r.v.},\ E\{|X|\}<\infty\}. \tag{5.6}$$ In general, writing $X\in L^p$ means that $X$ belongs to the set of random variables with finite $p$-th moment, i.e. $$L^p=\{X:\Omega\to\mathbb{R}:\ X \text{ is a r.v.},\ E\{|X|^p\}<\infty\}. \tag{5.7}$$
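The positive-part/negative-part decomposition can be illustrated numerically (a sketch with simulated Gaussian draws; the seed and sample size are arbitrary, and sample means stand in for expectations):

```python
import random

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(10_000)]  # draws of a random variable X

x_pos = [max(x, 0.0) for x in xs]    # X+ = max(X, 0)
x_neg = [max(-x, 0.0) for x in xs]   # X- = max(-X, 0)

# Pointwise identities: X = X+ - X- and |X| = X+ + X-
assert all(abs(x - (p - n)) < 1e-12 for x, p, n in zip(xs, x_pos, x_neg))
assert all(abs(abs(x) - (p + n)) < 1e-12 for x, p, n in zip(xs, x_pos, x_neg))

# Sample analogue of E{X} = E{X+} - E{X-}
mean = lambda v: sum(v) / len(v)
assert abs(mean(xs) - (mean(x_pos) - mean(x_neg))) < 1e-12
```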

5.3 General definition

The expectation of a random variable $X$ is its first moment, also called the statistical average. In general, it is denoted as $E\{X\}$. Let’s consider a discrete random variable $X$ with distribution $P(X=x_j)$. Then, the expectation of $X$ is the weighted average of all the possible $m$ states that the random variable can assume, each weighted by its probability of occurrence, i.e. $$E\{X\}=\sum_{j=1}^{m}x_j\,P(X=x_j),$$ which is exactly the expectation of a simple function as in (5.5). In the continuous case, i.e. when $X$ takes values in $\mathbb{R}$ and admits a density function $f_X$, the expectation is computed as an integral, i.e. $$E\{X\}=\int_{-\infty}^{\infty}x\,dF_X=\int_{-\infty}^{\infty}x\,f_X(x)\,dx.$$
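Both formulas can be illustrated numerically (a sketch; the fair-die and exponential examples, the rate and the integration grid are invented for the illustration):

```python
import math

# Discrete case: expectation of a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
E_discrete = sum(x * p for x, p in zip(outcomes, probs))
assert abs(E_discrete - 3.5) < 1e-12

# Continuous case: E{X} for an Exponential(rate=2) density f(x) = 2 e^{-2x},
# approximated with a midpoint Riemann sum; the true value is 1/rate = 0.5.
rate, h = 2.0, 1e-4
E_continuous = sum(x * rate * math.exp(-rate * x) * h
                   for x in (k * h + h / 2 for k in range(int(20 / h))))
assert abs(E_continuous - 1 / rate) < 1e-3
```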

Definition 5.1 (Moments and central moments)
For any random variable $X$, let’s define the moment of order $p\ge 1$ as: $$m_p=E\{X^p\}. \tag{5.8}$$ Similarly, for $p\ge 2$ we define the central moment of order $p$ as: $$\mu_p=E\{(X-m_1)^p\}. \tag{5.9}$$

5.3.1 Variance and Covariance

In general, the variance of a random variable in the population is defined as the second central moment (5.9): $$V\{X\}=E\{(X-E\{X\})^2\}=\sigma^2. \tag{5.10}$$

Let’s consider a discrete random variable $X$ with distribution $P(X=x_j)=p_j$. Then the variance of $X$ is the weighted average of the $m$ possible squared deviations from the mean, each weighted by its probability of occurrence, i.e. $$V\{X\}=\sum_{j=1}^{m}(x_j-E\{X\})^2\,p_j.$$ In the continuous case, i.e. when $X$ admits a density function and takes values in $\mathbb{R}$, the variance is computed as: $$V\{X\}=\int_{-\infty}^{\infty}(x-E\{X\})^2\,f_X(x)\,dx.$$

Let’s consider two random variables $X$ and $Y$. Then, in general their covariance is defined as: $$Cv\{X,Y\}=E\{(X-E\{X\})(Y-E\{Y\})\}=\sigma_{XY}. \tag{5.11}$$

In the discrete case, where $X$ and $Y$ have a joint distribution $P(X=x_i,Y=y_j)=p_{ij}$, their covariance is defined as: $$Cv\{X,Y\}=\sum_{i=1}^{m}\sum_{j=1}^{s}(x_i-E\{X\})(y_j-E\{Y\})\,p_{ij}.$$ In the continuous case, if the joint distribution of $X$ and $Y$ admits a density function, the covariance is computed as: $$Cv\{X,Y\}=\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}(x-E\{X\})(y-E\{Y\})\,f_{X,Y}(x,y)\,dx\,dy.$$
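A small worked example of the discrete covariance formula (the $2\times 2$ joint pmf below is invented), cross-checked against the shortcut formula (5.16):

```python
# Covariance from a small joint pmf.
xs = [0, 1]
ys = [0, 1]
# p[i][j] = P(X = xs[i], Y = ys[j]); positively associated by construction.
p = [[0.4, 0.1],
     [0.1, 0.4]]

EX = sum(xs[i] * p[i][j] for i in range(2) for j in range(2))
EY = sum(ys[j] * p[i][j] for i in range(2) for j in range(2))
cov = sum((xs[i] - EX) * (ys[j] - EY) * p[i][j]
          for i in range(2) for j in range(2))

# Cross-check with the shortcut formula Cv{X,Y} = E{XY} - E{X}E{Y} (5.16)
EXY = sum(xs[i] * ys[j] * p[i][j] for i in range(2) for j in range(2))
assert abs(cov - (EXY - EX * EY)) < 1e-12
```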

Proposition 5.1 (Properties of the variance)
There are several properties connected to the variance.

  1. The variance can be computed in terms of the second and first moments of $X$, i.e. $$V\{X\}=E\{X^2\}-E\{X\}^2=m_2-(m_1)^2. \tag{5.12}$$

  2. The variance is invariant with respect to the addition of a constant $a$, i.e. $$V\{a+X\}=V\{X\}. \tag{5.13}$$

  3. The variance scales quadratically upon multiplication by a constant $a$, i.e. $$V\{aX\}=a^2V\{X\}. \tag{5.14}$$

  4. The variance of the sum of two correlated random variables is computed as: $$V\{X+Y\}=V\{X\}+V\{Y\}+2\,Cv\{X,Y\}. \tag{5.15}$$

  5. The covariance can be expressed as: $$Cv\{X,Y\}=E\{XY\}-E\{X\}E\{Y\}. \tag{5.16}$$

  6. The covariance scales upon multiplication by constants $a$ and $b$, i.e. $$Cv\{aX,bY\}=ab\,Cv\{X,Y\}. \tag{5.17}$$

Proof. Property 1. (5.12) follows easily by developing the definition of the variance, i.e. $$V\{X\}=E\{(X-E\{X\})^2\}=E\{X^2\}+E\{X\}^2-2E\{X\}^2=E\{X^2\}-E\{X\}^2.$$ Property 2. (5.13) follows from the definition, i.e. $$V\{a+X\}=E\{(a+X-E\{a+X\})^2\}=E\{(X-E\{X\})^2\}=V\{X\}.$$ Property 3. (5.14) follows using the expression of the variance in (5.12), i.e. $$V\{aX\}=E\{(aX)^2\}-E\{aX\}^2=a^2E\{X^2\}-a^2E\{X\}^2=a^2\big(E\{X^2\}-E\{X\}^2\big)=a^2V\{X\}.$$ For property 4. (5.15), the variance of the sum of two random variables is: $$\begin{aligned}V\{X+Y\}&=E\{(X+Y-E\{X+Y\})^2\}\\&=E\{([X-E\{X\}]+[Y-E\{Y\}])^2\}\\&=E\{(X-E\{X\})^2\}+E\{(Y-E\{Y\})^2\}+2E\{(X-E\{X\})(Y-E\{Y\})\}\\&=V\{X\}+V\{Y\}+2\,Cv\{X,Y\},\end{aligned}$$ where, in the case in which there is no linear relation between $X$ and $Y$, the covariance is zero, i.e. $Cv\{X,Y\}=0$. Developing the computation of the covariance it is possible to prove property 5. (5.16), i.e. $$\begin{aligned}Cv\{X,Y\}&=E\{(X-E\{X\})(Y-E\{Y\})\}\\&=E\{XY-XE\{Y\}-YE\{X\}+E\{X\}E\{Y\}\}\\&=E\{XY\}-2E\{X\}E\{Y\}+E\{X\}E\{Y\}\\&=E\{XY\}-E\{X\}E\{Y\}.\end{aligned}$$ Finally, using the result in property 5. (5.16), property 6. (5.17) follows easily: $$Cv\{aX,bY\}=E\{aXbY\}-E\{aX\}E\{bY\}=ab\,E\{XY\}-ab\,E\{X\}E\{Y\}=ab\,Cv\{X,Y\}.$$
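These identities also hold exactly for sample moments, which gives a quick numerical check (a sketch with simulated data; the seed, constants and sample size are arbitrary, and sample variance/covariance stand in for the population quantities):

```python
import random

random.seed(1)
xs = [random.gauss(0, 1) for _ in range(10_000)]
ys = [x + random.gauss(0, 1) for x in xs]  # correlated with X by construction

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((t - m) ** 2 for t in v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

a, b = 2.0, -3.0
# Sample analogues of (5.13), (5.14), (5.15) and (5.17), up to rounding error
assert abs(var([a + x for x in xs]) - var(xs)) < 1e-9
assert abs(var([a * x for x in xs]) - a ** 2 * var(xs)) < 1e-9
assert abs(var([x + y for x, y in zip(xs, ys)])
           - (var(xs) + var(ys) + 2 * cov(xs, ys))) < 1e-6
assert abs(cov([a * x for x in xs], [b * y for y in ys]) - a * b * cov(xs, ys)) < 1e-6
```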

5.3.2 Skewness

The skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined. For a unimodal distribution, negative skew commonly indicates that the tail is on the left side of the distribution, and positive skew indicates that the tail is on the right.

Figure 5.1: A left (red), symmetric (black) and right (green) skewed density functions.

Following the same notation as in Ralph B. D’Agostino and Jr. (), let’s define and denote the population skewness of a random variable $X$ as: $$Sk\{X\}=\sqrt{\beta_1(X)}=E\Bigg\{\Bigg(\frac{X-E\{X\}}{\sqrt{V\{X\}}}\Bigg)^3\Bigg\}=\frac{\mu_3}{\sigma^3}. \tag{5.18}$$

5.3.3 Kurtosis

The kurtosis is a measure of the tailedness of the probability distribution of a real-valued random variable. The standard measure of a distribution’s kurtosis, originating with Karl Pearson, is a scaled version of the fourth moment of the distribution. This number is related to the tails of the distribution: higher kurtosis corresponds to greater extremity of deviations from the mean (outliers). In general, it is common to compare the excess kurtosis of a distribution with respect to the normal distribution (with kurtosis equal to 3). It is possible to distinguish three cases:

  1. A distribution with negative excess kurtosis is called platykurtic and produces fewer outliers than the normal distribution.
  2. A distribution with zero excess kurtosis is called mesokurtic and produces outliers comparably to the normal distribution.
  3. A distribution with positive excess kurtosis is called leptokurtic and produces more outliers than the normal distribution.
Figure 5.2: A mesokurtic (black) distribution and different leptokurtic (red, green, blue) densities.

Let’s define and denote the kurtosis of a random variable $X$ as: $$Kt\{X\}=\beta_2(X)=E\Bigg\{\Bigg(\frac{X-E\{X\}}{\sqrt{V\{X\}}}\Bigg)^4\Bigg\}=\frac{\mu_4}{\sigma^4}, \tag{5.19}$$ or equivalently the excess kurtosis as $Kt\{X\}-3$.
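As a worked example (the Bernoulli parameter is arbitrary), the moment ratios (5.18) and (5.19) can be computed directly from a pmf and compared with the known closed forms for a Bernoulli($p$) variable, $(1-2p)/\sqrt{pq}$ and $3+(1-6pq)/(pq)$:

```python
# Skewness and kurtosis of a Bernoulli(p) variable, computed from the pmf.
p = 0.2
support = [(0, 1 - p), (1, p)]  # (value, probability) pairs

E = sum(x * w for x, w in support)                        # first moment
mu = lambda k: sum((x - E) ** k * w for x, w in support)  # k-th central moment
sigma = mu(2) ** 0.5

skew = mu(3) / sigma ** 3   # (5.18)
kurt = mu(4) / sigma ** 4   # (5.19)

q = 1 - p
assert abs(skew - (1 - 2 * p) / (p * q) ** 0.5) < 1e-12
assert abs(kurt - (3 + (1 - 6 * p * q) / (p * q))) < 1e-12
```

For $p<1/2$ the distribution is right-skewed (here $skew=1.5$) and leptokurtic (here the excess kurtosis is $0.25$).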

5.4 Review of inequalities

Definition 5.2 (Markov Inequality)
Let’s consider a random variable $X\in L^1$ (5.6); then, for all $\lambda>0$, Markov’s inequality states that $$P(|X|\ge\lambda)\le\frac{E\{|X|\}}{\lambda}.$$ Hence, this inequality produces an upper bound for the tail probability $P(|X|\ge\lambda)$ using only the first moment of $X$.
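A quick Monte Carlo illustration of the bound (a sketch; the exponential distribution, seed and threshold are invented for the example, and sample means stand in for expectations):

```python
import random

random.seed(2)
xs = [random.expovariate(1.0) for _ in range(100_000)]  # X >= 0, so |X| = X; E{X} = 1

lam = 3.0
tail = sum(x >= lam for x in xs) / len(xs)   # sample P(X >= lambda)
bound = (sum(xs) / len(xs)) / lam            # sample E{X} / lambda

# Markov: the true tail is e^-3 ≈ 0.0498, well below the bound ≈ 1/3
assert tail <= bound
```

The bound is crude but requires nothing beyond integrability, which is what makes it useful.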

Definition 5.3 (Chebychev Inequality)
Let’s consider a random variable $X\in L^2$ (5.7), i.e. with first and second moment finite, $E\{|X|\}<\infty$ and $E\{X^2\}<\infty$; then, for all $\lambda>0$, the Chebychev inequality states that $$P(|X|\ge\lambda)\le\frac{1}{\lambda^2}E\{X^2\}. \tag{5.20}$$ As with Markov’s inequality, this one also produces an upper bound for the tail probability $P(|X|\ge\lambda)$, but using the second moment of $X$.

Definition 5.4 (Modulus Inequality)
Let’s consider $X\in L^1$ (5.6); then the modulus inequality states that: $$|E\{X\}|\le E\{|X|\}.$$

Definition 5.5 (Holder Inequality)
Let’s consider two numbers $p$ and $q$ such that $$p>1,\quad q>1,\quad \frac{1}{p}+\frac{1}{q}=1,$$ and let’s consider two random variables $X$ and $Y$ such that $E\{|X|^p\}<\infty$ and $E\{|Y|^q\}<\infty$. Then, $$|E\{XY\}|\le E\{|XY|\}\le \big(E\{|X|^p\}\big)^{1/p}\,\big(E\{|Y|^q\}\big)^{1/q}. \tag{5.21}$$

Definition 5.6 (Schwartz Inequality)
Consider two random variables $X,Y\in L^2$, i.e. with finite second moments, $E\{X^2\}<\infty$ and $E\{Y^2\}<\infty$. Then $$|E\{XY\}|\le E\{|XY|\}\le\sqrt{E\{X^2\}\,E\{Y^2\}}. \tag{5.22}$$ Note that this is the special case of the Holder inequality (5.21) with $p=q=2$.
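Note that the sample analogue of (5.22) holds exactly, since it is just the Cauchy-Schwarz inequality for vectors; a quick check on simulated pairs (the distributions, seed and sample size are invented for the example):

```python
import random

random.seed(3)
pairs = [(random.gauss(0, 1), random.gauss(0, 2)) for _ in range(20_000)]

n = len(pairs)
E_XY = sum(x * y for x, y in pairs) / n
E_absXY = sum(abs(x * y) for x, y in pairs) / n
E_X2 = sum(x * x for x, _ in pairs) / n
E_Y2 = sum(y * y for _, y in pairs) / n

# |E{XY}| <= E{|XY|} <= sqrt(E{X^2} E{Y^2}), as in (5.22)
assert abs(E_XY) <= E_absXY <= (E_X2 * E_Y2) ** 0.5
```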

Definition 5.7 (Jensen Inequality)
Let’s consider a convex function $u:\mathbb{R}\to\mathbb{R}$. Suppose that $E\{|X|\}<\infty$ and $E\{|u(X)|\}<\infty$; then $$E\{u(X)\}\ge u(E\{X\}). \tag{5.23}$$ If $u$ is concave, the inequality reverses, i.e. $$E\{u(X)\}\le u(E\{X\}). \tag{5.24}$$
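A worked check of the convex case with $u(x)=x^2$ (the finite distribution below is invented); for this choice of $u$, the gap $E\{u(X)\}-u(E\{X\})$ is exactly the variance of $X$:

```python
# Jensen's inequality on a finite distribution with the convex u(x) = x^2.
support = [(-1.0, 0.3), (0.0, 0.2), (2.0, 0.5)]  # (value, probability) pairs

u = lambda x: x * x
E_X = sum(x * w for x, w in support)
E_uX = sum(u(x) * w for x, w in support)

# Convex u: E{u(X)} >= u(E{X}); here 2.3 >= 0.49, and the gap is V{X}
assert E_uX >= u(E_X)
```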