# Probability


## Axioms of Probability

We start with the axioms of probability. The reason is that everything else can be derived from the axioms, so it’s important to know the basics well.

The sample space $\Omega$ is a non-empty set of outcomes, and the event space $\mathcal{E}$ is a set containing some subsets of $\Omega$, satisfying:

1. $A \in \mathcal{E} \quad \Rightarrow \quad A^{c} \in \mathcal{E}$
2. $A_{1}, A_{2}, \ldots \in \mathcal{E} \quad \Rightarrow \quad \bigcup_{i=1}^{\infty} A_{i} \in \mathcal{E}$
3. $\mathcal{E} \text { is non-empty }$.

If $\mathcal{E}$ satisfies these properties, then $(\Omega, \mathcal{E})$ is said to be a measurable space.

Using measure theory (which we will gladly skip over), a function $P: \mathcal{E} \to [0, 1]$ satisfies the axioms of probability if

1. $P(\Omega)=1$
2. $A_{1}, A_{2}, \ldots \in \mathcal{E},\ A_{i} \cap A_{j}=\varnothing\ \forall i \neq j \quad \Rightarrow \quad P\left(\bigcup_{i=1}^{\infty} A_{i}\right)=\sum_{i=1}^{\infty} P\left(A_{i}\right)$

$P$ is called a probability distribution. The tuple $(\Omega, \mathcal{E}, P)$ is called a probability space.

One can prove a bunch of facts using the axioms such as

$P(A) = 1 - P(A^c)$

$P(A \cup B) = P(A) + P(B) \quad (A, B \text{ mutually exclusive})$

$P(A \cap B) = P(A)P(B) \quad (A, B \text{ independent})$

We will get to the generalizations of these rules soon!

## Probability distributions

Almost every time we will pick $\mathcal{E} = \mathcal{P}(\Omega)$, where $\mathcal{P}$ denotes the power set.

### Probability mass functions

For $\Omega$ a discrete sample space and $\mathcal{E}=\mathcal{P}(\Omega)$, a distribution then only needs to satisfy

1. $p: \Omega \rightarrow[0,1]$

2. $\sum_{\omega \in \Omega} p(\omega)=1$

The probability of any event $A \in \mathcal{E}$ is then:

$P(A)=\sum_{\omega \in A} p(\omega)$

Such a probability distribution is called a probability mass function.

Examples of probability mass functions:

• Bernoulli distribution: $\Omega=\{S, F\} \quad \alpha \in(0,1)$

$p(\omega)=\left\{\begin{array}{ll}{\alpha} & {\omega=S} \\ {1-\alpha} & {\omega=F}\end{array}\right.$

or alternatively if we pick $\Omega=\{0,1\}$,

$p(k)=\alpha^{k} \cdot(1-\alpha)^{1-k} \quad \forall k \in \Omega$

• Poisson distribution: $\Omega=\{0,1, \ldots\}, \quad \lambda \in(0, \infty)$

$p(k)=\frac{\lambda^{k} e^{-\lambda}}{k !} \quad \forall k \in \Omega$
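As a quick sanity check, both pmfs above can be verified to sum to 1 numerically; a minimal Python sketch (the parameter values $\alpha = 0.3$ and $\lambda = 4$ are illustrative):

```python
import math

# Bernoulli pmf on Omega = {0, 1} with parameter alpha
alpha = 0.3
bernoulli = {k: alpha**k * (1 - alpha)**(1 - k) for k in (0, 1)}
print(sum(bernoulli.values()))  # 1.0

# Poisson pmf on Omega = {0, 1, ...}; truncate the infinite sum at a large k
lam = 4.0
poisson = [lam**k * math.exp(-lam) / math.factorial(k) for k in range(100)]
print(sum(poisson))  # ~1.0 (the tail beyond k = 100 is negligible)
```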

### Probability density functions

For $\Omega$ a continuous sample space and $\mathcal{E}=\mathcal{B}(\Omega)$ (where $\mathcal{B}$ is the Borel $\sigma$-algebra, which we will not get into, but which you can think of as the continuous analogue of the power set, defined so that everything works), a distribution then only needs to satisfy

1. $p: \Omega \rightarrow[0, \infty)$

2. $\int_{\Omega} p(\omega) d \omega=1$

The probability of any event $A \in \mathcal{E}$ is then:

$P(A)=\int_{A} p(\omega) d \omega$

Such a probability distribution is called a probability density function.

For a probability mass function on a discrete sample space $\Omega$, we can always talk about any singleton event $\{\omega\} \in \mathcal{E}$, $\omega \in \Omega$:

$P(\{\omega\})=p(\omega)$

But for $\Omega$ a continuous space, this makes no sense, as singleton sets have measure zero. Hence, when working with probability density functions, we should instead ask about the probability of an event such as an interval.

For example, say the stopping time of a car lies in the interval $[3,15]$. What is the probability of seeing a stopping time of exactly 3.141596? It is zero: a single point carries no mass in $[3,15]$. It's much more reasonable to ask for the probability of stopping between 3 and 3.5 seconds.

For a probability density function, the continuous analogue takes $\Omega$ a continuous sample space and $A=[x, x+\Delta x]$; for small $\Delta x$,

\begin{aligned} P(A) &=\int_{x}^{x+\Delta x} p(\omega) d \omega \\ & \approx p(x) \Delta x \end{aligned}

Examples of probability density functions:

• Uniform distribution: $\Omega=[a, b]$

$p(\omega)=\frac{1}{b-a} \quad \forall \omega \in[a, b]$

• Gaussian distribution: $\Omega=\mathbb{R} \quad \mu \in \mathbb{R}, \sigma \in \mathbb{R}^{+}$

$p(\omega)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{1}{2 \sigma^{2}}(\omega-\mu)^{2}} \quad \forall \omega \in \mathbb{R}$

• Exponential distribution: $\Omega=[0, \infty) \quad \lambda>0$

$p(\omega)=\lambda e^{-\lambda \omega} \quad \forall \omega \geq 0$
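The stopping-time question above can be answered concretely with the exponential density; a short Python sketch (the rate $\lambda = 0.2$ and the interval are illustrative), comparing a Riemann sum of $p(\omega) \, \Delta x$ over $[3, 3.5]$ with the closed-form integral $e^{-\lambda a}-e^{-\lambda b}$:

```python
import math

lam = 0.2
a, b = 3.0, 3.5

# Closed form: integral of lam * exp(-lam * w) over [a, b]
exact = math.exp(-lam * a) - math.exp(-lam * b)

# Riemann-sum approximation: P(A) ~ sum of p(x) * dx over small slices
n = 100_000
dx = (b - a) / n
approx = sum(lam * math.exp(-lam * (a + i * dx)) * dx for i in range(n))

print(exact, approx)  # the two agree to several decimal places
# As dx -> 0, the mass p(x) * dx of any single point goes to zero.
```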

## Random Variables

The confusing thing about random variables is that they are neither random nor a variable.

A random variable is a function satisfying:

1. $X: \Omega \rightarrow \Omega_{X}$.
2. $\forall A \in \mathcal{B}\left(\Omega_{X}\right) \text { it holds that }\{\omega: X(\omega) \in A\} \in \mathcal{E}$.

To give an example, let’s say $\Omega$ is a set of people. Suppose we want to compute the probability that a randomly selected person $\omega \in \Omega$ has a cold.

Define $A=\{\omega \in \Omega: \text { Disease }(\omega)=\text{cold}\}$, so that $P(A) = P(\text{Disease} = \text{cold})$. Here $\text{Disease}$ is our new random variable: a function that maps one sample space $\Omega = \text {set of people}$ to another sample space $\Omega_X = \{\text { cold, not cold }\}$, which I’ll call the target space.

Essentially, the main point of a random variable is that it transforms one sample space into another. In practice, instead of referring to the target space by name, we conflate it with the random variable itself; that is, we sometimes refer to $\Omega_X$ as $X$.

### Multiple random variables

When we have two or more random variables, we have three extra distributions. Suppose the sample spaces are $\mathcal{X}, \mathcal{Y}$.

The first of which is the joint distribution, or multivariate distribution. A random variable becomes a random vector, $\boldsymbol{X}=\left(X_{1}, X_{2}, \ldots, X_{d}\right)$ with vector-valued outcomes $\boldsymbol{x}=\left(x_{1}, x_{2}, \ldots, x_{d}\right)$, and each $x_i$ comes from $\mathcal{X}_{i}$.

To make things simple, for two variables, there is a $P$ such that

$p(x, y) \stackrel{\text { def }}{=} P(X=x, Y=y)$

and

$\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y)=1$

It is easy to generalize the above. The multivariate joint distribution must satisfy:

1. $p: \mathcal{X}_{1} \times \mathcal{X}_{2} \times \ldots \times \mathcal{X}_{d} \rightarrow[0,1]$

2. For discrete variables:

$\sum_{x_{1} \in \mathcal{X}_{1}} \sum_{x_{2} \in \mathcal{X}_{2}} \cdots \sum_{x_{d} \in \mathcal{X}_{d}} p\left(x_{1}, x_{2}, \ldots, x_{d}\right)=1$

For continuous variables:

$\int_{\mathcal{X}_{1}} \int_{\mathcal{X}_{2}} \cdots \int_{\mathcal{X}_{d}} p\left(x_{1}, x_{2}, \ldots, x_{d}\right) d x_{1} d x_{2} \ldots d x_{d}=1$

If we know the joint distribution, we also get two more distributions for free. The next distribution is the marginal distribution. It is defined for a subset of $\boldsymbol{X}=\left(X_{1}, X_{2}, \ldots, X_{d}\right)$ by summing or integrating out the remaining variables.

For the discrete case:

$p(x) \stackrel{\text { def }}{=} \sum_{y \in \mathcal{Y}}p(x, y)$

For the continuous case:

$p(x) \stackrel{\text { def }}{=} \int_{\mathcal{Y}}p(x, y) \, d y$

This is how $p(x)$ and $p(y)$ are related to $p(x, y)$: they are the marginal distributions of the joint distribution $p(x, y)$.

We can easily extend this to more than two variables. For the discrete case:

$p\left(x_{i}\right) \stackrel{\text { def }}{=} \sum_{x_{1} \in \mathcal{X}_{1}} \cdots \sum_{x_{i-1} \in \mathcal{X}_{i-1}} \sum_{x_{i+1} \in \mathcal{X}_{i+1}} \cdots \sum_{x_{d} \in \mathcal{X}_{d}} p\left(x_{1}, \ldots, x_{i-1}, x_{i}, x_{i+1}, \ldots, x_{d}\right)$

For the continuous case:

$p\left(x_{i}\right) \stackrel{\text { def }}{=} \int_{\mathcal{X}_{1}} \cdots \int_{\mathcal{X}_{i-1}} \int_{\mathcal{X}_{i+1}} \cdots \int_{\mathcal{X}_{d}} p\left(x_{1}, \ldots, x_{i-1}, x_{i}, x_{i+1}, \ldots, x_{d}\right) d x_{1} \ldots d x_{i-1} d x_{i+1} \ldots d x_{d}$
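Marginalization is easy to see on a small discrete joint table; a Python/NumPy sketch with made-up probabilities (the table entries are illustrative and sum to 1):

```python
import numpy as np

# A made-up joint pmf p(x, y) on a 2x3 grid; all entries sum to 1
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
print(p_xy.sum())  # 1.0, as the axioms require

# Marginals: sum out the other variable
p_x = p_xy.sum(axis=1)  # p(x) = sum over y of p(x, y)
p_y = p_xy.sum(axis=0)  # p(y) = sum over x of p(x, y)
print(p_x)  # [0.4 0.6]
print(p_y)  # [0.35 0.35 0.3]
```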

### Conditional distributions

To make things simple, let’s say we are dealing with two random variables only. From the joint distribution $p(x, y)$ we can define the conditional distribution:

$p(y | x) \stackrel{\text { def }}{=} \frac{p(x, y)}{p(x)}$

As before, to obtain the probability of an event we sum or integrate the conditional distribution.

$P(Y \in A | X=x)=\left\{\begin{array}{ll}{\sum_{y \in A} p(y | x)} & {Y: \text { discrete }} \\ {\int_{A} p(y | x) d y} & {Y: \text { continuous }}\end{array}\right.$

Rearranging the definition of the conditional distribution gives

$p(x, y)=p(x | y) p(y)=p(y | x) p(x)$

which is known as the product rule. The generalization to multiple variables is straightforward:

$p\left(x_{1}, \ldots, x_{d}\right)=p\left(x_{d} | x_{1}, \ldots, x_{d-1}\right) p\left(x_{1}, \ldots, x_{d-1}\right)$

and recursively applying the product rule, we obtain:

\begin{aligned} p\left(x_{1}, \ldots, x_{d}\right) &=p\left(x_{d} | x_{1}, \ldots, x_{d-1}\right) p\left(x_{1}, \ldots, x_{d-1}\right) \\ &=p\left(x_{d} | x_{1}, \ldots, x_{d-1}\right) p\left(x_{d-1} | x_{1}, \ldots, x_{d-2}\right) p\left(x_{1}, \ldots, x_{d-2}\right) \\ & \vdots \\ &=p\left(x_{d} | x_{1}, \ldots, x_{d-1}\right) p\left(x_{d-1} | x_{1}, \ldots, x_{d-2}\right) \ldots p\left(x_{2} | x_{1}\right) p\left(x_{1}\right) \end{aligned}

which can be written in the more compact form

$p\left(x_{1}, \ldots, x_{d}\right)=p\left(x_{1}\right) \prod_{i=2}^{d} p\left(x_{i} | x_{1}, \ldots, x_{i-1}\right)$

known as the general product rule.

Now, using the product rule in both directions,

$p(x, y)=p(x | y) p(y)=p(y | x) p(x)$

and dividing by $p(y)$, we obtain Bayes' rule:

$p(x | y)=\frac{p(y | x) p(x)}{p(y)}$
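Bayes' rule can be checked numerically on a small made-up joint table; a NumPy sketch (the table values are illustrative):

```python
import numpy as np

# A made-up joint pmf p(x, y) with entries summing to 1
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
p_x = p_xy.sum(axis=1)  # marginal p(x)
p_y = p_xy.sum(axis=0)  # marginal p(y)

# Conditionals from the definition: p(y|x) = p(x, y) / p(x), etc.
p_y_given_x = p_xy / p_x[:, None]
p_x_given_y = p_xy / p_y[None, :]
print(p_y_given_x.sum(axis=1))  # each row sums to 1, as a distribution must

# Bayes' rule: p(x|y) = p(y|x) p(x) / p(y)
bayes = p_y_given_x * p_x[:, None] / p_y[None, :]
print(np.allclose(bayes, p_x_given_y))  # True
```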

### Independent random variables

The variables are independent if the joint distribution can be written as the product of the marginal distributions:

$p(x, y) = p(x)p(y)$

The intuition is that if

$p(x | y)=p(x)$

then the value of $Y$ does not affect the distribution of $X$, so the variables are independent.

For more than two mutually independent variables, the distribution factors neatly into the components:

$p\left(x_{1}, x_{2}, \ldots, x_{d}\right)=p\left(x_{1}\right) p\left(x_{2}\right) \ldots p\left(x_{d}\right)$

There is one more form of independence. Two variables are conditionally independent if in the presence of a third variable, they are independent:

$p(x, y | z)=p(x | z) p(y | z)$

The two forms of independence are unrelated: neither one implies the other.

## Expected value

The expected value or mean of a random variable $X$ is the long-run average of repeated samples of $X$. It’s defined as:

$\mathbb{E}[X] \stackrel{\text { def }}{=}\left\{\begin{array}{ll}{\sum_{x \in \mathcal{X}} x p(x)} & {\text { if } X \text { is discrete }} \\ {\int_{\mathcal{X}} x p(x) d x} & {\text { if } X \text { is continuous }}\end{array}\right.$

The expected value of $f(X)$ is then:

$\mathbb{E}[f(X)]=\left\{\begin{array}{ll}{\sum_{x \in \mathcal{X}} f(x) p(x)} & {\text { if } X \text { is discrete }} \\ {\int_{\mathcal{X}} f(x) p(x) d x} & {\text { if } X \text { is continuous }}\end{array}\right.$

Conditional expectation is very similar. Since $X=x$ is fixed, we have

$\mathbb{E}[Y | X=x]=\left\{\begin{array}{ll}{\sum_{y \in \mathcal{Y}} y p(y | x)} & {Y: \text { discrete }} \\ {\int_{\mathcal{Y}} y p(y | x) d y} & {Y: \text { continuous }}\end{array}\right.$

For multivariate distributions:

$\mathbb{E}[\boldsymbol{X}]=\left\{\begin{array}{ll}{\sum_{\boldsymbol{x} \in \mathcal{X}} \boldsymbol{x} p(\boldsymbol{x})} & {\boldsymbol{X}: \text { discrete }} \\ {\int_{\mathcal{X}} \boldsymbol{x} p(\boldsymbol{x}) d \boldsymbol{x}} & {\boldsymbol{X}: \text { continuous }}\end{array}\right.$

One useful application is called the variance. If we pick $f(X) = (X - \mathbb{E}[X])^{2}$, then we have

\begin{aligned} \operatorname{Var}(X) &=\mathbb{E}\left[(X-\mathbb{E}[X])^{2}\right] \\ &=\mathbb{E}\left[X^{2}-2 X \mathbb{E}[X]+\mathbb{E}[X]^{2}\right] \\ &=\mathbb{E}\left[X^{2}\right]-2 \mathbb{E}[X] \mathbb{E}[X]+\mathbb{E}[X]^{2} \\ &=\mathbb{E}\left[X^{2}\right]-\mathbb{E}[X]^{2} \end{aligned}
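The shortcut $\operatorname{Var}(X)=\mathbb{E}\left[X^{2}\right]-\mathbb{E}[X]^{2}$ can be checked on samples; a NumPy sketch (the distribution, its parameters, and the sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)  # true variance is 9

# Var(X) = E[(X - E[X])^2], estimated by a sample average
mean = x.mean()
var_def = ((x - mean) ** 2).mean()

# Var(X) = E[X^2] - E[X]^2, the algebraic shortcut
var_shortcut = (x ** 2).mean() - mean ** 2

print(var_def, var_shortcut)  # both close to 9, and equal to each other
```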

Then, for multivariate distributions, we can define a multivariate variance called the covariance:

\begin{aligned} \operatorname{Cov}[X, Y] &=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])] \\ &=\mathbb{E}[X Y]-\mathbb{E}[X] \mathbb{E}[Y] \end{aligned}

We can use this to measure how correlated two variables are,

$\operatorname{Corr}[X, Y]=\frac{\operatorname{Cov}[X, Y]}{\sqrt{\operatorname{Var}[X] \cdot \operatorname{Var}[Y]}}$

Lastly, we define covariance for more than two variables. If $\boldsymbol{X}=\left[X_{1}, \ldots, X_{d}\right]$, then

\begin{aligned} \Sigma_{i j} &=\operatorname{Cov}\left[X_{i}, X_{j}\right] \\ &=\mathbb{E}\left[\left(X_{i}-\mathbb{E}\left[X_{i}\right]\right)\left(X_{j}-\mathbb{E}\left[X_{j}\right]\right)\right] \end{aligned}

Using matrix notation, this becomes

\begin{aligned} \boldsymbol{\Sigma} &=\operatorname{Cov}[\boldsymbol{X}, \boldsymbol{X}] \in \mathbb{R}^{d \times d} \\ &=\mathbb{E}\left[(\boldsymbol{X}-\mathbb{E}[\boldsymbol{X}])(\boldsymbol{X}-\mathbb{E}[\boldsymbol{X}])^{\top}\right] \\ &=\mathbb{E}\left[\boldsymbol{X} \boldsymbol{X}^{\top}\right]-\mathbb{E}[\boldsymbol{X}] \mathbb{E}[\boldsymbol{X}]^{\top} \end{aligned}

Note that here we are using the outer product. The dot product or inner product is defined by:

$\mathbf{x}^{\top}\mathbf{y}= \begin{bmatrix}x_1 & x_2 & \ldots & x_d \end{bmatrix}\begin{bmatrix}y_1 \\ y_2 \\ \vdots \\y_d \end{bmatrix}=\sum_{i=1}^{d} x_{i} y_{i}$

Then the outer product is

$\mathbf{x y}^{\top}=\left[\begin{array}{cccc}{x_{1} y_{1}} & {x_{1} y_{2}} & {\dots} & {x_{1} y_{d}} \\ {x_{2} y_{1}} & {x_{2} y_{2}} & {\dots} & {x_{2} y_{d}} \\ {\vdots} & {\vdots} & {} & {\vdots} \\ {x_{d} y_{1}} & {x_{d} y_{2}} & {\dots} & {x_{d} y_{d}}\end{array}\right]$

As an example, suppose we have only two random variables $X$ and $Y$. What is the covariance matrix?

$\boldsymbol{\Sigma} = \operatorname{Cov}[\boldsymbol{X}, \boldsymbol{X}] = \mathbb{E}\left[\boldsymbol{X} \boldsymbol{X}^{\top}\right]-\mathbb{E}[\boldsymbol{X}] \mathbb{E}[\boldsymbol{X}]^{\top}$

$= \mathbb{E}\left[\begin{array}{cc}{X_{1}^{2}} & {X_{1} X_{2}} \\ {X_{2} X_{1}} & {X_{2}^{2}}\end{array}\right]-\left[\begin{array}{cc}{\mathbb{E}\left[X_{1}\right]^{2}} & {\mathbb{E}\left[X_{1}\right] \mathbb{E}\left[X_{2}\right]} \\ {\mathbb{E}\left[X_{2}\right] \mathbb{E}\left[X_{1}\right]} & {\mathbb{E}\left[X_{2}\right]^{2}}\end{array}\right]$
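The identity $\boldsymbol{\Sigma}=\mathbb{E}\left[\boldsymbol{X} \boldsymbol{X}^{\top}\right]-\mathbb{E}[\boldsymbol{X}] \mathbb{E}[\boldsymbol{X}]^{\top}$ can be checked against NumPy's built-in covariance estimator; a sketch with made-up correlated samples:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100_000, 2))  # two independent standard-normal columns
X[:, 1] += 0.5 * X[:, 0]           # introduce correlation between the columns

# Sigma = E[X X^T] - E[X] E[X]^T, estimated with sample averages
mu = X.mean(axis=0)
sigma = (X.T @ X) / len(X) - np.outer(mu, mu)

# Agrees with NumPy's estimator (bias=True normalizes by N, matching our average)
print(sigma)
print(np.cov(X.T, bias=True))
```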

Finally, we conclude with some useful properties of expectation values and variances. Note that $\mathrm{V}[\mathbf{X}]$ is short for $\operatorname{Cov}[\boldsymbol{X}, \boldsymbol{X}]$.

1. $\mathbb{E}[c \boldsymbol{X}]=c \mathbb{E}[\boldsymbol{X}]$

2. $\mathbb{E}[\boldsymbol{X}+\boldsymbol{Y}]=\mathbb{E}[\boldsymbol{X}]+\mathbb{E}[\boldsymbol{Y}]$

3. $\mathrm{V}[c]=0$ The variance of a constant is zero

4. $\mathrm{V}[\boldsymbol{X}] \succeq 0$ (the matrix is positive semi-definite)

5. $\mathrm{V}[c \boldsymbol{X}]=c^{2} \mathrm{V}[\boldsymbol{X}]$

6. $\operatorname{Cov}[\boldsymbol{X}, \boldsymbol{Y}]=\mathbb{E}\left[(\boldsymbol{X}-\mathbb{E}[\boldsymbol{X}])(\boldsymbol{Y}-\mathbb{E}[\boldsymbol{Y}])^{\top}\right]=\mathbb{E}\left[\boldsymbol{X} \boldsymbol{Y}^{\top}\right]-\mathbb{E}[\boldsymbol{X}] \mathbb{E}[\boldsymbol{Y}]^{\top}$

7. $\mathrm{V}[\boldsymbol{X}+\boldsymbol{Y}]=\mathrm{V}[\boldsymbol{X}]+\mathrm{V}[\boldsymbol{Y}]+\operatorname{Cov}[\boldsymbol{X}, \boldsymbol{Y}]+\operatorname{Cov}[\boldsymbol{Y}, \boldsymbol{X}]$ (for scalar $X, Y$ the last two terms are both $\operatorname{Cov}[X, Y]$)

## Multivariate distributions

The most important multivariate distribution is the multivariate Gaussian distribution $\mathcal{N}(\boldsymbol{\omega} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$, sometimes written as $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, with $\Omega = \mathbb{R}^d$.

$p(\boldsymbol{\omega}) \stackrel{\text { def }}{=} \frac{1}{\sqrt{(2 \pi)^{d}|\mathbf{\Sigma}|}} \exp \left(-\frac{1}{2}(\boldsymbol{\omega}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1}(\boldsymbol{\omega}-\boldsymbol{\mu})\right)$

$|\boldsymbol{\Sigma}|$ is the determinant of the covariance matrix, which by a theorem in linear algebra is equal to the product of all the eigenvalues of $\boldsymbol{\Sigma}$.
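A NumPy sketch of evaluating this density at a point, which also checks the determinant-equals-product-of-eigenvalues fact (the $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, and evaluation point are illustrative):

```python
import numpy as np

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
w = np.array([0.5, 0.5])

# Density formula: exp(-0.5 (w - mu)^T Sigma^{-1} (w - mu)) / sqrt((2 pi)^d |Sigma|)
d = len(mu)
det = np.linalg.det(Sigma)
diff = w - mu
density = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) \
          / np.sqrt((2 * np.pi) ** d * det)
print(density)  # a positive number; densities can exceed 1, probabilities cannot

# |Sigma| equals the product of the eigenvalues of Sigma
eigvals = np.linalg.eigvalsh(Sigma)
print(det, np.prod(eigvals))  # the two agree
```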