Probability and Statistics Refresher

Introduction

We’ll review the essentials of probability and statistics that we’ll need for this course.

Motivation

Why do we need knowledge of probability and statistics as a data scientist?

  1. Data Analysis: Helps in understanding and interpreting data. Statistical methods allow you to summarize data (e.g., with the mean and variance) and identify patterns.
  2. Model Building: Many machine learning algorithms are based on statistical principles. For example, linear regression and logistic regression rely on probability and statistics.
  3. Uncertainty Quantification: Probability helps in quantifying uncertainty. This is crucial in making predictions and decisions based on data, as it allows you to estimate the likelihood of different outcomes.
  4. Hypothesis Testing: Statistics provide tools for testing hypotheses and validating models. This is essential for determining whether the patterns observed in data are significant or just due to random chance.

Combined with linear algebra, probability and statistics provide the theoretical foundation and practical tools needed to extract meaningful insights from data and make data-driven decisions.

Probability

Probability

We all have a general sense of what is meant by probability. Probability is the study of randomness. It is the branch of mathematics that analyzes the chances (likelihood) of random events.

There are two different ways in which probability is interpreted. These views are

  • frequentist, and
  • Bayesian.

The Frequentist View

The frequentist view of probability requires the definition of several concepts.

  • Experiment: a situation in which the outcomes occur randomly.

    Example: Driving to work, a commuter passes through a sequence of three intersections with traffic lights. At each light she either stops \(s\) or continues \(c\).

  • Sample space: the set of all possible outcomes of the experiment.

    Example: \(\{ccc, ccs, csc, css, scc, ssc, scs, sss\},\) where \(csc\), for example, denotes the outcome that the commuter continues through the first light, stops at the second light, and continues through the third light.

  • Event: a subset of the sample space.

    Example: continuing through the first light (i.e., \(\{ccc, ccs, csc, css\}\)).

The frequentist view is summarized by this quote from the first pages of Y. A. Rozanov, Probability Theory: A Concise Course, 1969:

Suppose an experiment under consideration can be repeated any number of times, so that, in principle at least, we can produce a whole series of “independent trials under identical conditions” in each of which, depending on chance, a particular event \(A\) of interest either occurs or does not occur.

Let \(n\) be the total number of experiments in the whole series of trials, and let \(n(A)\) be the number of experiments in which \(A\) occurs. Then the ratio \(n(A)/n\) is called the relative frequency of the event \(A.\)

It turns out that the relative frequencies observed in different series of trials are virtually the same for large \(n,\) clustering about some constant \(P(A),\) which is called the probability of the event \(A.\)

This is called the frequentist interpretation of probability.

The key idea in the above definition is to be able to:

produce a whole series of “independent trials under identical conditions”

Which, when you think about it, is really a rather tricky notion.

Nevertheless, the frequentist view of probability is quite useful and we will often use it in this course.

You can think of it as treating each event as a sort of idealized coin-flip.

In other words, when we use the frequentist view, we will generally be thinking of a somewhat abstract situation where we assume that “independent trials under identical conditions” is a good description of the situation.

The Bayesian View

To understand the Bayesian view of probability, consider the following situations.

On a visit to the doctor, we may ask, “What is the probability that I have disease X?”

Or, before digging a well, we may ask, “What is the probability that I will strike water?”

These questions are totally incompatible with the notion of “independent trials under identical conditions”!

Either I do, or do not, have disease X. Either I will, or will not, strike water.

What we are really asking is

  • “How certain should I be that I have disease X?”
  • “How certain should I be that I will strike water?”

In this setting, we are using probability to encode “degree of belief” or a “state of knowledge.”

This is called the Bayesian interpretation of probability.

Somewhat amazingly, it turns out that whichever way we think of probability (frequentist or Bayesian),

… the rules that we use for computing probabilities are exactly the same.

This is a very deep and surprising thing.

In other words, when we use probability models in this course, it is often a “state of knowledge” that we are really talking about.

A thing appears random only through the incompleteness of our knowledge.

Spinoza, Ethics, Part 1

In other words, we use probability as an abstraction that hides details we don’t want to deal with.

So it’s important to recognize that both frequentist and Bayesian views are valid and useful views of probability.

Any simple idea is approximate; as an illustration, consider an object … what is an object? Philosophers are always saying, “Well, just take a chair for example.” The moment they say that, you know that they do not know what they are talking about any more. What is a chair? … every object is a mixture of a lot of things, so we can deal with it only as a series of approximations and idealizations.

The trick is the idealizations.

Richard Feynman, The Feynman Lectures on Physics, 12-2

Here is an illustration of this principle applied to probability:

In a serious work … an expression such as “this phenomenon is due to chance” constitutes simply, an elliptic form of speech. … It really means “everything occurs as if this phenomenon were due to chance,” or, to be more precise: “To describe, or interpret or formalize this phenomenon, only probabilistic models have so far given good results.”

Georges Matheron, Estimating and Choosing: An Essay on Probability in Practice

So, now, let’s talk about rules for computing probabilities.

Sample Space and Events

Definition. The sample space \(\Omega\) is the set of all possible outcomes of an experiment.

Definition. An event \(E\subset\Omega\) is an outcome, or a set of outcomes of an experiment.

Examples:

  • Rolling a die: \(\Omega = \{1, 2, 3, 4, 5, 6 \}\), \(E = \text{Rolling a 2}.\)
  • Flipping a coin 2 times: \(\Omega = \{ (H, H), (H, T), (T, H), (T, T)\}\), \(E=\text{Flipping one head and one tail}\).
  • Distance a car travels before breaking down: \(\Omega = \mathbb{R}_{+}\), \(E=\text{ Travels greater than 100 miles}.\)
  • Location of a dart thrown at a target. What is \(\Omega\) in this case? What is an event \(E\)?

The sample space \(\Omega\) may be continuous or discrete and bounded or unbounded.

Probability and Conditioning

Definition. Consider a sample space \(\Omega\). A probability measure on \(\Omega\) is a function \(P(\cdot)\) defined on all the subsets of \(\Omega\) (the events) such that:

  1. \(P(\Omega) = 1\)
  2. For any event \(A \subset \Omega\), \(P(A) \geq 0.\)
  3. For any events \(A, B \subset \Omega\) where \(A \cap B = \emptyset\), \(P(A \cup B) = P(A) + P(B)\).
  4. When \(A \cap B \neq \emptyset\), \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\). (This is a consequence of the first three properties.)

Often we want to ask how a probability measure changes if we restrict the sample space to be some subset of \(\Omega\).

This is called conditioning.

Definition. The conditional probability of an event \(A\) given that event \(B\) (having positive probability) is known to occur, is

\[ P(A|B) = \frac{P(A \cap B)}{P(B)}~\text{where}~P(B) > 0 \]

The function \(P(\cdot|B)\) is a probability measure over the sample space \(B\).

Note that in the expression \(P(A|B)\), \(A\) is random but \(B\) is fixed.

Now if \(B\) is a proper subset of \(\Omega,\) then \(P(B) < 1\). So \(P(\cdot|B)\) is a rescaling of the quantity \(P(A\cap B)\) so that \(P(B|B) = 1.\)
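To make conditioning concrete, here is a minimal sketch that computes a conditional probability by direct enumeration. The two-dice experiment and the events \(A\) and \(B\) are illustrative choices, not examples from above.

Code
# Conditional probability by enumeration over a finite sample space:
# two rolls of a fair die (an illustrative experiment).
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # all 36 equally likely outcomes
A = {w for w in omega if w[0] + w[1] == 8}    # event: the two rolls sum to 8
B = {w for w in omega if w[0] == 3}           # event: the first roll is a 3

P = lambda E: Fraction(len(E), len(omega))    # uniform probability measure
print(P(A & B) / P(B))                        # P(A|B) = P(A ∩ B)/P(B) = 1/6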

Probability Multiplication Rule

The probability multiplication rule is a simple consequence of the definition of conditional probability.

\[ P(A \cap B) = P(A|B) P(B) \]

This can be extended to three events \(A\), \(B\), and \(C\) as follows:

\[ P(A \cap B \cap C) = P(A|B \cap C) P(B \cap C) = P(A|B \cap C)P(B|C)P(C) \]

Law of total probability

An important tool for computing probabilities is provided by the law of total probability.

Let \(B_1\) and \(B_2\) form the complete sample space \(\Omega\) and be disjoint \(\left(B_1 \cap B_2 = \emptyset\right)\). Then for any event \(A\),

\[\begin{align*} P(A) &= P(A\cap B_1) + P(A \cap B_2) \\ &= P(A|B_1)P(B_1)+P(A|B_2)P(B_2) \end{align*}\]
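As a quick numerical check of this law, here is a sketch with an illustrative choice of events on a single die roll (not an example from the text):

Code
# Law of total probability on a fair die: B1 = even, B2 = odd partition omega.
from fractions import Fraction

omega = set(range(1, 7))
B1, B2 = {2, 4, 6}, {1, 3, 5}     # disjoint, and B1 ∪ B2 = omega
A = {4, 5, 6}                     # event: roll is greater than 3

P = lambda E: Fraction(len(E), len(omega))
lhs = P(A)
rhs = (P(A & B1) / P(B1)) * P(B1) + (P(A & B2) / P(B2)) * P(B2)
print(lhs, rhs, lhs == rhs)       # 1/2 1/2 True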

Bayes’ Theorem

Bayes’ Theorem is a simple way of manipulating conditional probabilities in a way that can be very useful.

Let \(A,B\) be events, then

\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)}. \]

The terms in the above equation are often given the following names:

  • \(P(A)\) is the prior probability of \(A\): the initial belief before new evidence.
  • \(P(A|B)\) is the posterior probability of \(A\): the updated belief after new evidence.
  • \(P(B|A)\) is the likelihood of \(B\): the probability of observing the evidence given a particular belief.
  • \(P(B)\) is the marginal probability of \(B\): the total probability of observing the evidence under all possible causes.

We can extend this theorem to more than two events.

Start with a situation in which we are interested in two events, \(A_1\) and \(A_2\). Let \(B\) be any event in the sample space \(\Omega\).

Assume \(A_1\) and \(A_2\) are exhaustive, meaning that in any experiment either \(A_1\) or \(A_2\) must occur, and mutually exclusive. This means they form a partition of \(\Omega\), i.e., \(\Omega = A_1 \cup A_2\) and \(A_1 \cap A_2 = \emptyset\).

In this situation, Bayes’ theorem is:

\[\begin{align*} P(A_1|B) &= \frac{P(A_1 \cap B)}{P(B)} \\ &= \frac{P(A_1 \cap B)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2)} \\ &= \frac{P(B|A_1)P(A_1)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2)}. \end{align*}\]

This can be extended to the case where \(A_1, A_2, \ldots, A_n\) form a partition of \(\Omega\).

This formula is useful because often the probabilities \(P(B|A_i)\) can be estimated, while the probabilities \(P(A_i|B)\) may be hard to estimate.

Or perhaps \(P(B)\) itself (which is really the denominator) is easy to estimate.

We interpret this transformation as updating our estimate of the probability of each \(A_i\) based on new information, namely, that \(B\) is true.

This update transforms the prior probabilities \(P(A_i)\) into the posterior probabilities \(P(A_i|B)\).

Bayes’ Theorem Example

Empirical evidence suggests that among sets of twins, about 1/3 are identical.

We therefore take the prior probability that a pair of twins is identical to be 1/3.

Now, consider how a couple might update this probability after they get an ultrasound that shows that the twins are of the same gender.

What is their new estimate of the probability that their twins are identical?

Let \(I\) be the event that the twins are identical. Let \(G\) be the event that gender is the same via ultrasound.

The prior probability here is \(P(I)\).

What we want to calculate is the posterior probability \(P(I|G)\).

First, we note:

\[P(G|I) = 1.\]

Surprisingly, people are sometimes confused about that fact!

We also assume that if the twins are not identical, they are like any two siblings. This means the probability that they are the same gender is 1/2, i.e.,

\[P(G|\bar{I}) = 1/2.\]

We know from observing the population at large that among all sets of twins, about 1/3 are identical:

\[ P(I) = 1/3.\]

This statistic is easily obtained from data.

Then:

\[\begin{align*} P(I|G) &= \frac{P(G|I)P(I)}{P(G|I)P(I) + P(G|\bar{I})P(\bar{I})} \\ &= \frac{1 \cdot 1/3}{(1 \cdot 1/3) + (1/2 \cdot 2/3)} \\ &= \frac{1}{2} \end{align*}\]

So we have updated our estimate of the twins being identical from 1/3 (prior probability) to 1/2 (posterior probability).

We did this in a way that used quantities that were relatively easy to obtain or measure.
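As a sanity check, here is a small Monte Carlo sketch of the same calculation. The probabilities \(P(I) = 1/3\) and \(P(G|\bar{I}) = 1/2\) are the assumptions made above; the code simply simulates them.

Code
# Monte Carlo check of the twins example.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
identical = rng.random(n) < 1/3                  # prior: P(I) = 1/3
# identical twins always share a gender; fraternal twins do so half the time
same_gender = identical | (rng.random(n) < 1/2)
print(identical[same_gender].mean())             # approximately 0.5 = P(I|G)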

Independent Events

Definition. Two events \(A\) and \(B\) are independent if \(P(A\cap B) = P(A) \cdot P(B).\)


For example, consider rolling a die and flipping a coin. The result of the coin flip does not affect the result of the die roll. Therefore, the events are independent.

\[ \small P(\text{rolling 2 and flipping heads}) = P(\text{rolling 2}) \cdot P(\text{flipping heads}) = \frac{1}{6} \cdot \frac{1}{2} = \frac{1}{12}. \]


This is exactly the same as saying that \(P(A|B) = P(A).\) (Prove this to yourself using the definition of conditional probability.)


So we can see that the intuitive notion of independence is that “if one event occurs, that does not change the probability of the other event.”
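A short simulation of the die-and-coin example above (a sketch): the joint frequency should approach \(1/12 \approx 0.083\).

Code
# Estimate P(rolling 2 and flipping heads) by simulation.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
die = rng.integers(1, 7, size=n)          # fair die: 1..6
coin = rng.integers(0, 2, size=n)         # 1 = heads
print(((die == 2) & (coin == 1)).mean())  # approximately 1/12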

Random Variables

Random Variables

When an experiment is performed, we might be interested in the actual outcome. However, more frequently we are interested in some function of the outcome. This leads us to the notion of a random variable.

Definition. A random variable \(X\) is a function \(X:\Omega\rightarrow \mathbb{R}\).

We generally use capital letters to denote random variables and lowercase to denote non-random quantities.

We distinguish between discrete and continuous random variables.

Discrete random variables

A discrete random variable is a random variable that can take on only a finite or at most a countably infinite number of values.

Examples

  • The number of points showing after a roll of a die.
  • The number of heads after flipping a coin twice.
  • The number of people arriving at a train station between 8 and 9 AM.

Continuous random variables

A continuous random variable is a random variable that can take on any value from a given range. That is, a continuous random variable has an uncountable set of possible outcomes.

Examples

  • The lifetime of a light bulb.
  • The distance a car travels before breaking down.

Distributions

To collect information about what values of random variable are more probable than others, we introduce the concept of distributions.

We have two options to consider:

  • discrete distributions using the probability mass function (PMF), and
  • continuous distributions using the probability density function (PDF).

We will also introduce the cumulative distribution functions corresponding to both PMFs and PDFs.

Discrete Distributions

Probability Mass Function

Definition. For a discrete random variable \(X\), we define the probability mass function (PMF) \(p(a)\) of \(X\) by

\[p(a) = P(X=a).\]

The PMF \(p(a)\) is positive for at most a countable number of values of \(a\). That is, if \(X\) must assume one of the values \(x_1, x_2,...,\) then

\[\begin{align*} &p(x_i) \geq 0 \: \text{ for } i=1,2,... \\ &p(x) = 0 \: \text{ for all other values of } x. \end{align*}\]

Since \(X\) must take on one of the values \(x_i\), we have

\[\sum_{i=1}^{\infty} p(x_i) = 1.\]

Example. Consider the roll of a single die. The random variable here is the number of points showing. What is the PMF of this discrete random variable?

We assign equal probabilities to each outcome \(\Omega = \{1, 2, 3, 4, 5, 6\}\):

\[p(x_i) = \frac{1}{6}.\]

Code
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(5, 3))
x = np.arange(1, 7)
plt.plot(x, 6*[1/6.], 'bo', ms=8)
plt.vlines(x, 0, 1/6., colors='b', lw=5, alpha=0.5)
plt.xlim([0.5, 6.5])
plt.ylim([0, 1.1])
plt.xlabel(r'$x$ (Number of points showing)', size=14)
plt.ylabel(r'$P(X = x)$', size=14)
plt.show()

Discrete Cumulative Distribution Function

Definition. The cumulative distribution function (CDF) \(F\) can be expressed in terms of \(p(a)\) by

\[F(a) = P(X \leq a) = \sum_{x\leq a} p(x).\]

If \(X\) is a discrete random variable whose possible values are \(x_1, x_2, x_3, \ldots\), where \(x_1 < x_2 < x_3 < \ldots\), then the distribution function \(F\) of \(X\) is a step function. That is, the value of \(F\) is constant in the intervals \([x_{i-1},x_i)\) and then takes a step (or jump) of size \(p(x_i)\) at \(x_i\).

Example. Let us return to the roll of a single die. The corresponding CDF is shown below.

Code
plt.figure()
for i in range(7):
    plt.plot([i, i+1-.08], [i/6, i/6],'-b')
for i in range(1, 7):
    plt.plot(i, i/6, 'ob')
    plt.plot(i, (i-1)/6, 'ob', fillstyle = 'none')
plt.xlim([0, 7])
plt.ylim([-0.05, 1.1])
plt.title('Cumulative Distribution Function (CDF)')
plt.xlabel(r'$x$ (Number of points showing)', size=14)
plt.ylabel(r'$P(X\leq x)$', size=14)
plt.show()

Continuous Distributions

Probability Density Function

Continuous random variables are described by their probability density functions (PDFs). If \(f\) is a PDF, then it has the following properties:

  • \(f(x) \geq 0,\)
  • \(\int_{- \infty}^{\infty} f(x) dx = 1.\)

The probability density function plays a central role in probability theory, because all probability statements about a continuous random variable can be answered in terms of its PDF.

For instance, for a continuous variable \(X\) where \(P(X\leq a) = \int_{-\infty}^a f(x) dx\), we obtain

\[ P(a \leq X \leq b) = \int_{-\infty}^b f(x) dx - \int_{-\infty}^a f(x) dx = \int_a^b f(x) dx. \]

Continuous Cumulative Distribution Function

Definition. The cumulative distribution function (CDF) \(F\) can be expressed as

\[F(a) = P(X\leq a) = \int_{-\infty}^a f(x) dx.\]

So continuing from the previous slide:

\[ P(a \leq X \leq b) = \int_{-\infty}^b f(x) dx - \int_{-\infty}^a f(x) dx = F(b) - F(a). \]

The relationship between the continuous CDF and PDF is

\[f(x) = \frac{d F(x)}{dx}.\]

The above formula tells us that the PDF is the derivative of the CDF.
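Here is a small numerical illustration of that relationship (a sketch using the standard Gaussian from scipy): a finite-difference derivative of the CDF closely matches the PDF.

Code
# The PDF as the (numerical) derivative of the CDF.
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 601)
dF = np.gradient(norm.cdf(x), x)         # numerical derivative of the CDF
print(np.max(np.abs(dF - norm.pdf(x))))  # small: finite-difference error only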

Here is an example of a continuous CDF of some random variable.

Code
from scipy.stats import norm
plt.figure(figsize=(5, 3))
x = np.linspace(norm.ppf(0.001), norm.ppf(0.999), 100)
plt.plot(x, norm.cdf(x),'b-', lw=5, alpha=0.6)
plt.title('Cumulative Distribution Function (CDF)')
plt.xlabel(r'$x$', size=14)
plt.ylabel(r'$P(X\leq x)$', size=14)
plt.show()

Characterizing Random Variables

Definition. The expected value \(E[X]\) of a random variable \(X\) is the probability-weighted sum or integral of all possible values of the R.V.

For a discrete random variable, this is:

\[E[X] \equiv \sum_{x} x \cdot P(X=x).\]

For a continuous random variable with PDF \(p(x)\):

\[E[X] \equiv \int_{-\infty}^{+\infty} x p(x) dx.\]
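For instance, here is the discrete formula applied to a fair die:

Code
# Expected value of a fair die from the discrete formula.
import numpy as np

x = np.arange(1, 7)      # possible values
p = np.full(6, 1/6)      # P(X = x) for each value
print(np.sum(x * p))     # E[X] = 3.5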

The expected value is also called the average or the mean, although we prefer to reserve those terms for empirical statistics (actual measurements, not idealizations like these formulas).

The expected value is in some sense the “center of mass” of the random variable. It is often denoted \(\mu\).

The expected value is usually a quite useful characterization of the random variable.

However, be careful: in certain situations, it may not be very informative, or important.

In some cases a random variable may not ever take on the expected value as a possible value.

  • As an example, consider the single die, whose expected value is 3.5
  • In other cases the notion of expected value isn’t useful. Consider a person with their head in an oven and their feet in a freezer who claims “on average I feel fine.”

Importantly, the mean may not be very informative when observations are highly variable.

The variability of random quantities is crucially important in order to characterize them.

For this we use variance, the mean squared difference of the random variable from its expected value.

Definition. The variance of a random variable \(X\) is

\[ \text{Var}(X) \equiv \sigma^2 \equiv E[(X - E[X])^2]. \]

For example, given a discrete R.V. with \(E[X] = \mu\) this would be:

\[ \sigma^2 = \text{Var} (X) = \sum_{x} (x-\mu)^2 P(X=x). \]

We use the symbol \(\sigma^2\) to denote variance.

The units of variance are the square of the units of the mean.

So to compare variance and mean in the same units, we take the square root of the variance.

This is called the standard deviation and is denoted \(\sigma\).
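Continuing the die example, here are the variance and standard deviation computed from the discrete formula:

Code
# Variance and standard deviation of a fair die.
import numpy as np

x = np.arange(1, 7)
p = np.full(6, 1/6)
mu = np.sum(x * p)                # 3.5
var = np.sum((x - mu)**2 * p)     # 35/12, about 2.92
print(var, np.sqrt(var))          # sigma is about 1.71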

Next, let’s recall the case of the Tesla and NVidia returns from the previous lecture:

Code
import pandas as pd
import yfinance as yf

stocks = ['TSLA', 'NVDA']
df = pd.DataFrame()
for s in stocks:
    df[s] = yf.download(s, start='2023-01-01', end='2023-12-31', progress=False)['Close']

rets = df.pct_change(30)
rets[['TSLA', 'NVDA']].plot(lw=2)
plt.legend(loc='best')
plt.show()

Treating these two time-series as random variables, we are interested in how they vary together.

This is captured by the concept of covariance.

Definition. For two random variables \(X\) and \(Y\), their covariance is defined as:

\[ \text{Cov}(X,Y) = E\left[(X-\mu_X)(Y-\mu_Y)\right]. \]

If covariance is positive, this tells us that \(X\) and \(Y\) tend to both be above their means together and both below their means together.

We will often denote \(\text{Cov}(X,Y)\) as \(\sigma_{XY}\).

If we are interested in asking “how similar” are two random variables, we want to normalize covariance by the amount of variance shown by the random variables.

The tool for this purpose is correlation, i.e., normalized covariance:

\[ \rho(X,Y) = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X \sigma_Y}. \]

\(\rho(X, Y)\) takes on values between -1 and 1.

If \(\rho(X, Y) = 0\) then \(X\) and \(Y\) are uncorrelated.

Note

Note that this is not the same thing as being independent! It just means there is no linear relationship between the two.

But independence does imply uncorrelatedness: for independent \(X\) and \(Y\), \(E[XY] = E[X]E[Y]\), so the covariance is zero.

\(\rho\) is sometimes called “Pearson \(r\)” after Karl Pearson who popularized it.
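Here is a sketch of computing \(\rho\) directly from its definition (using sample versions of the covariance and standard deviations) on synthetic data, checked against numpy's built-in:

Code
# Pearson correlation from the definition, on synthetic correlated data.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = 0.8 * x + 0.6 * rng.normal(size=1000)   # constructed to be correlated

cov = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov / (x.std() * y.std())
print(rho, np.corrcoef(x, y)[0, 1])         # the two values agree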

Let’s look at our example again:

Code
rets = df.pct_change(30)
rets[['TSLA', 'NVDA']].plot(lw=2)
plt.legend(loc='best')
plt.show()

Code
df.cov()
            TSLA        NVDA
TSLA  1757.018115  388.467438
NVDA   388.467438  115.701324

In the case of Tesla (\(X\)) and NVidia (\(Y\)) above, we find that

\[\text{Cov}(X,Y) \approx 388.\]

How similar are these random variables? Let’s compute \(\rho(X,Y).\)

Code
df.corr()
          TSLA      NVDA
TSLA  1.000000  0.861583
NVDA  0.861583  1.000000

We observe that

\[\rho(X,Y) \approx 0.86.\]

There appears to be some similarity between the two stocks.

Multivariate Random Variables

A multivariate random variable is a vector of random variables.

We often simply say a “random vector”.

That is,

\[\mathbf{X} = \left[\begin{array}{c}X_1 \\ X_2 \\ \vdots \\ X_n\end{array}\right].\]

The expected value of a random vector is obtained by taking the expected value of each component random variable:

\[ E[\mathbf{X}] = \mathbf{\mu_X} = \left[\begin{array}{c}E[X_1] \\ E[X_2] \\ \vdots \\ E[X_n]\end{array}\right].\]

To properly characterize the variability of a random vector, we need to specify all the covariances of pairs of components.

We organize these values into a covariance matrix:

\[ \text{Cov}[\mathbf{X}] = \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1, X_2) & \dots & \text{Cov}(X_1, X_n) \\ \text{Cov}(X_2, X_1) & \text{Var}(X_2) & \dots & \text{Cov}(X_2, X_n)\\ \vdots & \vdots & \ddots & \vdots\\ \text{Cov}(X_n,X_1) & \text{Cov}(X_n, X_2) & \dots & \text{Var}(X_n)\\ \end{bmatrix}.\]

A couple things to note about a covariance matrix (both are verified numerically in the sketch below).

  1. The covariance matrix is symmetric (because \(\text{Cov}(X_i, X_j) = \text{Cov}(X_j, X_i)\)).
  2. The covariance matrix is positive semidefinite, which means \(\mathbf{z}^T \text{Cov}[\mathbf{X}] \mathbf{z}\geq 0\) for all vectors \(\mathbf{z}\).
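Here is a quick numerical check of both properties on an empirical covariance matrix (a sketch with synthetic data):

Code
# Symmetry and positive semidefiniteness of a covariance matrix.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))                    # 500 observations of a 4-vector
C = np.cov(X, rowvar=False)                      # 4 x 4 covariance matrix
print(np.allclose(C, C.T))                       # True: symmetric
print(np.all(np.linalg.eigvalsh(C) >= -1e-12))   # True: eigenvalues >= 0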

Random Variables as Vectors

When working with data, we will often treat multiple observations of some feature as samples of a random variable.

We will also typically organize the observations into a vector.

Recall our stock data:

Code
df.head()
                                 TSLA    NVDA
Date
2023-01-03 00:00:00+00:00  108.099998  14.315
2023-01-04 00:00:00+00:00  113.639999  14.749
2023-01-05 00:00:00+00:00  110.339996  14.265
2023-01-06 00:00:00+00:00  113.059998  14.859
2023-01-09 00:00:00+00:00  119.769997  15.628

Each column can be treated as a vector.

So:

  • Let’s say that our data frame df is represented as a matrix \(D\),
  • and that df['TSLA'] are observations of some random variable \(X\),
  • and df['NVDA'] are observations of some random variable \(Y\).

Let \(D\) have \(m\) rows (i.e., \(m\) observations).

Now, let us subtract from each column its mean, to form a new matrix \(\tilde{D}\).

In the new matrix \(\tilde{D}\), every column has zero mean.

Then notice the following: the Covariance matrix of \((X, Y)\) is simply \(\frac{1}{m}\tilde{D}^T\tilde{D}\).

To see this compute

\[\begin{align*} \text{Cov}(X,Y) &= E\left[(X-\mu_X)(Y-\mu_Y)\right] \\ &= \frac{1}{m} \sum_i (\tilde{D}_{i1} \cdot \tilde{D}_{i2})\\ &= \frac{1}{m}\;\tilde{d}_1^T\tilde{d}_2, \end{align*}\]

where \(\tilde{d}_1\) and \(\tilde{d}_2\) are the columns of \(\tilde{D}\).

This shows that covariance is actually an inner product between normalized observation vectors.
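Here is a numerical verification of this identity on synthetic data (a sketch): after centering the columns of \(D\), \(\frac{1}{m}\tilde{D}^T\tilde{D}\) matches the covariance matrix computed with the \(1/m\) normalization.

Code
# Covariance matrix as (1/m) D~^T D~ on centered data.
import numpy as np

rng = np.random.default_rng(4)
D = rng.normal(size=(200, 2))              # m = 200 observations, 2 features
D_tilde = D - D.mean(axis=0)               # subtract each column's mean
m = D.shape[0]
print((D_tilde.T @ D_tilde) / m)
print(np.cov(D, rowvar=False, bias=True))  # bias=True normalizes by m, matching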

Low and High Variability

Historically, most sources of random variation that have concerned statisticians are instances of low variability.

The original roots of probability in the study of games of chance, and later in the study of biology and medicine, have mainly studied objects with low variability.

Note that by “low variability” I don’t mean that such variability is unimportant.

Some examples of random variation in this category are:

  • the heights of adult humans,
  • the number of trees per unit area in a mature forest,
  • the sum of 10 rolls of a die,
  • the time between emission of subatomic particles from a radioactive material.

In each of these cases, there is a range of values that are “typical,” and values far above that range essentially never occur.

On the other hand, there are some situations in which variability is quite different.

In these cases, there is no real “typical” range of values, and arbitrarily large values can occur with non-negligible frequency.

Some examples in this category are

  • the distribution of wealth among individuals in society,
  • the sizes of human settlements,
  • the areas burnt in forest fires,
  • the runs of gains and losses in various financial markets over time,
  • and the number of collaborators a scholar has over their lifetime.

Example

The banking system (betting against rare events) just lost [more than] 1 Trillion dollars (so far) on a single error, more than was ever earned in the history of banking.

Nassim Nicholas Taleb, September 2008

An example of a run of observations showing high variability. This figure shows the daily variations in a derivatives portfolio over the time-frame 1988-2008. About 99% of the variation over the 20 years occurs in a single day (the day the European Monetary System collapsed).

Important Distributions

Important Distributions

Now we will review certain distributions that come up over and over again in typical situations.

The Bernoulli Distribution

An experiment of a particularly simple type is one in which there are only two possible outcomes, such as

  • head or tail
  • success or failure
  • defective or non-defective component
  • patient recovers or does not recover

Each distribution has one or more parameters. Parameters are settings that control the distribution. A Bernoulli distribution has one parameter, \(p\), which is the probability that the random variable is equal to 1.

Definition. It is said that a random variable \(X\) has a Bernoulli distribution with parameter \(p\) \((0\leq p \leq 1)\) if \(X\) can take only the values 0 and 1 and the corresponding probabilities are

\[P(X=1) = p \: \text{ and } \: P(X=0) = 1-p.\]

Note that there is a particularly concise way of writing the above definition:

\[ p(x) = P(X=x) = p^x (1-p)^{(1-x)} \: \text{ for } \: x = 0 \text{ and } x=1.\]

The mean of \(X\) is \(p\) and the variance of \(X\) is \(p(1-p)\).
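The Bernoulli distribution is available in scipy; this short sketch checks the PMF, mean, and variance for \(p = 0.3\):

Code
# Bernoulli PMF, mean, and variance in scipy.
from scipy.stats import bernoulli

p = 0.3
print(bernoulli.pmf([0, 1], p))   # [0.7, 0.3]
print(bernoulli.mean(p))          # p = 0.3
print(bernoulli.var(p))           # p(1-p) = 0.21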

The Binomial Distribution

The binomial distribution considers precisely \(N\) Bernoulli trials. Each trial has the probability of a success equal to \(p\). \(N\) and \(p\) are the parameters of the binomial distribution.

The binomial distribution answers the question “What is the probability there will be \(k\) successes in \(N\) trials?”

Definition. If \(X\) represents the number of successes that occur in \(N\) trials, then \(X\) is said to have a binomial distribution with parameters \(N\) and \(p\) \((0\leq p \leq 1)\). The PMF of a binomial random variable is given by

\[p(k) = P(X=k) = \binom{N}{k}\; p^k\; (1-p)^{N-k} \: \text{ for } k=0,1,\ldots,N.\]

The validity of the above PMF can be verified as follows. First we note that, by the assumed independence of trials, any given sequence of \(k\) successes and \(N-k\) failures has probability \(p^k \;(1-p)^{N-k}\). Then there are \(\binom{N}{k}\) different sequences of the \(N\) outcomes leading to \(k\) successes and \(N-k\) failures.

The mean of the Binomial distribution is \(pN\), and its variance is \(p(1-p)N\).

The PMF of the Binomial distribution with \(N = 10\) and \(p=0.3\) is shown below.

Code
from scipy.stats import binom
p = 0.3
x = np.arange(binom.ppf(0.01, 10, p), binom.ppf(0.9995, 10, p))
plt.ylim([0, 0.4])
plt.xlim([-0.5, max(x)+0.5])
plt.plot(x, binom.pmf(x, 10, p), 'bo', ms=8, label = 'binom pmf')
plt.vlines(x, 0, binom.pmf(x, 10, p), colors='b', lw = 5, alpha=0.6)
plt.title(f'Binomial PMF, $p$ = {p}, $N$ = 10', size=14)
plt.xlabel(r'$k$', size=14)
plt.ylabel(r'$P(X = k)$', size=14)
plt.show()

The Geometric Distribution

The geometric distribution concerns Bernoulli trials as well. It has only one parameter \(p\), the probability of success.

The geometric distribution answers the question: “What is the probability it takes \(k\) trials to obtain the first success?”

Definition. It is said that a random variable \(X\) has a geometric distribution with parameter \(p\) \((0\leq p \leq 1)\) if \(X\) has a discrete distribution with

\[ p(k) = P(X = k) = p(1-p)^{k-1} \qquad \text{for} \: k \geq 1.\]

The mean of the geometric distribution is equal to \(\frac{1}{p}\) and its variance is \(\frac{1-p}{p^2}\).

An example of the geometric PMF is given below.

Code
from scipy.stats import geom
p = 0.3
x = np.arange(geom.ppf(0.01, p), geom.ppf(0.995, p))
plt.ylim([0, 0.4])
plt.xlim([0.5, max(x)])
plt.plot(x, geom.pmf(x, p), 'bo', ms=8, label = 'geom pmf')
plt.vlines(x, 0, geom.pmf(x, p), colors='b', lw = 5, alpha = 0.6)
plt.title(f'Geometric PMF, $p$ = {p}', size=14)
plt.xlabel(r'$k$', size=14)
plt.ylabel(r'$P(X = k)$', size=14)
plt.show()

The Poisson Distribution

Although the Bernoulli trials underlie all of the previous distributions, they do not form the basis for the Poisson distribution. To introduce the Poisson distribution we will look at some examples of random variables that generally obey the Poisson probability law:

  1. The number of misprints on a page (or a group of pages) of a book;
  2. The number of people in a community who survive to age 100;
  3. The number of wrong telephone numbers that are dialed in a day;
  4. The number of customers entering a post office on a given day;
  5. The number of \(\alpha\)-particles discharged in a fixed period of time from some radioactive material.

In the above examples the events appear to happen at a certain rate, but completely at random (i.e., without a certain structure).

A Poisson distribution answers the question: “How many successes occur in a fixed amount of time?”

Definition. A random variable \(X\) that takes on one of the values \(0, 1, 2,...\) is said to have a Poisson distribution with parameter \(\lambda\) if, for some \(\lambda > 0\),

\[ p(k) = P(X=k) = \lambda^k \frac{e^{- \lambda}}{k!} \: \text{ for } k = 0, 1, 2,... \]

This \(p(k)\) defines a PMF, because

\[\sum_{k=0}^{\infty} p(k) = \sum_{k=0}^{\infty} \lambda^k \frac{e^{- \lambda}}{k!} = e^{- \lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{- \lambda} e^{\lambda} = 1.\]

Here, we used the fact that the exponential function \(e^{\lambda}\), can be expressed as series, \(\sum_{k=0}^{\infty} \frac{\lambda^k}{k!}\) for every real number \(\lambda\).

Both the mean and the variance of a Poisson distribution with parameter \(\lambda\) are equal to \(\lambda.\)

The PMF of the Poisson distribution with parameter \(\lambda = 3\) is shown below.

Code
from scipy.stats import poisson
mu = 3
x = np.arange(poisson.ppf(0.01, mu), poisson.ppf(0.9995, mu))
plt.xlim([-0.5, max(x)+0.5])
plt.ylim([0, 0.4])
plt.plot(x, poisson.pmf(x, mu), 'bo', ms=8, label='poisson pmf')
plt.vlines(x, 0, poisson.pmf(x, mu), colors='b', lw=5, alpha=0.6)
plt.title(rf'Poisson PMF, $\lambda$ = {mu}', size=14)
plt.xlabel(r'$k$', size=14)
plt.ylabel(r'$P(X = k)$', size=14)
plt.show()

The Poisson random variable has a tremendous range of applications in diverse areas. The reason for that is the fact that a Poisson distribution with mean \(Np\) may be used as an approximation for a binomial distribution with parameters \(N\) and \(p\) when \(N\) is large and \(p\) is small enough so that \(Np\) is of moderate size.
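Here is a quick check of that approximation (a sketch): with \(N = 1000\) and \(p = 0.003\), so that \(\lambda = Np = 3\), the two PMFs are nearly indistinguishable.

Code
# Poisson approximation to the binomial for large N, small p.
import numpy as np
from scipy.stats import binom, poisson

N, p = 1000, 0.003
k = np.arange(0, 11)
print(np.max(np.abs(binom.pmf(k, N, p) - poisson.pmf(k, N * p))))  # tiny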

The Poisson distribution has an interesting role in our perception of randomness.

The classic example comes from history.

In 1898 Ladislaus Bortkiewicz, a Russian statistician of Polish descent, was trying to understand why, in some years, an unusually large number of soldiers in the Prussian army were dying due to horse-kicks. In a single army corps, there were sometimes 4 such deaths in a single year. Was this just coincidence?

To assess whether horse-kicks were random (not following any pattern) Bortkiewicz simply compared the number per year to what would be predicted by the Poisson distribution.

Code
# note that this data is available in 'data/HorseKicks.txt'
horse_kicks = pd.DataFrame(
    data=np.array([
        [0, 108.67, 109],
        [1, 66.29, 65],
        [2, 20.22, 22],
        [3, 4.11, 3],
        [4, 0.63, 1],
        [5, 0.08, 0],
        [6, 0.01, 0]]),
    columns=["Number of Deaths Per Year",
             "Predicted Instances (Poisson)",
             "Observed Instances"])
horse_kicks.set_index("Number of Deaths Per Year", inplace=True)
horse_kicks
                           Predicted Instances (Poisson)  Observed Instances
Number of Deaths Per Year
0.0                                               108.67               109.0
1.0                                                66.29                65.0
2.0                                                20.22                22.0
3.0                                                 4.11                 3.0
4.0                                                 0.63                 1.0
5.0                                                 0.08                 0.0
6.0                                                 0.01                 0.0
Code
horse_kicks[["Predicted Instances (Poisson)","Observed Instances"]].plot.bar()
plt.xlabel("Number of Deaths Per Year", size=14)
plt.ylabel("Count", size=14)
plt.show()

The message here is that when events occur at random, we actually tend to perceive them as clustered.

Here is another example:

Which of these was generated by a random process occurring equally likely everywhere?

These images are from Steven Pinker’s book, The Better Angels of our Nature.

In the left figure, the number of dots falling into regions of a given size follows the Poisson distribution.

The Exponential Distribution

The exponential distribution is an example of a continuous distribution. It concerns a Poisson process and has one parameter, \(\lambda\), the rate of success.

Definition. A continuous random variable whose PDF is given, for some \(\lambda > 0\), by

\[f(x) = \;\left\{\begin{array}{cr} \lambda e^{-\lambda x} & \text {for } x \geq 0, \\ 0 & \text{otherwise}\end{array}\right.\]

is said to be exponentially distributed with parameter \(\lambda\).

The CDF, \(F(a)\), of an exponential random variable is given by

\[F(a) = P(X \leq a) = 1 - e^{-\lambda a} \mbox{ for } a \geq 0.\]

Its mean is \(1/\lambda\), and the variance is \(1/\lambda^2\).

The exponential distribution answers the question: “What is the probability it takes time \(x\) to obtain the first success?”

Code
from scipy.stats import expon
lam = 0.3
x = np.linspace(expon.ppf(0.01, scale=1/lam), expon.ppf(0.995, scale=1/lam), 100)
plt.plot(x, expon.pdf(x, scale=1/lam), 'b-', lw=5, alpha=0.6, label='expon pdf')
plt.title(rf'Exponential PDF, $\lambda$ = {lam}', size=14)
plt.xlabel(r'$x$', size=14)
plt.ylabel(r'$p(x)$', size=14)
plt.ylim([0, 0.4])
plt.xlim([0, 14])
plt.show()

The exponential is the continuous analog of the geometric distribution.

Code
from math import exp 
lam = 0.3
p = 1 - exp(- lam)
x = np.linspace(expon.ppf(0, scale=1/lam), expon.ppf(0.995, scale=1/lam), 100)
plt.plot(x, expon.cdf(x, scale=1/lam),'b-', lw = 5, alpha = 0.6, label='Exponential')
xg = np.arange(geom.ppf(0, p), geom.ppf(0.995, p))
plt.ylim([0, 1])
plt.xlim([-0.5, max(xg)])
plt.plot(xg, geom.cdf(xg, p), 'ro', ms = 8, label = 'Geometric')
plt.suptitle('Geometric and Exponential CDFs', size = 14)
plt.title(r'$\lambda = 0.3; \;\;\; p = 1-e^{-\lambda} =$' + f' {p:0.3f}', size=12)
plt.xlabel(r'$x$', size=14)
plt.ylabel(r'$P(X \leq x)$', size = 14)
plt.legend(loc = 'best')
plt.show()

The Uniform Distribution

The uniform distribution models the case in which all outcomes are equally probable.

It can be a discrete or continuous distribution.

We have already seen the uniform distribution in the case of rolls of a fair die.

Code
plt.figure()
x = np.arange(1, 7)
plt.plot(x, 6*[1/6.], 'bo', ms=8)
plt.vlines(x, 0, 1/6., colors='b', lw=5, alpha=0.5)
plt.xlim([0.5, 6.5])
plt.ylim([0, 1.1])
plt.xlabel(r'x (Number of points showing)', size=14)
plt.ylabel(r'$P(X = x)$', size=14)
plt.show()

There is an important relationship between the uniform and Poisson distributions.

When the time an event occurs is uniformly distributed, the number of events in a time interval is Poisson distributed.

You can replace “time” with “location”, and so on.

The reverse statement is true as well.

So a simple way to generate a picture like the scattered points above is to select the \(x\) and \(y\) coordinates of each point uniformly distributed over the picture size.
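Here is a sketch of that recipe, together with a check of the uniform-Poisson connection: scatter points uniformly over a unit square, count how many land in each cell of a grid, and compare the mean and variance of the counts (for a Poisson distribution the two are equal).

Code
# Uniform scatter -> approximately Poisson cell counts.
import numpy as np

rng = np.random.default_rng(5)
n_points, n_bins = 10_000, 10
x = rng.uniform(0, 1, n_points)
y = rng.uniform(0, 1, n_points)
counts, _, _ = np.histogram2d(x, y, bins=n_bins)  # counts per grid cell
# lambda = n_points / n_bins**2 = 100; mean and variance should both be near it
print(counts.mean(), counts.var())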

The Gaussian Distribution

The Gaussian Distribution is also called the Normal Distribution.

We will make extensive use of Gaussian distribution, for a number of reasons.

One of reasons we will use it so much is that it is a good guess for how errors are distributed in data.

This comes from the celebrated Central Limit Theorem. Informally,

The sum of a large number of independent observations from any distribution with finite variance tends to have a Gaussian distribution.

As a special case, the sum of \(n\) independent Gaussian variates is Gaussian.

Thus Gaussian processes remain Gaussian after passing through linear systems.

If \(X_1\) and \(X_2\) are Gaussian, then \(X_3 = aX_1 + bX_2\) is Gaussian.

One way of thinking of the Gaussian is that it is the limit of the Binomial when \(N\) is large, that is, the limit of the sum of many Bernoulli trials.

However, because of the central limit theorem, many other sums of random variables (not just Bernoulli trials) converge to the Gaussian.
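A quick simulation illustrates this (a sketch): sums of many Bernoulli trials, standardized by their mean and standard deviation, have sample moments close to those of the standard Gaussian.

Code
# CLT illustration: standardized binomial sums look standard-Gaussian.
import numpy as np

rng = np.random.default_rng(6)
N, p, trials = 1000, 0.3, 100_000
sums = rng.binomial(N, p, size=trials)         # each is a sum of N Bernoullis
z = (sums - N * p) / np.sqrt(N * p * (1 - p))  # standardize
print(z.mean(), z.var())                       # approximately 0 and 1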

The standard Gaussian distribution has mean zero and a variance (and standard deviation) of 1. The pdf of the standard Gaussian is:

\[p(x) = \frac{1}{\sqrt{2 \pi}} e^{-x^2/2}.\]

For an arbitrary Gaussian distribution with mean \(\mu\) and variance \(\sigma^2\), the pdf is simply the standard Gaussian that is relocated to have its center at \(\mu\) and its width scaled by \(\sigma\)

\[ p_{\mu,\sigma}(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}.\]

Code
from scipy.stats import norm
plt.figure()
x = np.linspace(norm.ppf(0.001), norm.ppf(0.999), 100)
plt.plot(x, norm.pdf(x),'b-', lw = 5, alpha = 0.6)
plt.title(r'Standard Gaussian PDF.  $\mu = 0, \sigma = 1$', size=14)
plt.xlabel('x', size=14)
plt.ylabel(r'$p(x)$', size=14)
plt.show()

Code
plt.figure()
x = np.linspace(norm.ppf(0.001), norm.ppf(0.999), 100)
plt.plot(x, norm.cdf(x),'b-', lw = 5, alpha = 0.6)
plt.title(r'Standard Gaussian CDF.  $\mu = 0, \sigma = 1$', size=14)
plt.xlabel('x', size=14)
plt.ylabel(r'$P(X\leq x)$', size=14)
plt.show()

Heavy Tails

Earlier we discussed high- and low-variability.

All of the distributions we have discussed so far have “light tails”, meaning that they show low variability.

In other words, extremely large observations are essentially impossible.

However in other cases, extremely large observations can occur. Distributions that capture this property are called “heavy tailed”.

Some examples of data that can be often modeled using heavy-tailed distributions:

  • The sizes of files in a file system.
  • The sizes of objects transferred over the Internet.
  • The execution time of jobs on a computer system.
  • The degree of nodes in a network (e.g., social network).

In practice, random variables that follow heavy tailed distributions are characterized as exhibiting many small observations mixed in with a few large observations.

In such datasets, most of the observations are small, but most of the contribution to the sample mean or variance comes from the rare, large observations.

The Pareto Distribution

The Pareto distribution is the simplest continuous heavy-tailed distribution.

Pareto was an Italian economist who studied income distributions. (In fact, income distributions typically show heavy tails.)

Its pdf is

\[ p(x) = \alpha k^{\alpha} x^{-\alpha-1}\;\;\; x \geq k,\; \;0 < \alpha \leq 2.\]

It takes on values in the range \([k, \infty)\).

Code
from scipy.stats import pareto
alpha = 1.3
x = np.linspace(pareto.ppf(0.005, alpha), pareto.ppf(0.995, alpha), 100)
plt.plot(x, pareto.pdf(x, alpha),'b-', lw = 5, alpha = 0.6, label='pareto pdf')
plt.title(r'Pareto PDF.  $\alpha$ = {}'.format(alpha), size=14)
plt.xlabel(r'x', size=14)
plt.ylabel(r'p(x)', size=14)
plt.show()

The variance of the Pareto distribution is infinite. (The corresponding integral diverges.)

In practice, this means that a new observation that significantly changes the sample variance is always possible, no matter how many samples of the random variable have already been taken.

The mean of the Pareto is \(\frac{k\alpha}{\alpha-1}\), for \(\alpha > 1\).

But note that as \(\alpha\) decreases, the variability of the Pareto increases.

In fact, for \(\alpha \leq 1\), the Pareto distribution has infinite mean. Again, in practice this means that a swamping observation for the mean is always possible.

Hence the running average of a series of Pareto observations with \(\alpha \leq 1\) will never converge to a fixed value, and the mean itself is not a useful statistic in this case.
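Here is a sketch of that behavior: running averages of Pareto samples for \(\alpha = 0.8\) (infinite mean) keep jumping as swamping observations arrive, while for \(\alpha = 1.5\) (finite mean) they settle down.

Code
# Running averages of Pareto samples for two values of alpha.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pareto

rng = np.random.default_rng(7)
n = 100_000
for alpha in [0.8, 1.5]:
    samples = pareto.rvs(alpha, size=n, random_state=rng)
    running_mean = np.cumsum(samples) / np.arange(1, n + 1)
    plt.plot(running_mean, label=rf'$\alpha$ = {alpha}')
plt.xlabel('Number of samples')
plt.ylabel('Running average')
plt.legend()
plt.show()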

The Multivariate Gaussian

The most common multivariate distribution we will work with is the multivariate Gaussian.

The multivariate normal distribution of a random vector \(\mathbf{X} = (X_1, \dots, X_k)^T\) is denoted

\[\mathbf{X} \sim \mathcal{N}(\mathbf{\mu}, \Sigma)\]

where \(\mathbf{\mu} = E[\mathbf{X}] = (E[X_1], \dots, E[X_k])^T\)

and \(\Sigma\) is the \(k \times k\) covariance matrix where \(\Sigma_{i,j} = \text{Cov}(X_i, X_j)\).

Here are some examples.

We’ll consider two-component random vectors:

\[\mathbf{X} = \begin{bmatrix}X_1\\X_2\end{bmatrix}.\]

And our first example will be a simple one

\[ \mathbf{\mu} = \begin{bmatrix}1\\1\end{bmatrix}\;\;\;\;\Sigma = \begin{bmatrix}1 & 0\\0 & 1\end{bmatrix}.\]

We see that the variance (and standard deviation) of each component is 1.

However the covariances are zero – the components are uncorrelated.

We will take 600 samples from this distribution.

Code
from scipy.stats import multivariate_normal
import seaborn as sns

np.random.seed(4)
df1 = pd.DataFrame(multivariate_normal.rvs(mean=np.array([1, 1]),
                                           cov=np.eye(2),
                                           size=600),
                   columns=['X1', 'X2'])
Code
g = sns.JointGrid(data = df1, x = 'X1', y = 'X2', height = 5)
g.plot(sns.scatterplot, sns.kdeplot)
g.ax_joint.plot(1, 1, 'ro', markersize = 6)
g.ax_marg_x.plot(1, 0, 'ro')
g.ax_marg_y.plot(0, 1, 'ro')
plt.show()

Code
g = sns.JointGrid(data = df1, x = 'X1', y = 'X2', height = 5)
g.plot(sns.kdeplot, sns.kdeplot)
g.ax_joint.plot(1, 1, 'ro', markersize = 6)
g.ax_marg_x.plot(1, 0, 'ro')
g.ax_marg_y.plot(0, 1, 'ro')
plt.show()

Code
np.random.seed(4)
df1 = pd.DataFrame(multivariate_normal.rvs(mean=np.array([1, 1]),
                                           cov=np.array([[1, 0.8], [0.8, 1]]),
                                           size=600),
                   columns=['X1', 'X2'])

Next, we look at the case

\[\mathbf{\mu} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \qquad \Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}.\]

Notice that \(\text{Cov}(X_1, X_2) = 0.8\).

We say that the components are positively correlated.

Nonetheless, the marginals are still Gaussian.

Code
g = sns.JointGrid(data = df1, x = 'X1', y = 'X2', height = 5)
g.plot(sns.scatterplot, sns.kdeplot)
g.ax_joint.plot(1, 1, 'ro', markersize = 6)
g.ax_marg_x.plot(1, 0, 'ro')
g.ax_marg_y.plot(0, 1, 'ro')
plt.show()

Code
g = sns.JointGrid(data = df1, x = 'X1', y = 'X2', height = 5)
g.plot(sns.kdeplot, sns.kdeplot)
g.ax_joint.plot(1, 1, 'ro', markersize = 6)
g.ax_marg_x.plot(1, 0, 'ro')
g.ax_marg_y.plot(0, 1, 'ro')
plt.show()

Code
np.random.seed(4)
df1 = pd.DataFrame(multivariate_normal.rvs(mean=np.array([1, 1]),
                                           cov=np.array([[1, -0.8], [-0.8, 1]]),
                                           size=600),
                   columns=['X1', 'X2'])

Next, we look at the case

\[\mathbf{\mu} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \qquad \Sigma = \begin{bmatrix} 1 & -0.8 \\ -0.8 & 1 \end{bmatrix}.\]

Notice that \(\text{Cov}(X_1, X_2) = -0.8\). We say that the components are negatively correlated or anticorrelated.

Code
g = sns.JointGrid(data = df1, x = 'X1', y = 'X2', height = 5)
g.plot(sns.scatterplot, sns.kdeplot)
g.ax_joint.plot(1, 1, 'ro', markersize = 6)
g.ax_marg_x.plot(1, 0, 'ro')
g.ax_marg_y.plot(0, 1, 'ro')
plt.show()

Code
g = sns.JointGrid(data = df1, x = 'X1', y = 'X2', height = 5)
g.plot(sns.kdeplot, sns.kdeplot)
g.ax_joint.plot(1, 1, 'ro', markersize = 6)
g.ax_marg_x.plot(1, 0, 'ro')
g.ax_marg_y.plot(0, 1, 'ro')
plt.show()

Finally, let’s look at our stock data:

Code
g = sns.JointGrid(data = df, x = 'TSLA', y = 'NVDA', height = 5)
g.plot(sns.scatterplot, sns.kdeplot)
g.ax_joint.plot(df.mean()['TSLA'], df.mean()['NVDA'], 'ro', markersize = 6)
plt.show()

Code
g = sns.JointGrid(data = df, x = 'TSLA', y = 'NVDA', height = 5)
g.plot(sns.kdeplot, sns.kdeplot)
g.ax_joint.plot(df.mean()['TSLA'], df.mean()['NVDA'], 'ro', markersize = 6)
plt.show()

Recall that the correlation between these two stocks was about 0.86.

That is, the stocks are positively correlated.

Confidence Intervals

Say you are concerned with some data that we take as coming from a random process.

You want to characterize it as accurately as possible. You measure it, yielding a single value.

How much does that value tell you? Can you rely on it as a description of the random process?

Let’s say you have a dataset and you compute its average value.

How certain are you that the average would be the same if you took another dataset from the same source (i.e., the same random process)?

We think of the hypothetical data source as a random variable with a true mean \(\mu\).

(Note that we are using frequentist style thinking here.)

We would like to find a range within which we are 90% sure that the true mean \(\mu\) lies.

In other words, we want the probability that the true mean lies in the interval to be 0.9.

This interval is then called the 90% confidence interval.

To be more precise: a confidence interval at level \(\gamma\) for a fixed but unknown parameter \(m\) is a random interval \((A,B)\), computed from the data, such that

\[P(A < m < B) \geq \gamma.\]

Note that \(m\) is fixed — it is not random.

What is random is the interval \((A, B)\).

This interval is constructed based on the data, which (by assumption) are random.

Confidence Intervals for the Mean

Imagine we have a set of \(n\) samples of a random variable, \(x_1, x_2, \ldots, x_n\). Let’s assume that the random variable has mean \(\mu\) and variance \(\sigma^2\).

An estimate of \(\mu\) is the empirical average of the samples, \(\bar{x}\).

Now, the Central Limit Theorem tells us that the sum of a large number \(n\) of random variables, each with mean \(\mu\) and variance \(\sigma^2\), yields a Gaussian random variable with mean \(n\mu\) and variance \(n \sigma^2\).

So the distribution of the average of \(n\) samples is approximately normal with mean \(\mu\) and variance \(\sigma^2 / n\). That is,

\[ \bar{x} \sim \mathcal{N}(\mu, \sigma/\sqrt{n}), \]

where the second parameter denotes the standard deviation \(\sigma/\sqrt{n}\).

We usually assume that the number of samples should be 30 or more for the CLT to hold.

While the specific value 30 is a bit arbitrary, we will usually be using very large samples (datasets) in this course for which this assumption is valid.

The standard deviation of the sample mean is called the standard error.

Notice that the standard error decreases as we increase the sample size, according to \(1/\sqrt{n}.\)

So it will turn out that using \(\bar{x}\), we can get increasingly “tight” estimates of \(\mu\) as we increase the number of samples \(n\).

Now, remember that the true mean \(\mu\) is a constant, while the empirical mean \(\bar{x}\) is a random variable.

Let us assume for a moment that we know the true \(\mu\) and \(\sigma\), and that we accept that \(\bar{x}\) has a \(N(\mu, \sigma/\sqrt{n})\) distribution.

Then it is true that

\[ P(\mu-k\sigma/\sqrt{n} < \bar{x} < \mu+k\sigma/\sqrt{n}) = P(-k < S < k) \]

where \(S\) is the standard Gaussian random variable (having distribution \(N(0,1)\)).

We write \(z_{1-\alpha/2}\) to be the \(1-\alpha/2\) quantile of the unit normal. That is,

\[ P(-z_{1-\alpha/2} < S < z_{1-\alpha/2}) = 1-\alpha.\]

So to form a 90% probability interval for \(S\) (centered on zero) we choose \(k = z_{0.95}\).

Code
from scipy.stats import norm
plt.figure()
x = np.linspace(norm.ppf(0.001), norm.ppf(0.999), 100)
x90 = np.linspace(norm.ppf(0.05), norm.ppf(0.95), 100)
plt.plot(x, norm.pdf(x),'b-')
plt.fill_between(x90, 0, norm.pdf(x90))
plt.title(r'90% region for Standard Gaussian', size = 14)
plt.xlabel('x', size = 14)
plt.ylabel(r'$p(x)$', size = 14)
plt.show()

Turning back to \(\bar{x}\), the 90% probability interval on \(\bar{x}\) would be:

\[ \mu-z_{0.95}\sigma/\sqrt{n} < \bar{x} < \mu+z_{0.95}\sigma/\sqrt{n}. \]

The last step: by a simple argument, we can show that the sample mean is in some fixed-size interval centered on the true mean, if and only if the true mean is also in a fixed-size interval (of the same size) centered on the sample mean.

This means that:

\[\begin{align*} 1-\alpha & = P(\mu-z_{1-\alpha/2}\,\sigma/\sqrt{n} < \bar{x} < \mu+z_{1-\alpha/2}\,\sigma/\sqrt{n}) \\ & = P(\bar{x}-z_{1-\alpha/2}\,\sigma/\sqrt{n} < \mu < \bar{x}+z_{1-\alpha/2}\,\sigma/\sqrt{n}). \end{align*}\]

This latter expression defines the \(1-\alpha\) confidence interval for the mean.

We are done, except for estimating \(\sigma\). We do this directly from the data: \(\hat{\sigma} = s\), where \(s\) is the sample standard deviation, that is,

\(s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}\).

To summarize: by the argument presented here, a \(100(1-\alpha)\%\) confidence interval for the population mean is given by

\[\bar{x} \pm z_{1-\alpha/2} \, \frac{s}{\sqrt{n}}. \]

As an example, a 95% confidence interval for the mean is the sample average plus or minus approximately two standard errors (since \(z_{0.975} \approx 1.96\)).
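Putting it all together, here is a sketch that computes a 95% confidence interval for the mean from a synthetic sample (the data are illustrative, not from the stock example):

Code
# 95% confidence interval for the mean of a synthetic sample.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
x = rng.normal(loc=10, scale=3, size=500)  # data with true mean 10

n = len(x)
x_bar = x.mean()
s = x.std(ddof=1)                          # sample standard deviation
z = norm.ppf(0.975)                        # z_{1 - alpha/2}, about 1.96
half_width = z * s / np.sqrt(n)
print(f'{x_bar:.3f} ± {half_width:.3f}')   # interval should cover 10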
