I am documenting my understanding of the Central Limit Theorem, because it seems foundational for statistics, which in turn seems foundational for machine learning theory. For example, the theorem is the reason we can assume that noise/error is normally distributed in many situations.

This post assumes some knowledge of elementary calculus and statistics, but I will try to remind you of the important facts from those topics along the way, in case they have been forgotten.

What is the Central Limit Theorem?

The Central Limit Theorem states that the sample mean of a distribution will be approximately normally distributed for large sample sizes, regardless of the distribution from which we are sampling.

If we draw $n$ independent and identically distributed (i.i.d.) random samples $\{X_1, X_2, …, X_n\}$ from a distribution, and define:

  • Mean $E[X_i] = \mu$
  • Finite variance $Var(X_i) = \sigma^2$
  • Sum of samples $S_n = X_1 + X_2 + … + X_n$
  • Sample mean $\bar{X_n} = \frac{S_n}{n}$

Then the sample mean $\bar{X_n}$ approaches a normal distribution $N(\mu, \frac{\sigma^2}{n})$ as $n \rightarrow \infty$.
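To see this concretely, here is a minimal simulation sketch (assuming Python with numpy; the Exponential(1) source distribution, which has $\mu = 1$ and $\sigma^2 = 1$ and is far from normal, is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000        # sample size
trials = 10_000  # number of independent sample means to simulate

# Draw `trials` samples of size n from Exponential(1): mu = 1, sigma^2 = 1.
samples = rng.exponential(scale=1.0, size=(trials, n))
sample_means = samples.mean(axis=1)

# The CLT predicts sample_means is approximately N(mu, sigma^2 / n).
print(sample_means.mean())  # should be close to mu = 1
print(sample_means.var())   # should be close to sigma^2 / n = 0.001
```

A histogram of `sample_means` should look like a bell curve centered at $\mu$, even though the exponential distribution itself is heavily skewed.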

An equivalent formulation is that the normalized sample mean (or the normalized sum of samples) $Z = \frac{\bar{X_n} - E[\bar{X_n}]}{\sqrt{Var(\bar{X_n})}} = \frac{S_n - E[S_n]}{\sqrt{Var(S_n)}} = \sqrt{n}(\frac{\bar{X_n} - \mu}{\sigma})$ approaches the standard normal distribution $N(0, 1)$ as $n \rightarrow \infty$. Derivation from our initially stated definition:

  • $\bar{X_n} \rightarrow N(\mu, \frac{\sigma^2}{n})$
  • $\bar{X_n} - \mu \rightarrow N(0, \frac{\sigma^2}{n})$
  • $\sqrt{\frac{n}{\sigma^2}}(\bar{X_n} - \mu) \rightarrow \sqrt{\frac{n}{\sigma^2}}N(0, \frac{\sigma^2}{n})$
    • Recall that $Var(aX) = a^2Var(X)$ for scalar $a$ and variable $X$
  • $\sqrt{n}(\frac{\bar{X_n} - \mu}{\sigma}) \rightarrow N(0,(\frac{n}{\sigma^2})(\frac{\sigma^2}{n}))$
  • $\sqrt{n}(\frac{\bar{X_n} - \mu}{\sigma}) \rightarrow N(0, 1)$
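The same simulation idea can check the normalized form directly; a self-contained sketch under the same assumed Exponential(1) source:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000, 10_000
mu, sigma = 1.0, 1.0  # Exponential(1) has mean 1 and standard deviation 1

sample_means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)
Z = np.sqrt(n) * (sample_means - mu) / sigma  # normalized sample means

print(Z.mean(), Z.std())  # should be close to 0 and 1, matching N(0, 1)
```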

Proof using moment generating functions (MGFs)

Here I will document an informal proof of the Central Limit Theorem. It is based on moment generating functions, simply because I learned about them recently.

Prerequisites: What is a moment, and what is a moment generating function?

The $n$th moment of a random variable $X$ is $E[X^n]$. The first moment is known as the mean $E[X] = \mu$. The second central moment is known as the variance $Var(X) = E[(X - \mu)^2] = \sigma^2$.

A moment generating function $M_X(t)$ for a random variable $X$ is a function that can generate the moments of $X$. It is useful because it is often easier to get the moments from it than to compute the defining integrals directly.

$M_X(t) = E[e^{tX}]$

To get the $n$th moment, we differentiate $M_X(t)$ $n$ times with respect to $t$, and then set $t = 0$.

Why does that work? Recall that the Taylor series of a function $f(x)$ that is infinitely differentiable at $a$ is $\sum_{n=0}^{\infty}\frac{f^{(n)}(a)}{n!}(x - a)^n = f(a) + \frac{f'(a)}{1!}(x - a) + \frac{f''(a)}{2!}(x - a)^2 + … + \frac{f^{(n)}(a)}{n!}(x - a)^n + …$

Let’s try the aforementioned method for getting the $n$th moment on the Taylor expansion at $a = 0$ of the moment generating function $M_X(t) = E[e^{tX}]$:

  • $= E[e^{0X} + \frac{(e^{0X}X)(t - 0)}{1!} + \frac{(e^{0X}X^2)(t - 0)^2}{2!} + … + \frac{(e^{0X}X^n)(t - 0)^n}{n!} + …]$
  • $= E[1 + \frac{Xt}{1!} + \frac{(Xt)^2}{2!} + … + \frac{(Xt)^n}{n!} + …]$
  • $= 1 + \frac{E[X]t}{1!} + \frac{E[X^2]t^{2}}{2!} + … + \frac{E[X^n]t^{n}}{n!} + …$
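Differentiating this series $n$ times with respect to $t$ removes the lower-order terms, and setting $t = 0$ removes the higher-order ones, leaving exactly $E[X^n]$. A small symbolic sketch of this (assuming sympy, and using Exponential(1), whose MGF $\frac{1}{1 - t}$ and moments $E[X^k] = k!$ are standard facts):

```python
import sympy as sp

t = sp.symbols('t')
M = 1 / (1 - t)  # MGF of Exponential(1), valid for t < 1

# The k-th derivative at t = 0 should equal the k-th moment E[X^k] = k!.
for k in range(1, 5):
    print(k, sp.diff(M, t, k).subs(t, 0))  # prints 1, 2, 6, 24
```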

Prerequisites: Properties of expectations and variances

Here are some important properties of expectations and variances that will be used in the proof:

  • $E[X + Y] = E[X] + E[Y]$
  • $E[aX + b] = aE[X] + b$, for scalars $a$ and $b$, and for variable $X$
  • $E[XY] = E[X]E[Y]$, if variables $X$ and $Y$ are independent
  • $Var(X + Y) = Var(X) + Var(Y)$, if variables $X$ and $Y$ are independent
  • $Var(aX) = a^2Var(X)$, for scalar $a$ and variable $X$
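These identities are easy to spot-check numerically; a minimal sketch (the uniform distributions and constants here are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000
X = rng.uniform(0, 1, N)  # two independent variables, arbitrarily chosen
Y = rng.uniform(2, 5, N)
a, b = 3.0, 2.0

print(np.mean(X + Y), np.mean(X) + np.mean(Y))  # E[X + Y] = E[X] + E[Y]
print(np.mean(a * X + b), a * np.mean(X) + b)   # E[aX + b] = aE[X] + b
print(np.mean(X * Y), np.mean(X) * np.mean(Y))  # E[XY] = E[X]E[Y] (independence)
print(np.var(X + Y), np.var(X) + np.var(Y))     # Var(X + Y) = Var(X) + Var(Y)
print(np.var(a * X), a**2 * np.var(X))          # Var(aX) = a^2 Var(X)
```

Each pair of printed numbers should agree to a few decimal places.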

Strategy for proof

A property of the MGF is that, when it exists in a neighborhood of $0$, it uniquely determines the distribution. So if random variables $A$ and $B$ have the same MGF, then they have the same distribution.

So we are going to prove that the MGF of the normalized sample mean of the random samples approaches the MGF of the standard normal distribution $N(0, 1)$ as $n \rightarrow \infty$.

Finding the MGF for N(0, 1)

Let’s find the MGF for the standard normal distribution $N(0, 1)$.

Recall that the probability density function of the normal distribution is: $f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x - \mu}{\sigma})^2}$

For $N(0, 1)$, this is $f(x; 0, 1) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}$.

$M_{X \sim N(0, 1)}(t) = E[e^{Xt}]$

  • $= \int_{-\infty}^{\infty}e^{xt}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}dx$
  • $= \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x^2 - 2xt)}dx$
  • $= \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x^2 - 2xt + t^2)}e^{\frac{t^2}{2}}dx$
  • $= e^{\frac{t^2}{2}}\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x^2 - 2xt + t^2)}dx$
  • $= e^{\frac{t^2}{2}}\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x - t)^2}dx$
    • $\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x - t)^2}dx$ must be 1, because it’s a Gaussian probability density function. Shifting x does not affect the integral.
  • $= e^{\frac{t^2}{2}}$
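This result can be double-checked symbolically; a sketch assuming sympy (which can evaluate this Gaussian integral in closed form):

```python
import sympy as sp

x, t = sp.symbols('x t', real=True)
pdf = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)  # standard normal density

# E[e^{tX}] for X ~ N(0, 1); should come out to e^{t^2/2}.
mgf = sp.integrate(sp.exp(t * x) * pdf, (x, -sp.oo, sp.oo))
print(sp.simplify(mgf))  # exp(t**2/2)
```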

Aside: Finding the MGF for $N(\mu, \sigma^2)$ for fun

To find the MGF for the general normal distribution $N(\mu, \sigma^2)$, we’ll try to separate the part that forms the standard normal distribution, because we already found the MGF for that above.

$M_{X \sim N(\mu, \sigma^2)}(t) = E[e^{Xt}]$

  • $= \int_{-\infty}^{\infty}e^{xt}\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x - \mu}{\sigma})^2}dx$
    • Let $z = \frac{x - \mu}{\sigma}$
    • Then $x = z\sigma + \mu$ and $\frac{dx}{dz} = \sigma$ and so $dx = \sigma dz$.
  • $= \int_{-\infty}^{\infty}e^{z\sigma t}e^{\mu t}\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{z^2}{2}}\sigma dz$
  • $= e^{\mu t}\int_{-\infty}^{\infty}e^{z\sigma t}\frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}}dz$
    • Recall from earlier that $\int_{-\infty}^{\infty}e^{xt}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}dx = e^{\frac{t^2}{2}}$. So we can substitute its variable $t$ with $\sigma t$ so that it looks like our integral, which will then be equivalent to $e^{\frac{(\sigma t)^2}{2}}$
  • $= e^{\mu t + \frac{\sigma^2 t^2}{2}}$
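A numerical spot check of this closed form (assuming scipy; the test values $\mu = 2$, $\sigma = 3$, $t = 0.5$ are arbitrary):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma, t = 2.0, 3.0, 0.5  # arbitrary test values

# Left side: E[e^{tX}] for X ~ N(mu, sigma^2), by numerical integration.
lhs, _ = quad(lambda x: np.exp(t * x) * norm.pdf(x, mu, sigma),
              -np.inf, np.inf)

# Right side: the closed form e^{mu t + sigma^2 t^2 / 2}.
rhs = np.exp(mu * t + sigma**2 * t**2 / 2)
print(lhs, rhs)  # should agree to several decimal places
```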

Finding the MGF for the normalized sample mean

Let $Z$ be the normalized version of the sample mean $\bar{X_n}$.

  • $\bar{X_n} = \frac{X_1 + X_2 + … + X_n}{n}$

  • $E[\bar{X_n}]$

    • $= E[\frac{X_1 + X_2 + … + X_n}{n}]$
    • $= \frac{1}{n}E[X_1 + X_2 + … + X_n]$
    • $= \frac{1}{n}(E[X_1] + E[X_2] + … + E[X_n])$
    • $= \frac{1}{n}n\mu$
    • $= \mu$
  • $Var(\bar{X_n})$

    • $= Var(\frac{X_1 + X_2 + … + X_n}{n})$
    • $= \frac{1}{n^2}Var(X_1 + X_2 + … + X_n)$
    • $= \frac{1}{n^2}(Var(X_1) + Var(X_2) + … + Var(X_n))$
    • $= \frac{1}{n^2}n\sigma^2$
    • $= \frac{\sigma^2}{n}$
  • $Z$

    • $= \frac{\bar{X_n} - E[\bar{X_n}]}{\sqrt{Var(\bar{X_n})}}$
    • $= \frac{\bar{X_n} - \mu}{\sqrt{\frac{\sigma^2}{n}}}$
    • $= \sqrt{n}(\frac{\bar{X_n} - \mu}{\sigma})$
    • $= (\frac{n\bar{X_n} - n\mu}{\sigma\sqrt{n}})$
    • $= (\frac{(\sum_{i=1}^{n}{X_i}) - n\mu}{\sigma\sqrt{n}})$
      • Note that as mentioned in the theorem definition section, this is the same as the normalized sum of the samples

Now let’s find the moment generating function for $Z$.

$M_Z(t) = E[e^{Zt}]$

  • $= E[e^{t(\frac{X_1 - \mu}{\sigma\sqrt{n}} + \frac{X_2 - \mu}{\sigma\sqrt{n}} + … + \frac{X_n - \mu}{\sigma\sqrt{n}})}]$
  • $= E[e^{t(\frac{X_1 - \mu}{\sigma\sqrt{n}})}e^{t(\frac{X_2 - \mu}{\sigma\sqrt{n}})} … e^{t(\frac{X_n - \mu}{\sigma\sqrt{n}})}]$
  • $= E[e^{t(\frac{X_1 - \mu}{\sigma\sqrt{n}})}]E[e^{t(\frac{X_2 - \mu}{\sigma\sqrt{n}})}] … E[e^{t(\frac{X_n - \mu}{\sigma\sqrt{n}})}]$
    • This uses $E[XY] = E[X]E[Y]$ for independent variables
  • $= E[e^{t(\frac{X - \mu}{\sigma\sqrt{n}})}]^n$
    • Since the $X_i$ are identically distributed, all $n$ expectations are equal; $X$ denotes a variable with their common distribution
  • $= (M_{\frac{X - \mu}{\sigma\sqrt{n}}}(t))^n$
    • Since $M_{\frac{X - \mu}{\sigma\sqrt{n}}}(t) = E[e^{(\frac{X - \mu}{\sigma\sqrt{n}})t}] = E[e^{(X - \mu)(\frac{t}{\sigma\sqrt{n}})}] = M_{X - \mu}(\frac{t}{\sigma\sqrt{n}})$, we can move the scaling constant into the argument of the MGF
  • $= (M_{X - \mu}(\frac{t}{\sigma\sqrt{n}}))^n$
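This factorization can be sanity-checked on a concrete distribution; a sketch assuming Exponential(1) samples ($\mu = \sigma = 1$), for which $M_X(s) = \frac{1}{1 - s}$ and hence $M_{X - \mu}(s) = \frac{e^{-s}}{1 - s}$ are standard results:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, t = 50, 100_000, 0.5
mu = sigma = 1.0  # Exponential(1)

# Empirical MGF of Z = (S_n - n*mu) / (sigma * sqrt(n)).
S = rng.exponential(1.0, size=(trials, n)).sum(axis=1)
Z = (S - n * mu) / (sigma * np.sqrt(n))
empirical = np.exp(t * Z).mean()

# Closed form (M_{X - mu}(t / (sigma sqrt(n))))^n with M_{X - mu}(s) = e^{-s}/(1 - s).
s = t / (sigma * np.sqrt(n))
closed_form = (np.exp(-s) / (1 - s)) ** n
print(empirical, closed_form)  # should be close, and both near e^{t^2/2} ~ 1.133
```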

We don’t know the distribution of $X - \mu$, so we can’t find its MGF the same way as with the normal distribution. Instead, we will write out the Taylor expansion of the inner MGF, $M_{X - \mu}(\frac{t}{\sigma\sqrt{n}})$.

Recall the Taylor series of a function $f(x)$ that is infinitely differentiable at $a$: $\sum_{n=0}^{\infty}\frac{f^{(n)}(a)}{n!}(x - a)^n = f(a) + \frac{f'(a)}{1!}(x - a) + \frac{f''(a)}{2!}(x - a)^2 + … + \frac{f^{(n)}(a)}{n!}(x - a)^n + …$

The Taylor expansion for $M_{X - \mu}(\frac{t}{\sigma\sqrt{n}})$ at $t = 0$:

  • $= M_{X - \mu}(0) + \frac{M'_{X - \mu}(0)}{1!}(\frac{t}{\sigma\sqrt{n}} - 0) + \frac{M''_{X - \mu}(0)}{2!}(\frac{t}{\sigma\sqrt{n}} - 0)^2 + …$
  • We are not going to expand this further, because we are interested in this function as the sample size $n \rightarrow \infty$, where the higher-order terms (which shrink like $\frac{1}{n^{3/2}}$ or faster) become negligible. From now on we’ll use $\approx$ to denote approximate equality that becomes exact for very large $n$. A numeric check of the resulting approximation appears after this list
  • $\approx M_{X - \mu}(0) + \frac{M'_{X - \mu}(0)}{1!}(\frac{t}{\sigma\sqrt{n}} - 0) + \frac{M''_{X - \mu}(0)}{2!}(\frac{t}{\sigma\sqrt{n}} - 0)^2$
    • Recall how we get the $i$th moment from an MGF: evaluating $M^{(i)}_{X - \mu}$ at $0$ yields $E[(X - \mu)^i]$, the $i$th moment of $X - \mu$
  • $\approx 1 + E[X - \mu] (\frac{(\frac{t}{\sigma\sqrt{n}})}{1!}) + E[(X - \mu)^2] (\frac{(\frac{t}{\sigma\sqrt{n}})^2}{2!})$
    • $E[X - \mu] = E[X] - \mu = 0$, since $\mu$ is a constant
    • $E[(X - \mu)^2] = Var(X) = \sigma^2$
  • $\approx 1 + 0 + \frac{\sigma^2(\frac{t}{\sigma\sqrt{n}})^2}{2}$
  • $\approx 1 + \frac{t^2}{2n}$
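Here is the numeric check promised above: comparing the exact $M_{X - \mu}(\frac{t}{\sigma\sqrt{n}})$ with the approximation $1 + \frac{t^2}{2n}$ for growing $n$, again assuming Exponential(1) so that $M_{X - \mu}(s) = \frac{e^{-s}}{1 - s}$ and $\sigma = 1$:

```python
import numpy as np

t = 0.5
for n in [10, 100, 1_000, 10_000]:
    s = t / np.sqrt(n)            # t / (sigma sqrt(n)) with sigma = 1
    exact = np.exp(-s) / (1 - s)  # M_{X - mu}(s) for Exponential(1)
    approx = 1 + t**2 / (2 * n)   # the second-order Taylor approximation
    print(n, exact, approx)       # gap shrinks like n^{-3/2}
```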

Now we put this result $M_{X - \mu}(\frac{t}{\sigma\sqrt{n}}) \approx 1 + \frac{t^2}{2n}$ back into $M_Z(t) = (M_{X - \mu}(\frac{t}{\sigma\sqrt{n}}))^n$:

  • $\approx (1 + \frac{t^2}{2n})^n$
    • Recall that $\lim_{n \to \infty}(1 + \frac{x}{n})^n = e^x$
  • $\rightarrow e^{\frac{t^2}{2}}$, as $n \rightarrow \infty$
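The limit itself is easy to watch converge numerically; a tiny sketch (the value $t = 0.5$ is an arbitrary choice):

```python
import numpy as np

t = 0.5
target = np.exp(t**2 / 2)  # e^{t^2/2} ~ 1.1331
for n in [1, 10, 100, 10_000, 1_000_000]:
    print(n, (1 + t**2 / (2 * n)) ** n, target)
```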

Conclusion: Comparing the MGFs

Now we have:

  • $M_{K \sim N(0, 1)}(t) = e^{\frac{t^2}{2}}$
  • $M_{Z}(t) \rightarrow e^{\frac{t^2}{2}}$, as $n \rightarrow \infty$

So $M_{Z}(t) \rightarrow M_{K \sim N(0, 1)}(t)$ as $n \rightarrow \infty$, which means that the distribution of the normalized sample mean approaches the standard normal distribution as sample size approaches infinity. This concludes the informal proof.
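As a closing sanity check, here is a sketch comparing simulated normalized sample means against $N(0, 1)$ with a Kolmogorov-Smirnov test (assuming scipy; the exponential source distribution is again an arbitrary choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, trials = 1_000, 5_000
mu = sigma = 1.0  # Exponential(1)

means = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
Z = np.sqrt(n) * (means - mu) / sigma

# Compare the empirical distribution of Z to the standard normal CDF.
print(stats.kstest(Z, 'norm'))  # a large p-value is consistent with N(0, 1)
```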

Acknowledgments

Special thanks to the following resource, which was helpful to me. The proof on this page uses the same strategy of comparing moment generating functions as theirs: