The central limit theorem is one of the most fundamental concepts in statistics. It allows us to test ideas about the world that we wouldn’t be able to otherwise. Consider, for example, wanting to know the mean household income for the United States. To get an exact answer, we would need to survey all 125 million households and ask them what their income is.
Assuming we employ 1000 people to call households, working 8 hours a day, finishing a call every 2 minutes, our survey would take about 1.5 years. Even if we got an answer, the mean would probably have changed during the time it took. Online surveys could accomplish this quicker, but that is a discussion for another day. Since this method isn’t practical, how can we say anything meaningful about incomes in the United States? Similarly, how can we determine a drug’s usefulness in fighting a disease without testing it on everyone who has the disease?
This post will provide the answer.
The standard textbook definition of the central limit theorem is:
If all possible random samples of size N are drawn from a population with mean μY and standard deviation σY, then as N becomes larger, the sampling distribution of sample means becomes approximately normal, with mean μY and standard deviation σY / √N.
To explain what this means, let’s go back to that U.S. income example. Say we only called 50 households instead of all 125 million. Taking a small subset of a larger population is known as a sample. Maybe this sample has 40 households that make $40 thousand, 5 households that make $10 thousand, and 5 households that make $1 million. This sample has a mean of $133 thousand.
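The arithmetic for that sample mean can be checked directly. A minimal sketch, taking the top households at exactly $1 million (the incomes are illustrative, not real data):

```python
# Hypothetical sample of 50 households: 40 at $40k, 5 at $10k, 5 at $1M.
incomes = [40_000] * 40 + [10_000] * 5 + [1_000_000] * 5

# The sample mean is just the total income divided by the sample size.
sample_mean = sum(incomes) / len(incomes)
print(sample_mean)  # 133000.0
```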
What if our sample of 50 only contained households that make $1 million? In this case, our mean would have been $1 million. Since sample estimates vary based on the randomness of the sample, how much confidence can we place in any one estimate? Can we infer anything about the population from a single sample? Not really.
However, what if we repeat this process and take note of each sample mean we get? We had one sample with a mean of $133 thousand, one with $1 million, and let’s continue this 1000 times. We would end up with a distribution of sample means that have their own mean. More importantly, this distribution will approach a normal distribution.
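The repeated-sampling process above can be sketched in a short simulation. The population below is a made-up stand-in for household incomes (a skewed distribution with a few very large values), not real census data; the point is only that the means of many samples of 50 cluster around the true population mean:

```python
import random
import statistics

random.seed(42)

# Hypothetical skewed "income" population (illustrative numbers only):
# mostly modest incomes with a small fraction of very large ones.
population = [40_000] * 8_000 + [10_000] * 1_000 + [1_000_000] * 1_000

# Draw 1000 samples of 50 households each and record each sample mean.
sample_means = [
    statistics.mean(random.sample(population, 50)) for _ in range(1000)
]

# Individual sample means vary widely, but their distribution centers
# on the true population mean and looks increasingly bell-shaped.
print(statistics.mean(population))    # true population mean: 133000
print(statistics.mean(sample_means))  # close to the true mean
```

Plotting a histogram of `sample_means` would show the familiar bell curve emerging, even though the population itself is nothing like normal.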
The central limit theorem tells us that no matter what the distribution of the population is, the distribution of sample means will approach a normal distribution as the sample size increases.
In case the remarkableness of this is understated: take any process in life, no matter how unknowable its underlying probability distribution, average up a bunch of samples from that process, and the distribution of those averages will be approximately normal.
Standard error of the mean
In our example, we now have a sampling distribution of sample means for U.S. income. What is the standard deviation of this distribution? Conceptually, the larger our sample size, the more likely we are to get sample means that are closer to each other. This would reduce the standard deviation. The standard deviation of our sampling distribution of sample means, or standard error of the mean, is:
Standard error of the mean = stdev(population)/sqrt(n)
Putting it all together
Let’s say we finished sampling 50 households 1000 times and now have a sampling distribution of sample means. Our mean of sample means is $80 thousand. Our standard error, the standard deviation of this distribution, is $10 thousand. How confident can we be that the true population mean is equal to $80 thousand?
Since the central limit theorem tells us this distribution is normal, we can use a Z table to find probabilities. For example, we can make a 95% confidence interval with a Z score of 1.96:
Lower limit = $80 – 1.96($10) = $60.4
Upper limit = $80 + 1.96($10) = $99.6
We can be 95% confident that the true population mean income is between $60.4 thousand and $99.6 thousand.
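That interval calculation can be written as a small helper. A sketch assuming the 1.96 z-score for a 95% interval, applied to the figures above (in dollars):

```python
def confidence_interval_95(sample_mean: float, std_error: float):
    """95% confidence interval using the normal z-score of 1.96."""
    margin = 1.96 * std_error
    return sample_mean - margin, sample_mean + margin

low, high = confidence_interval_95(80_000, 10_000)
print(low, high)  # 60400.0 99600.0
```

Swapping in a different z-score (e.g. 2.576 for 99%) widens the interval; more confidence always costs precision.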
We were able to take something largely unknowable, the mean income of all households in the United States, and determine the probability that it would lie within a certain range.
We accomplished this by repeatedly taking samples from the population and creating a distribution of the means of those samples. As long as the sample size is about 30 or more, the sampling distribution of the means will be approximately normal.
Since the distribution is normal, we can easily determine the probability that a certain sample mean will fall within a certain distance from the population mean. This allows us to conduct hypothesis tests and create confidence intervals.
The magic of the central limit theorem is that this holds for a whole range of applications, no matter how complicated the underlying population distribution is.