# Simulating and Visualising the Central Limit Theorem

Categories: Statistics, R

I completed a Computer Science degree at uni, and bundled a lot of maths subjects in as electives: partial differential equations, vector calculus, discrete maths, linear algebra. For some reason, however, I always avoided statistics subjects. Maybe there's a story to be told about a young person finding uncertainty uncomfortable, because twenty years later I find statistics, particularly the Bayesian flavour, really interesting.

One problem with a self-directed journey is that foundational knowledge has come to me in dribs and drabs, and one of the most foundational pieces is the Central Limit Theorem (CLT). In this post I want to interrogate and explore the CLT using simulation and visualisation, in an attempt to understand how it works in practice, not just in theory. This is predominantly a process to help me better understand the CLT; you're just here for the ride. Hopefully that ride can help you get where you need to go as well.

It's been a while since I've included any code in a post, so where it makes sense I'll show the generating R code, with a liberal sprinkling of comments so it's hopefully not too inscrutable.

## A Brief Recap

I don't want this to be like an online recipe with pages of back story before you get to the meat and bones, but a brief summary of the CLT before we begin is unavoidable. In plain English the CLT can be described as follows:

> "If you take repeated samples of size n from a distribution and calculate the sample mean for each, then as n approaches infinity, the distribution of sample means approaches a normal distribution."

For the classic CLT there are a couple of assumptions about the source distribution:

- The samples are drawn independently (no autocorrelation like in a time series).
- All the data points are drawn from the same distribution ("independent and identically distributed", or i.i.d.).
- The distribution has a finite mean and variance (e.g. no Cauchy or Pareto distributions).

There are other versions of the CLT in which some of these assumptions are relaxed, but we'll focus on the 'classic' version. Putting it in mathematical terms:

$$
\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \overset{d}{\longrightarrow} \mathcal{N}(0, 1)
$$

## Simulating

We're not the kind of people to just accept something because someone threw some fancy Greek symbols at us. We want to simulate it to give ourselves some confidence that it works in practice. Let's create a tibble of ten-thousand random values from six different distributions, which we'll call our 'population':

```r
# The tidyverse covers tibble, dplyr, tidyr, purrr, and ggplot2
library(tidyverse)

# 10,000 draws from six different distributions
population_data <- tibble(
  uniform     = runif(10000, min = -20, max = 20),
  normal      = rnorm(10000, mean = 0, sd = 4),
  binomial    = rbinom(10000, size = 1, prob = .5),
  beta        = rbeta(10000, shape1 = .9, shape2 = .5),
  exponential = rexp(10000, .4),
  chisquare   = rchisq(10000, df = 2)
)
```

**Six 'Population' Distributions, Ten-Thousand Values (First Five Rows)**

| uniform | normal | binomial | beta | exponential | chisquare |
|---:|---:|---:|---:|---:|---:|
| -15.243338 | -0.7022216 | 0 | 0.9473270 | 2.34764246 | 2.8571390 |
| -9.788476 | -3.2383413 | 1 | 0.0730441 | 6.00295709 | 0.9272733 |
| -11.432831 | -2.0977844 | 0 | 0.5160286 | 0.91033272 | 3.1607328 |
| 18.142428 | 2.5360707 | 1 | 0.9428468 | 3.08170008 | 1.9489974 |
| -7.078033 | -1.7224620 | 0 | 0.6134417 | 0.08687766 | 2.5317882 |
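Each of these six distributions has a finite mean and standard deviation, which is the key assumption the classic CLT needs. As a quick sanity check that the draws behave as expected, we can write down the theoretical moments and compare them to the simulated columns. This is a minimal sketch; the `theoretical_stats` tibble is purely illustrative and isn't used later in the post.

```r
# Theoretical means and SDs of the six populations (all finite, as the CLT requires)
theoretical_stats <- tribble(
  ~distribution,  ~theoretical_mean,  ~theoretical_sd,
  'uniform',      0,                  40 / sqrt(12),          # (b - a) / sqrt(12)
  'normal',       0,                  4,
  'binomial',     0.5,                sqrt(0.5 * (1 - 0.5)),  # Bernoulli: sqrt(p * (1 - p))
  'beta',         0.9 / (0.9 + 0.5),  sqrt(0.9 * 0.5 / ((0.9 + 0.5)^2 * (0.9 + 0.5 + 1))),
  'exponential',  1 / 0.4,            1 / 0.4,                # mean = sd = 1 / rate
  'chisquare',    2,                  sqrt(2 * 2)             # mean = df, variance = 2 * df
)

# Empirical means and SDs of the ten-thousand draws, for comparison
population_data |>
  summarise(across(everything(), list(mean = mean, sd = sd)))
```

The empirical values land close to the theoretical ones, so our 'population' is behaving itself.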
This 'wide' tibble is good for sampling from, but we'll also transform it into a long version, which will have other uses (note the `_l` suffix).

```r
# Long version of the random data
population_data_l <- population_data |>
  pivot_longer(cols = everything(), names_to = 'distribution')
```

**Six Distributions, Post `pivot_longer()` (First Value of Each)**

| distribution | value |
|---|---:|
| beta | 0.9473270 |
| binomial | 0.0000000 |
| chisquare | 2.8571390 |
| exponential | 2.3476425 |
| normal | -0.7022216 |
| uniform | -15.2433376 |

Here's a histogram of each of the population distributions:

Let's define a function `take_random_sample_mean()` which takes a random sample from each of the population distributions and calculates the sample mean (and, while we're at it, the sample standard deviation, which we'll need later). If we use this function repeatedly, we should end up with a data set that demonstrates the central limit theorem. Let's take 20,000 sample means of size 60, bind them all together into a single data frame, and reshape the result into a long version.

```r
# Define a function to take a random sample from our data and summarise it
take_random_sample_mean <- function(data, sample_size) {
  slice_sample(.data = data, n = sample_size) |>
    summarise(across(everything(), list(sample_mean = mean, sample_sd = sd)))
}

# Draw 20,000 sample means of size 60 from our random data
sample_size <- 60

sample_means <- map(
  1:20000,
  ~ take_random_sample_mean(population_data, sample_size = sample_size)
) |>
  # Bind the sample means into a single tibble
  list_rbind() |>
  # Move to a long version of the data
  pivot_longer(
    everything(),
    names_to = c('distribution', '.value'),
    names_pattern = '(\\w+)_(\\w+_\\w+)'
  )
```

Here are the resulting histograms of the sample means, with only the x-axis free to change scale. The beta and binomial look pretty normal, and the others might be as well, but the differences in variance make it difficult to tell.

If you recall the formula at the start, to get to a standard normal we also need to subtract the population mean and divide by the population standard deviation over the square root of n. Using the long version of the population data we can calculate the population mean and SD:

```r
population_data_stats <- population_data_l |>
  group_by(distribution) |>
  summarise(
    population_mean = mean(value),
    population_sd   = sd(value),
    n               = n()
  )
```

**Population Distributions, Statistics**

| distribution | population_mean | population_sd | n |
|---|---:|---:|---:|
| beta | 0.64352237 | 0.3105243 | 10000 |
| binomial | 0.50260000 | 0.5000182 | 10000 |
| chisquare | 2.03125146 | 2.0189741 | 10000 |
| exponential | 2.48144359 | 2.5049145 | 10000 |
| normal | -0.03205698 | 3.9813502 | 10000 |
| uniform | -0.09148261 | 11.6405771 | 10000 |

Joining those statistics onto the sample means by distribution allows us to standardise nicely.

```r
# CLT calculation
clt <- sample_means |>
  # Join in the population mean, SD, and n
  left_join(population_data_stats, by = 'distribution') |>
  # Scale to the standard normal
  mutate(
    clt = (sample_mean - population_mean) / (population_sd / sqrt(sample_size))
  )
```

That's better: they now at least all look the same, except for the binomial, which ends up with higher counts because of its discrete rather than continuous values.

While you can kind of guess that they're normal from a histogram, we can get a better sense from a quantile-quantile (Q-Q) plot. The standard normal quantiles are on the x-axis, and our sample mean quantiles are on the y-axis.
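If you'd like to draw something similar yourself, a minimal sketch of a facetted Q-Q plot of the standardised sample means might look like this (the styling choices here are my assumptions; the figures in the post may have been produced differently):

```r
# Q-Q plot of the standardised sample means against the standard normal
clt |>
  ggplot(aes(sample = clt)) +
  stat_qq(alpha = 0.2) +           # sample quantiles vs theoretical normal quantiles
  stat_qq_line(colour = 'red') +   # reference line for a perfect standard normal
  facet_wrap(~ distribution) +
  labs(x = 'Standard normal quantiles', y = 'Sample mean quantiles')
```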
They all track pretty closely to a standard normal; however, the chi-square and exponential do tend to diverge slightly at the ends. We'll dive a bit deeper into that later in the post.

Let's summarise: we took twenty-thousand sample means of size sixty from six wildly different distributions, and we saw that the distribution of these sample means approximately followed a normal distribution. This is the essence of the central limit theorem, and it's what allows us to use statistical methods built for normal distributions on problems involving population distributions that aren't normal.

## In Practice, With Mistakes

That's all well and good if you've got the population mean and standard deviation, but in most cases you're not going to have them. You're also not likely to have the resources to take twenty-thousand different samples. Let's take a pretty classic example: interviewing six people about something (weight, height, voting intentions, etc.) and taking the average, which is your sample mean. What the CLT can give you is an interval around your sample mean, a confidence interval (we'll use the classic 95%), for where the true mean of the population sits. So you use a bit of algebra, move some terms around in the CLT formula, and you get this:

$$
\bar{X} \pm z_{.025} \cdot \frac{s}{\sqrt{n}}
$$

where \(\bar{X}\) is your sample mean, \(z_{.025}\) is the critical value (aka `qnorm()` in R), and \(s\) is the sample standard deviation. The sample standard deviation over the square root of n is more commonly known as the standard error. More simply, we take the sample mean and add/subtract the 97.5th quantile of a normal distribution times the standard error to get our 95% interval. Some may have noticed a small mistake I've made here, which I'll leave in, but it will soon come to light.

We're in frequentist territory here, so while it's tempting to say "there's a 95% probability that the true mean is in the interval", we shouldn't. Why? Because to a frequentist the true mean \(\mu\) is a fixed value: it's either in the interval or it's not, and we can't assign a probability to the population mean. What we should say is that if we were to repeat the sampling process many times, in the long run the true mean would fall within the confidence interval 95% of the time.

We can simulate this to see if it's true. We'll use a sample size of 6, taking the mean of 6 draws from each of our population distributions, repeated 10,000 times.

```r
# Ten-thousand random sample means of size six
small_sample_size <- 6

repeated_samples <- map(
  1:10000,
  ~ take_random_sample_mean(population_data, sample_size = small_sample_size)
) |>
  # Bind the samples into a single tibble
  list_rbind() |>
  # Wide to long tibble
  pivot_longer(
    everything(),
    names_to = c('distribution', '.value'),
    names_pattern = '(\\w+)_(\\w+_\\w+)'
  ) |>
  # Join with our population statistics
  left_join(population_data_stats, by = 'distribution')
```

For each of the samples from each distribution we calculate the 95% confidence interval. Because we know the true population mean, we can determine whether or not it falls within the interval. Then we calculate the percentage of samples for which the population mean fell within the 95% interval.

```r
conf_intervals <- repeated_samples |>
  # Calculate CIs and whether each CI contains the true population mean
  mutate(
    upper_ci  = sample_mean + qnorm(0.975) * sample_sd / sqrt(small_sample_size),
    lower_ci  = sample_mean - qnorm(0.975) * sample_sd / sqrt(small_sample_size),
    within_ci = population_mean < upper_ci & population_mean > lower_ci
  ) |>
  # Determine the percentage of CIs that contain the true mean
  group_by(distribution) |>
  summarise(percent_within_ci = mean(within_ci))
```

| Distribution | % Within CI |
|---|---:|
| normal | 89.36% |
| uniform | 88.95% |
| beta | 87.87% |
| chisquare | 83.32% |
| exponential | 82.87% |
| binomial | 77.90% |

Uh oh! We're way off our 95% here. What happened?

This is the mistake I was talking about earlier: we used a normal distribution to calculate the CIs, even though we estimated the population standard deviation \(\sigma\) using our sample standard deviation \(s\). With our small sample size of 6, our CLT formula follows a t-distribution, not a normal.
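You can see why the normal-based intervals came up short by comparing the two critical values directly (a quick check, nothing more):

```r
# Normal critical value used above
qnorm(0.975)
#> [1] 1.959964

# t critical value with n - 1 = 5 degrees of freedom
qt(0.975, df = 5)
#> [1] 2.570582
```

With only five degrees of freedom the t critical value is roughly 2.57 rather than 1.96, so the intervals we should have built are about 30% wider than the ones we did.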
We'll re-run our confidence intervals, but this time we use `qt()`, the t-distribution quantile function, instead of `qnorm()`.

```r
conf_intervals <- repeated_samples |>
  mutate(
    upper_ci  = sample_mean + qt(p = 0.975, df = small_sample_size - 1) * (sample_sd / sqrt(small_sample_size)),
    lower_ci  = sample_mean - qt(p = 0.975, df = small_sample_size - 1) * (sample_sd / sqrt(small_sample_size)),
    within_ci = population_mean < upper_ci & population_mean > lower_ci
  ) |>
  group_by(distribution) |>
  summarise(percent_within_ci = mean(within_ci))
```

| Distribution | % Within CI |
|---|---:|
| binomial | 96.83% |
| normal | 95.05% |
| uniform | 93.88% |
| beta | 92.59% |
| chisquare | 88.91% |
| exponential | 88.47% |

That looks a bit better! The binomial is higher than 95% due to its discrete nature (it can't smoothly hit all possible values), and the normal and uniform distributions are close to our 95% target, but our heavily skewed beta, exponential, and chi-square still aren't up to scratch. With those heavily skewed population distributions, we need larger samples before the central limit theorem 'kicks in'. Let's re-run using the dataset from the start of the post, which used a sample size of 60.

```r
conf_intervals <-
  # Using our original sample size of 60
  sample_means |>
  left_join(population_data_stats, by = 'distribution') |>
  mutate(
    upper_ci  = sample_mean + qnorm(0.975) * (sample_sd / sqrt(sample_size)),
    lower_ci  = sample_mean - qnorm(0.975) * (sample_sd / sqrt(sample_size)),
    within_ci = population_mean < upper_ci & population_mean > lower_ci
  ) |>
  group_by(distribution) |>
  summarise(percent_within_ci = mean(within_ci))
```

| Distribution | % Within CI |
|---|---:|
| binomial | 94.86% |
| normal | 94.52% |
| beta | 94.40% |
| uniform | 94.36% |
| chisquare | 93.60% |
| exponential | 92.97% |

Better, but even with a sample size of 60 the skewed distributions are still slightly under the mark. This raises the question: how do the sample means of these skewed distributions behave as the sample size increases? What we can do is repeatedly sample, but increase the sample size each time. We'll increase the sample size in powers of two, from 1 up to 1024.

```r
increasing_sample_size <-
  # Sample sizes increasing in powers of 2 (1, 2, 4, 8, ...)
  map(
    2^(0:10),
    # Anonymous function
    \(y) {
      # 1,000 sample means
      map(1:1000, ~ take_random_sample_mean(population_data, sample_size = y)) |>
        # Bind them all together
        list_rbind() |>
        # Wide to long per distribution
        pivot_longer(
          everything(),
          names_to = c('distribution', '.value'),
          names_pattern = '(\\w+)_(\\w+_\\w+)'
        ) |>
        # Add in our population means and SDs
        left_join(population_data_stats, by = 'distribution') |>
        # Add sample size as a column
        mutate(sample_size = y) |>
        # Normalise to N(0, 1)
        mutate(clt = (sample_mean - population_mean) / (population_sd / sqrt(sample_size)))
    }
  ) |>
  list_rbind()
```

With this data we can create an animation of a Q-Q plot for the uniform versus exponential distributions, showing how the distribution of sample means changes as the sample size increases. You'll see the distribution of sample means from the uniform distribution approach a standard normal much faster than the exponential. It's very subjective, but I think the uniform starts looking reasonably good at a sample size of 8, while the exponential takes much longer to converge to a normal.
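I haven't shown the animation code above, but if you want to build something similar, a sketch using gganimate might look like the following (the package choice, the `group` aesthetic, and the styling are my assumptions, not the exact code behind the animation):

```r
library(gganimate)

# Animated Q-Q plot: uniform vs exponential, one frame per sample size
qq_animation <- increasing_sample_size |>
  filter(distribution %in% c('uniform', 'exponential')) |>
  # Group by sample size so the Q-Q stats are computed within each frame
  ggplot(aes(sample = clt, group = sample_size)) +
  stat_qq(alpha = 0.3) +
  stat_qq_line(colour = 'red') +
  facet_wrap(~ distribution) +
  transition_states(sample_size) +
  labs(
    title = 'Sample size: {closest_state}',
    x     = 'Standard normal quantiles',
    y     = 'Sample mean quantiles'
  )

animate(qq_animation, fps = 5)
```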
## Summary

The central limit theorem is something I've read about many times and had a reasonable grasp of, but I always wondered how it behaved with different population distributions. Running simulations and visualising that behaviour in different scenarios has given me (and hopefully you) a much clearer view of how it works, and, probably more importantly, where it doesn't work well.