Inferential Statistics
- Inferential Statistics – Definition, Types, Examples, Formulas
- Observational Studies and Experiments
- Sample and Population
- Sampling Bias
- Sampling Methods
- Research Study Design
- Population Distribution, Sample Distribution and Sampling Distribution
- Central Limit Theorem
- Point Estimates
- Confidence Intervals
- Introduction to Bootstrapping
- Bootstrap Confidence Interval
- Paired Samples
- Impact of Sample Size on Confidence Intervals
- Introduction to Hypothesis Testing
- Writing Hypotheses
- Hypotheses Test Examples
- Randomization Procedures
- p-values
- Type I and Type II Errors
- P-value Significance Level
- Issues with Multiple Testing
- Confidence Intervals and Hypothesis Testing
- Inference for One Sample
- Inference for Two Samples
- One-Way ANOVA
- Two-Way ANOVA
- Chi-Square Tests
Central Limit Theorem
What is the central limit theorem?
The Central Limit Theorem (CLT) is a fundamental concept in statistics and probability theory. It states that the sum or average of a large number of independent and identically distributed random variables (with finite mean and variance) will converge to a normal distribution, regardless of the underlying distribution of the individual variables.
In other words, the CLT asserts that, under certain conditions, the distribution of the sample mean or sum will be approximately normal, even if the distribution of the individual observations is not normal. This is why the normal distribution is often used to model the behavior of many real-world phenomena, even when the underlying process generating the data is not known to be normally distributed.
The importance of the CLT lies in its applications to statistical inference. It allows us to make inferences about the population based on a sample, as well as to estimate the sampling distribution of a statistic such as the sample mean or standard deviation. The CLT also provides a theoretical basis for hypothesis testing and confidence interval estimation.
Central limit theorem formula
The formula for the mean of the sampling distribution of the sample mean is:
E(Y̅) = μ
where Y̅ is the sample mean and μ is the population mean. This means that, on average, the sample means will equal the population mean, and any deviations from the population mean will be due to random sampling error.
The formula for the standard deviation of the sampling distribution of the sample mean is:
SD(Y̅) = σ / √n
where Y̅ is the sample mean, σ is the population standard deviation, and n is the sample size.
We can represent the sampling distribution of the mean using the following notation:
Y̅ ~ N(μ, σ/√n)
Where:
- Y̅: The sample mean, which is the mean of the values in a single sample of size n drawn from the population.
- ~ : This symbol means “is distributed as”. So, when we write Y̅ ~ N(μ, σ/√n), we mean that the sample mean Y̅ is distributed as a normal distribution with mean μ and standard deviation σ/√n.
- N: This symbol stands for the normal distribution, which is a continuous probability distribution that has a bell-shaped curve.
- μ: The population mean, which is the average value of the variable in the entire population.
- σ: The population standard deviation, which is a measure of the amount of variability or spread in the population.
- √: The square root symbol, which is used to indicate the square root of a number.
- n: The sample size, which is the number of observations in a single sample drawn from the population.
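As a minimal sketch of these quantities in Python (the values of μ, σ, and n below are made up for illustration):

```python
import math

# Hypothetical population parameters and sample size (illustrative values)
mu = 170.0      # population mean
sigma = 5.0     # population standard deviation
n = 100         # sample size

# Parameters of the sampling distribution of the mean: Y̅ ~ N(mu, sigma/√n)
mean_of_sample_means = mu
standard_error = sigma / math.sqrt(n)

print(mean_of_sample_means)  # 170.0
print(standard_error)        # 0.5
```

The center of the sampling distribution is μ itself; only the spread shrinks as n grows, at the rate 1/√n.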
The above notation tells us three things:
1. The shape of the distribution of the sample means is approximately normal: if we plot the sample means in a histogram, we get a bell-shaped curve.
2. The center of the sampling distribution equals the population mean μ.
3. The spread of the sampling distribution, measured by the standard error, equals the population standard deviation divided by the square root of the sample size (σ/√n). Sometimes we do not know the population standard deviation σ, or we do not have access to the whole population; in that case we estimate the standard error of the sampling distribution using the sample standard deviation S, i.e. SE = S/√n.
Conditions of the central limit theorem
The Central Limit Theorem holds under the following conditions:
1. The random variables being averaged or summed are independent and identically distributed (iid). In other words, the variables are drawn from the same distribution and are not affected by one another.
2. The variables have a finite mean and variance. This means that the distribution from which they are drawn has a finite mean and variance, which implies that the variables are not too spread out or too skewed.
3. The sample size is sufficiently large. The sample size needs to be large enough so that the sample mean or sum can be approximated by a normal distribution.
The exact minimum sample size needed to approximate the distribution with a normal distribution depends on the shape of the underlying distribution, but a rule of thumb is that a sample size of 30 or greater is often sufficient.
The distribution of sample statistics is nearly normal, centered at the population mean, and with a standard deviation equal to the population standard deviation divided by square root of the sample size.
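These three properties can be checked by simulation. A minimal sketch, drawing samples from a skewed (exponential) population with mean 1 and standard deviation 1 (illustrative choices):

```python
import math
import random
import statistics

random.seed(42)

# Simulate the sampling distribution of the mean for a skewed population:
# an exponential distribution with mean 1 (its standard deviation is also 1).
n = 30              # sample size (rule of thumb: n >= 30)
num_samples = 5000  # number of repeated samples

sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

center = statistics.fmean(sample_means)  # should be close to mu = 1
spread = statistics.stdev(sample_means)  # should be close to sigma/sqrt(n)
print(center, spread, 1 / math.sqrt(n))
```

Plotting a histogram of `sample_means` would show a bell shape even though the exponential population itself is strongly right-skewed.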
Conditions for the CLT – Summary
1. Independence:
- Sampled observations must be independent.
- Random sample/random assignment
- If sampling without replacement, then needs to be n < 10% of population.
If we take a very large portion of the population as our sample, it can be difficult to ensure that the sampled individuals are independent of one another. Suppose you are doing research on genetics and, out of a population of 10,000 people, you survey 5,000 from your city. It is then very difficult to ensure that your observations are independent, because you have included 50% of the total population. Although we like large samples, we want to keep the sample size small relative to the population; a good rule of thumb is that when sampling without replacement, n should not exceed 10% of the population.
2. Sample size/ skew:
- Either the population distribution is normal, or if the population distribution is skewed, the sample size is large (rule of thumb: n > 30)
Explanation of the central limit theorem
Frequently we are interested in the population mean μ and estimate it using the sample mean X̄, so we need to know about the sampling distribution of X̄.
Theory says that for random samples of size n from any population
μ_X̄ = μ and σ_X̄ = σ/√n. Here μ stands for the population mean and μ_X̄ for the mean of the sampling distribution of the sample means. The standard deviation of the sampling distribution is symbolized by σ_X̄ and is equal to σ divided by the square root of n; the subscript X̄ makes clear that we are talking about the standard deviation of the sampling distribution, in which the scores are sample means, or in other words, X̄s. σ stands for the standard deviation in the population, and n stands for the sample size.
According to the CLT, if n is sufficiently large (≥ 30), the sampling distribution of X̄ will also be approximately normal.
If you draw an infinite number of samples from a bell-shaped population distribution, the distribution of means from this infinite number of samples will be bell-shaped, and the mean of this distribution of sample means will be exactly the same as the population mean. We call this distribution the sampling distribution of the sample mean.
So, the central limit theorem says that, provided that the sample size is sufficiently large, the sampling distribution of sample mean X-bar has an approximately normal distribution. Even if the variable of interest is not normally distributed in the population! Isn’t that cool? No matter how a variable is distributed in the population, the sampling distribution of the sample mean is always approximately normal, as long as the sample size is large enough. As a guideline for ‘large enough’ a sample size of 30 or larger is often used as we discussed before.
Example of the Central Limit Theorem in Practice:
For example, imagine you wanted to know the average height of all the people in a city. It would be impractical and time-consuming to measure the height of every single person, so you might take a sample of 100 people and calculate their average height. Then you might take another sample of 100 people and calculate their average height, and so on.
According to the Central Limit Theorem, as the sample size increases, the distribution of the sample means will approach a normal distribution, regardless of the distribution of the original population. This means that the average height of the city’s population can be estimated using the means of multiple samples, and the estimated mean height will follow a normal distribution, making it possible to calculate confidence intervals and make statistical inferences.
Another example: roll 30 dice and calculate the average (sample mean) of the numbers that you get on each die. Now repeat this experiment 1,000 times, each time rolling 30 dice and computing a new sample mean. Plot a histogram of the 1,000 sample means you have obtained; this plot will look approximately normal.
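This dice experiment can be simulated directly; a minimal sketch:

```python
import random
import statistics

random.seed(0)

# Roll 30 dice and take the mean; repeat the experiment 1000 times.
sample_means = [
    statistics.fmean(random.randint(1, 6) for _ in range(30))
    for _ in range(1000)
]

# A single die has mean 3.5, so the 1000 sample means cluster around 3.5,
# and their histogram looks approximately normal.
mean_of_means = statistics.fmean(sample_means)
print(mean_of_means)
```

A histogram of `sample_means` (e.g. with matplotlib) would show the bell shape, even though a single die roll has a flat, uniform distribution.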
Example 2:
A manufacturer claims that a type of battery has a mean life of 54 months and a standard deviation of 6 months. A consumer group purchases a sample of 50 of these batteries and tests them. They find an average life of 52 months; what should they conclude?
If the manufacturer’s claim is true, then since n > 30 we can use the central limit theorem: the sampling distribution of the sample mean is approximately normal with mean 54 months and standard error 6/√50 ≈ 0.85 months. The observed mean of 52 months therefore lies z = (52 − 54)/0.85 ≈ −2.4 standard errors below the claimed mean, an outcome with probability below 1%.
So if the manufacturer’s claim is true, what the consumer group observed is very unlikely; a more plausible explanation is that the true value of μ is less than 54.
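A sketch of the consumer group’s check, computing the standard error and z-score under the manufacturer’s claim:

```python
import math
from statistics import NormalDist

# Manufacturer's claim: mu = 54 months, sigma = 6 months; sample of n = 50
mu, sigma, n = 54, 6, 50
observed_mean = 52

standard_error = sigma / math.sqrt(n)      # ≈ 0.85
z = (observed_mean - mu) / standard_error  # ≈ -2.36

# Probability of seeing a sample mean this low (or lower) if the claim holds
p = NormalDist().cdf(z)                    # ≈ 0.009
print(standard_error, z, p)
```

Since this probability is below 1%, the observed average of 52 months is hard to reconcile with the claimed mean of 54 months.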
Central limit theorem example 3
Suppose we want to estimate the average height of all students in a university. We randomly sample 100 students from the university and record their heights in centimeters. We find that the mean height in our sample is 170 cm, and the standard deviation is 5 cm.
Now, let’s assume that the population of heights is not normally distributed, but rather has a skewed distribution. Despite this, we can still apply the central limit theorem to estimate the mean height of all students in the university.
First, we check the conditions of the central limit theorem. We assume that the heights of the students are independent of each other, and the sample size of 100 is sufficiently large. We also assume that the population variance is finite, although we do not know the actual value of the population variance.
Next, we calculate the standard error of the mean, which is equal to the standard deviation of the population divided by the square root of the sample size:
Standard error of the mean = 5 / sqrt(100) = 0.5
Using the central limit theorem, we can say that the sampling distribution of the sample means will be approximately normal with a mean of 170 cm and a standard deviation of 0.5 cm. We can use this to calculate the probability of finding a sample mean within a certain range.
For example, we can calculate the probability that the mean height of a sample of 100 students falls between 169.5 and 170.5 cm:
Z-score for lower limit = (169.5 – 170) / 0.5 = -1
Z-score for upper limit = (170.5 – 170) / 0.5 = 1
Using a standard normal distribution table, we find that the probability of a z-score between -1 and 1 is approximately 0.68. In other words, about 68% of samples of this size will produce a mean within 0.5 cm of the population mean; equivalently, an interval of one standard error on either side of an observed sample mean is a 68% confidence interval for the true average height.
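This calculation can be reproduced with Python’s built-in `statistics.NormalDist`; a minimal sketch:

```python
from statistics import NormalDist

# Sampling distribution of the mean: N(170, 0.5) for samples of size 100
sampling_dist = NormalDist(mu=170, sigma=0.5)

# Probability that a sample mean falls between 169.5 and 170.5 cm
p = sampling_dist.cdf(170.5) - sampling_dist.cdf(169.5)
print(round(p, 4))  # 0.6827
```

The exact value, 0.6827, is the familiar "68%" from the empirical (68-95-99.7) rule for one standard deviation around the mean.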
In this way, the central limit theorem allows us to make statistical inferences about a population even when the underlying distribution is not normal.
When to Apply the Confidence Interval Formulas
Be careful: when can we use these formulas? In what situations are these confidence intervals applicable?
These approximate intervals above are good when n is large (because of the Central Limit Theorem), or when the observations y1, y2, …, yn are normal.
Sample size 30 or greater
When the sample size is 30 or more, we consider the sample size to be large, and by the Central Limit Theorem ȳ will be approximately normal even if the sample does not come from a normal distribution. Thus, when the sample size is 30 or more, there is no need to check whether the sample comes from a normal distribution. We can use the t-interval.
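A sketch of the t-interval computation for a hypothetical sample of size 30; the sample mean, sample standard deviation, and the table value t₀.₉₇₅,₂₉ ≈ 2.045 are illustrative assumptions:

```python
import math

# Hypothetical sample summary statistics (illustrative values)
n = 30
ybar = 170.0  # sample mean
s = 5.0       # sample standard deviation

# 95% critical value for the t-distribution with n - 1 = 29 degrees of
# freedom, taken from a t-table (the Python stdlib has no t quantiles)
t_crit = 2.045

margin = t_crit * s / math.sqrt(n)
print(ybar - margin, ybar + margin)  # ≈ (168.13, 171.87)
```

With larger n, t_crit approaches the normal value 1.96, so the t-interval and the normal-based interval nearly coincide for large samples.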
Sample size 8 to 29
When the sample size is 8 to 29, we would usually use a normal probability plot to see whether the data come from a normal distribution. If it does not violate the normal assumption then we can go ahead and use the t-interval.
Sample size 7 or less
However, when the sample size is 7 or less, a normal probability plot may fail to detect non-normality simply because there is not enough data. The examples here in these lessons and in the textbook typically use small sample sizes, and this might give the wrong impression: these small samples have been chosen for illustration purposes only. When you have a sample size of 5 you really do not have enough power to say the distribution is normal, and we use nonparametric methods instead of the t-interval.