Inferential Statistics
- Inferential Statistics – Definition, Types, Examples, Formulas
- Observational Studies and Experiments
- Sample and Population
- Sampling Bias
- Sampling Methods
- Research Study Design
- Population Distribution, Sample Distribution and Sampling Distribution
- Central Limit Theorem
- Point Estimates
- Confidence Intervals
- Introduction to Bootstrapping
- Bootstrap Confidence Interval
- Paired Samples
- Impact of Sample Size on Confidence Intervals
- Introduction to Hypothesis Testing
- Writing Hypotheses
- Hypotheses Test Examples
- Randomization Procedures
- p-values
- Type I and Type II Errors
- P-value Significance Level
- Issues with Multiple Testing
- Confidence Intervals and Hypothesis Testing
- Inference for One Sample
- Inference for Two Samples
- One-Way ANOVA
- Two-Way ANOVA
- Chi-Square Tests
Chi-Square Tests
Chi-Square (Χ²) Tests – Types, Formula and Examples
The chi-square test is a statistical method used to test the relationship between two categorical variables. It is often used to analyze data from surveys, experiments, and other studies in which the variables of interest are categorical.
The chi-square test involves calculating a test statistic that measures the difference between the observed frequencies of the categories in the data and the frequencies that would be expected if the two variables were independent. The test statistic is then compared to a critical value from the chi-square distribution with degrees of freedom equal to (r – 1) x (c – 1), where r is the number of rows in the contingency table and c is the number of columns.
If the calculated test statistic is greater than the critical value, there is evidence to suggest that the two variables are related. Conversely, if the calculated test statistic is less than the critical value, there is insufficient evidence to conclude that the variables are related (we fail to reject the null hypothesis of independence).
There are different types of chi-square tests, including the chi-square goodness-of-fit test, the chi-square test for independence, and the chi-square test for homogeneity. Each of these tests is designed to answer a different research question, but they all use the same underlying chi-square distribution to determine statistical significance.
Chi-square tests have some assumptions that need to be met, including the sample size being sufficiently large, the expected frequencies in each category being greater than or equal to 5, and the categories being mutually exclusive and exhaustive.
Overall, chi-square tests are a useful tool for analyzing categorical data and can provide important insights into the relationship between two variables.
In this tutorial we will see how to compare the proportions of more than two independent groups and how to test for a relationship between two categorical variables.
- Categorical
- Names or labels (i.e., categories) with no logical order or with a logical order but inconsistent differences between groups, also known as qualitative.
- Quantitative
- Numerical values with magnitudes that can be placed in a meaningful order with consistent intervals, also known as numerical.
In this lesson we will be examining methods for analyzing categorical variables.
What is Pearson’s chi-square (Χ²)?
Pearson’s chi-square (Χ²) is a statistical test used to determine whether two categorical variables are associated. It is named after its developer, Karl Pearson, a British mathematician and statistician.
In this test, the observed frequencies of two variables are compared with the expected frequencies to see whether they are significantly different. The expected frequencies are calculated assuming that there is no association between the variables. The degree of difference between the observed and expected frequencies is measured by the chi-square statistic (Χ²).
The formula for calculating the chi-square statistic is:
Χ2 = Σ (O – E)² / E
- where O is the observed frequency,
- E is the expected frequency,
- and Σ represents the sum of the values for all possible categories.
The chi-square statistic is compared to a critical value from a chi-square distribution. The degrees of freedom equal the number of categories minus one for a goodness-of-fit test, or (r – 1) × (c – 1) for a test of independence on an r × c contingency table. If the calculated chi-square value exceeds the critical value, the null hypothesis (i.e., there is no association between the variables) is rejected, and it is concluded that there is a significant association between the variables.
The chi-square test is widely used in many fields, such as biology, social sciences, marketing research, and epidemiology, to investigate the relationship between two categorical variables.
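The formula above can be translated directly into code. A minimal sketch in Python (the function name `chi_square_stat` and the spinner data are illustrative, not from a real study):

```python
def chi_square_stat(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Toy goodness-of-fit example: 60 spins of a three-sided spinner,
# expected to land on each side 20 times if the spinner is fair.
observed = [18, 22, 20]
expected = [20, 20, 20]
print(round(chi_square_stat(observed, expected), 2))  # 0.4
```

With 3 – 1 = 2 degrees of freedom, a statistic of 0.4 is far below the 0.05 critical value of 5.991, so these counts give no reason to doubt that the spinner is fair.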
What is the difference between Pearson’s chi-square (Χ²) and chi-square?
In practice, there is no difference: when people say “chi-square test,” they almost always mean Pearson’s chi-square test. Pearson’s chi-square is not a single test but a family of tests – including the goodness-of-fit test and the test of independence – that all use the same statistic and the same chi-square distribution.
The tests differ in where the expected frequencies come from. In a goodness-of-fit test, they come from a hypothesized distribution, such as equal proportions across categories or known population proportions. In a test of independence, they are calculated from the row and column totals of a contingency table under the assumption that the two variables are independent.
In every case, the chi-square statistic is calculated by summing the squared differences between the observed and expected frequencies, each divided by the expected frequency.
Strictly speaking, other test statistics also follow a chi-square distribution (for example, the likelihood-ratio G statistic), so “chi-square test” is occasionally used more broadly; unless otherwise specified, however, it refers to Pearson’s test.
When to use a chi-square test?
A chi-square test is typically used when you have two categorical variables and want to determine if there is a significant association or relationship between them. More specifically, the chi-square test can be used to:
- Test for independence: You can use a chi-square test to determine whether two categorical variables are independent of each other. For example, you might want to know if there is a relationship between gender and voting preference in a political election.
- Test for goodness of fit: You can use a chi-square test to determine whether a set of observed data fits a particular theoretical or expected distribution. For example, you might want to know whether the number of people with a certain blood type in a sample matches the expected proportions based on population data.
- Compare proportions: You can use a chi-square test to compare the proportions of observations in two or more categories. For example, you might want to know whether there is a difference in the proportion of men and women who prefer a particular brand of soda.
It is important to note that the chi-square test assumes that the data are categorical and that the observations are independent of each other. Additionally, the test is more robust when the sample size is large and the expected cell frequencies are not too small. If these assumptions are violated or the sample size is small, alternative statistical tests may be more appropriate.
Types of chi-square tests
The two types of Pearson’s chi-square tests are:
- Chi-square goodness-of-fit test
- Chi-square test of independence
Mathematically, these are actually the same test. However, we often think of them as different tests because they’re used for different purposes.
Here is a chi-square test worked through with an example.
Suppose you are interested in whether there is a relationship between a person’s level of education and their political party affiliation. You conduct a survey of 1,080 individuals, asking them to report their highest level of education completed (categorical variable with categories: high school, some college, college degree, and graduate degree) and their political party affiliation (categorical variable with categories: Democrat, Republican, Independent, Other).
You can use a chi-square test to determine whether there is a significant association between these two variables. Here are the steps you would follow:
- Set up hypotheses: The null hypothesis would be that there is no association between level of education and political party affiliation, while the alternative hypothesis would be that there is an association.
- Determine expected frequencies: Calculate the expected frequencies for each cell assuming that there is no association between the two variables. For example, if 30% of the sample are Democrats and 20% of the sample hold a college degree, then under independence you would expect 0.3 × 0.2 = 6% of the sample to be Democrats with a college degree.
- Calculate the chi-square statistic: The chi-square statistic is calculated as the sum of the squared differences between the observed and expected frequencies, divided by the expected frequencies. The formula is: Χ² = Σ [(Observed frequency – Expected frequency)² / Expected frequency]
- Determine the degrees of freedom: The degrees of freedom for a chi-square test is calculated as (r – 1) * (c – 1), where r is the number of rows and c is the number of columns in the contingency table.
- Determine the p-value: The p-value is the probability of observing a chi-square statistic as extreme as the one calculated, assuming the null hypothesis is true. You can look up the p-value in a chi-square distribution table using the degrees of freedom and chi-square statistic.
- Make a decision: If the p-value is less than the level of significance (e.g., 0.05), you reject the null hypothesis and conclude that there is a significant association between the two variables. If the p-value is greater than the level of significance, you fail to reject the null hypothesis.
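The steps above can be sketched end to end in Python for a small, hypothetical 2×2 table (all counts are made up; 3.841 is the standard critical value for df = 1 at α = 0.05):

```python
# Illustrative 2x2 contingency table (hypothetical data)
observed = [[20, 30],
            [30, 20]]

# Step 2: expected frequencies from row/column totals, assuming independence
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Step 3: chi-square statistic, summed over every cell
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))

# Step 4: degrees of freedom = (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Steps 5-6: compare to the critical value for df = 1, alpha = 0.05
critical_value = 3.841  # from a chi-square table
print(chi2, df, chi2 > critical_value)  # 4.0 1 True
```

In practice, `scipy.stats.chi2_contingency` performs steps 2–5 in one call and returns a p-value directly; note that for 2×2 tables it applies Yates’ continuity correction by default, so its statistic will differ slightly from the uncorrected one computed here.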
As an example, suppose the observed frequencies for the four education levels and four political party affiliations are as follows:
| | Democrat | Republican | Independent | Other |
|---|---|---|---|---|
| High school | 80 | 60 | 70 | 40 |
| Some college | 70 | 80 | 100 | 50 |
| College degree | 50 | 90 | 110 | 30 |
| Graduate degree | 40 | 100 | 80 | 30 |
You can calculate the expected frequencies assuming no association between the two variables, and then use those values to calculate the chi-square statistic. First, add the row and column totals to the observed table:

| | Democrat | Republican | Independent | Other | Row total |
|---|---|---|---|---|---|
| High school | 80 | 60 | 70 | 40 | 250 |
| Some college | 70 | 80 | 100 | 50 | 300 |
| College degree | 50 | 90 | 110 | 30 | 280 |
| Graduate degree | 40 | 100 | 80 | 30 | 250 |
| Column total | 240 | 330 | 360 | 150 | 1080 |
To calculate the expected frequency for each cell, we use the formula:
Expected frequency = (row total x column total) / grand total
Expected frequency for High school/Democrat = (250 x 240) / 1080 = 55.56
Expected frequency for High school/Republican = (250 x 330) / 1080 = 76.39
Expected frequency for High school/Independent = (250 x 360) / 1080 = 83.33 and so on.
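The same formula can be applied to every cell at once; a short Python sketch using the row and column totals above:

```python
row_totals = [250, 300, 280, 250]   # High school ... Graduate degree
col_totals = [240, 330, 360, 150]   # Democrat ... Other
grand_total = 1080

# Expected frequency for each cell: (row total * column total) / grand total
expected = [[round(r * c / grand_total, 2) for c in col_totals]
            for r in row_totals]

print(expected[0])  # High school row: [55.56, 76.39, 83.33, 34.72]
```

Because the High school and Graduate degree rows have the same total (250), their expected frequencies are identical.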
| | Democrat | Republican | Independent | Other |
|---|---|---|---|---|
| High school | 55.56 | 76.39 | 83.33 | 34.72 |
| Some college | 66.67 | 91.67 | 100.00 | 41.67 |
| College degree | 62.22 | 85.56 | 93.33 | 38.89 |
| Graduate degree | 55.56 | 76.39 | 83.33 | 34.72 |
Then, the chi-square statistic can be calculated as:
Chi-square = [(80 – 55.56)^2/55.56] + [(60 – 76.39)^2/76.39] + [(70 – 83.33)^2/83.33] + [(40 – 34.72)^2/34.72]
+ [(70 – 66.67)^2/66.67] + [(80 – 91.67)^2/91.67] + [(100 – 100)^2/100] + [(50 – 41.67)^2/41.67]
+ [(50 – 62.22)^2/62.22] + [(90 – 85.56)^2/85.56] + [(110 – 93.33)^2/93.33] + [(30 – 38.89)^2/38.89]
+ [(40 – 55.56)^2/55.56] + [(100 – 76.39)^2/76.39] + [(80 – 83.33)^2/83.33] + [(30 – 34.72)^2/34.72]
Chi-square ≈ 40.59

The degrees of freedom are df = (4 – 1) × (4 – 1) = 9, so we compare the statistic to the critical value of the chi-square distribution with 9 degrees of freedom at a significance level of 0.05. Using a chi-square distribution table or calculator, we find the critical value to be 16.919.

Since our calculated chi-square value (40.59) is greater than the critical value (16.919), we reject the null hypothesis and conclude that there is a statistically significant association between political affiliation and education level. In other words, we can conclude that education level and political affiliation are not independent of each other.
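The whole calculation can be checked in a few lines of Python, recomputing the expected frequencies and the statistic directly from the observed table:

```python
# Observed counts: rows = education level, columns = party affiliation
observed = [[80, 60, 70, 40],
            [70, 80, 100, 50],
            [50, 90, 110, 30],
            [40, 100, 80, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency per cell under independence, then Pearson's statistic
chi2 = sum((observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i in range(4) for j in range(4))

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi2, 2), df)  # 40.59 9
```

The same result (plus a p-value) comes from `scipy.stats.chi2_contingency(observed)` if SciPy is available.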