Chi-Square Test of Independence

The chi-square (χ2) test of independence is a statistical test used to determine whether there is a significant association between two categorical variables. It is also known as the Pearson chi-square test or the contingency table test.

The chi-square (χ2) test of independence is a nonparametric hypothesis test: it does not require the variables to follow any particular distribution. You can use it to test whether two categorical variables are related to each other. The test computes expected counts for the cells of a two-way contingency table under the assumption that the two variables are independent (i.e., that the null hypothesis is true) and compares them with the observed counts.

Even if two variables are independent in the population, samples will vary because of random sampling variation. The chi-square test is used to determine whether there is evidence that the two variables are not independent in the population, using the same hypothesis-testing logic used for tests about one mean, one proportion, and so on.

What is the chi-square test of independence?

The chi-square (χ2) test of independence determines whether there is a significant association between two categorical variables: it tests whether the observed frequencies of the categories of one variable are independent of (i.e., not related to) the categories of the other variable.

The test is conducted by constructing a contingency table that shows the frequency of each combination of categories for the two variables. Then, the expected frequencies for each cell in the table are calculated based on the assumption that there is no association between the variables. The expected frequency for a cell is the product of the row total and the column total divided by the total sample size.

The chi-square test statistic is then calculated by taking, for each cell in the table, the squared difference between the observed and expected frequency divided by the expected frequency, and summing these values across all the cells. The degrees of freedom for the test are equal to (r – 1) x (c – 1), where r is the number of categories of one variable (rows) and c is the number of categories of the other variable (columns).

If the calculated chi-square value is larger than the critical value from the chi-square distribution table, then the null hypothesis of independence between the variables is rejected, and we conclude that there is a significant association between the variables. If the calculated chi-square value is smaller than the critical value, we fail to reject the null hypothesis of independence.
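
In practice these calculations are usually done with software. As a rough, minimal sketch (not part of the original text), the Python code below uses SciPy's chi2_contingency function on a small made-up table of observed counts to carry out the steps just described; the counts and variable layout are assumptions for illustration only.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical observed counts: rows are the categories of one variable,
    # columns are the categories of the other.
    observed = np.array([[30, 20],
                         [10, 40]])

    # chi2_contingency returns the test statistic, the p-value, the degrees of
    # freedom, and the table of expected counts under independence.
    # correction=False disables the Yates continuity correction so the result
    # matches the plain formula described above.
    chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

    print("Expected counts:\n", expected)
    print("Chi-square statistic:", round(chi2_stat, 3))
    print("Degrees of freedom:", dof)
    print("p-value:", round(p_value, 4))

If the p-value is below the chosen significance level, the null hypothesis of independence is rejected, exactly as described above.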

The chi-square test of independence is widely used in fields such as data science, the social sciences, marketing research, and healthcare to investigate the relationship between two categorical variables.

Contingency tables

A contingency table is a table that displays the frequency distribution of two categorical variables simultaneously. Each cell in the table represents a combination of values for the two variables, and the frequency or count of observations falling in each cell is displayed.

Contingency tables are also known as cross-tabulation tables, crosstabs, or two-way tables. They are commonly used in data analysis to summarize and compare the distribution of variables, especially when studying the association between two categorical variables.

For example, consider a survey of 200 people that asked two questions: “What is your gender?” and “Do you own a car?”. The resulting contingency table could be:

           Owns a Car   Does Not Own a Car   Total
Male           70               30            100
Female         50               50            100
Total         120               80            200

In this table, the rows correspond to gender (male or female) and the columns correspond to car ownership (own a car or do not own a car). The frequencies or counts of people who fall into each combination of gender and car ownership are displayed in the cells. For example, there are 70 males who own a car and 50 females who do not own a car.

Contingency tables can be analyzed using statistical methods such as the chi-square test of independence, which allows us to test whether there is a significant association between the variables.
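
For readers working in Python, the sketch below shows one common way to build such a table with the pandas library's crosstab function; the handful of survey responses used here are made up purely for illustration and are not the data from the example above.

    import pandas as pd

    # Hypothetical raw survey responses, one row per respondent.
    data = pd.DataFrame({
        "gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
        "owns_car": ["Yes", "No", "Yes", "No", "Yes", "No"],
    })

    # pd.crosstab counts how many respondents fall into each combination of
    # categories; margins=True adds the row and column totals.
    table = pd.crosstab(data["gender"], data["owns_car"], margins=True)
    print(table)

The resulting table can then be passed to a chi-square test of independence.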

What is the difference between the chi-square (χ2) test of independence and the chi-square goodness of fit test?

The chi-square (χ2) test of independence and the chi-square goodness of fit test are both statistical tests that use the chi-square distribution to determine whether there is a significant difference between observed and expected frequencies. However, they are used for different purposes.

The chi-square test of independence is used to determine whether there is a significant association between two categorical variables. It involves comparing the observed frequencies of two categorical variables with the expected frequencies assuming that there is no association between them.

On the other hand, the chi-square goodness of fit test is used to determine whether the observed frequencies in a single categorical variable differ significantly from the expected frequencies. It involves comparing the observed frequencies with the expected frequencies derived from a hypothesized distribution, such as a uniform or Poisson distribution (or, for binned continuous data, a normal distribution).

To conduct a chi-square goodness of fit test, we first specify the null hypothesis that the observed data follow a particular distribution. We then calculate the expected frequencies based on that distribution and compare them with the observed frequencies using the chi-square test statistic.

In summary, the chi-square test of independence is used to study the relationship between two categorical variables, while the chi-square goodness of fit test is used to study the fit between observed and expected frequencies for a single categorical variable.
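
To make the contrast concrete, here is a minimal goodness-of-fit sketch in Python using SciPy's chisquare function; the die-rolling counts are invented for illustration, and the null hypothesis is that all six faces are equally likely.

    from scipy.stats import chisquare

    # Hypothetical observed counts for the six faces of a die over 120 rolls.
    observed = [25, 17, 15, 23, 24, 16]

    # Under the null hypothesis of a fair die, each face is expected 120 / 6 = 20 times.
    expected = [20, 20, 20, 20, 20, 20]

    chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)
    print("Chi-square statistic:", round(chi2_stat, 3))
    print("p-value:", round(p_value, 4))

Note that this tests a single categorical variable against hypothesized proportions, whereas the test of independence compares two variables in a contingency table.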

Chi-square test of independence hypotheses

For the chi-square (χ2) test of independence, the null and alternative hypotheses are as follows:

Null Hypothesis (H0): There is no association between the two categorical variables. The observed frequencies in each cell of the contingency table are equal to the expected frequencies assuming no association between the variables.

Alternative Hypothesis (HA): There is a significant association between the two categorical variables. The observed frequencies in at least one cell of the contingency table are different from the expected frequencies, indicating an association between the variables.

To test these hypotheses, we calculate the chi-square test statistic, which measures the difference between the observed frequencies and the expected frequencies under the null hypothesis. If the chi-square statistic is large enough, we reject the null hypothesis and conclude that there is a significant association between the variables. If the chi-square statistic is not large enough, we fail to reject the null hypothesis and conclude that there is not enough evidence to support the presence of an association.

The level of significance for the test is typically set at 0.05, meaning that we are willing to accept a 5% chance of making a Type I error (rejecting the null hypothesis when it is actually true). The degrees of freedom for the test are equal to (r – 1) x (c – 1), where r is the number of rows and c is the number of columns in the contingency table.

In summary, the chi-square test of independence is used to test whether two categorical variables are associated with each other. The null hypothesis assumes no association between the variables, while the alternative hypothesis assumes a significant association. The test compares the observed frequencies in the contingency table to the expected frequencies under the null hypothesis, and the resulting chi-square statistic is used to make a decision about the hypotheses.
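
The decision step can be carried out with either a critical value or a p-value. The sketch below, which assumes an already computed chi-square statistic and degrees of freedom (the numbers are placeholders, not results from this article), shows both approaches using SciPy's chi2 distribution.

    from scipy.stats import chi2

    chi2_stat = 9.21   # placeholder value for the computed test statistic
    dof = 2            # placeholder degrees of freedom, (r - 1) x (c - 1)
    alpha = 0.05       # significance level

    # Critical-value approach: reject H0 if the statistic exceeds the critical value.
    critical_value = chi2.ppf(1 - alpha, dof)

    # p-value approach: reject H0 if the p-value is below alpha.
    p_value = chi2.sf(chi2_stat, dof)   # sf is the upper-tail probability, 1 - cdf

    print("Critical value:", round(critical_value, 3))
    print("p-value:", round(p_value, 4))
    print("Reject H0:", chi2_stat > critical_value)

Both approaches always lead to the same decision for a given significance level.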

When to use the chi-square test of independence

It is appropriate to use this test when the data consists of two categorical variables, each with two or more levels or categories.

Here are some common scenarios when the chi-square test of independence can be used:

1. Market research: to determine whether there is a significant association between customer satisfaction (e.g., satisfied, neutral, dissatisfied) and brand loyalty (e.g., loyal, neutral, disloyal).

2. Healthcare research: to investigate whether there is a significant association between smoking status (e.g., current smoker, former smoker, non-smoker) and the incidence of lung cancer (e.g., diagnosed, not diagnosed).

3. Social sciences: to study whether there is a significant association between gender (e.g., male, female) and voting behavior (e.g., voted, did not vote) in an election.

4. Educational research: to examine whether there is a significant association between teaching method (e.g., traditional lecture, online video) and exam performance (e.g., pass, fail).

In summary, the chi-square test of independence is used when we want to investigate whether two categorical variables are associated with each other. It is an appropriate test to use when the data consists of two categorical variables, each with two or more categories or levels.

How to calculate the test statistic (formula)

The formula to calculate the chi-square (χ2) test statistic for a contingency table with r rows and c columns is:

χ2 = ∑ (Oij – Eij)² / Eij

where:

  • Oij is the observed frequency in cell i,j of the contingency table.
  • Eij is the expected frequency in cell i,j, calculated as Eij = (ri x cj) / n, where ri is the total frequency in row i, cj is the total frequency in column j, and n is the total sample size.
  • ∑ means to sum over all the cells in the contingency table.

The chi-square test statistic measures the difference between the observed frequencies and the expected frequencies under the null hypothesis of no association between the two categorical variables. A large chi-square value indicates that the observed frequencies are significantly different from the expected frequencies, providing evidence against the null hypothesis.

The degrees of freedom for the test are calculated as (r – 1) x (c – 1), where r is the number of rows and c is the number of columns in the contingency table. The p-value for the test can be obtained from a chi-square distribution table or by using statistical software. If the p-value is less than the level of significance (typically 0.05), we reject the null hypothesis and conclude that there is a significant association between the two categorical variables.
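
The formula translates directly into a few lines of NumPy. The sketch below uses an arbitrary made-up table (not data from this article) to compute the expected counts from the row and column totals, the chi-square statistic, and the degrees of freedom.

    import numpy as np

    # Arbitrary observed contingency table with r rows and c columns, for illustration only.
    O = np.array([[20, 30, 10],
                  [30, 15, 25]], dtype=float)

    row_totals = O.sum(axis=1)   # r_i, the total frequency in each row
    col_totals = O.sum(axis=0)   # c_j, the total frequency in each column
    n = O.sum()                  # total sample size

    # Expected counts under independence: E_ij = (r_i x c_j) / n
    E = np.outer(row_totals, col_totals) / n

    # Chi-square statistic: sum over all cells of (O_ij - E_ij)^2 / E_ij
    chi2_stat = ((O - E) ** 2 / E).sum()

    # Degrees of freedom: (r - 1) x (c - 1)
    dof = (O.shape[0] - 1) * (O.shape[1] - 1)

    print("Expected counts:\n", E)
    print("Chi-square statistic:", round(chi2_stat, 3))
    print("Degrees of freedom:", dof)

The p-value can then be obtained from the chi-square distribution with dof degrees of freedom (for example with scipy.stats.chi2.sf).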

Chi-square test of independence procedure

Suppose we are interested in examining the relationship between gender and political party affiliation. We have collected data from a sample of 450 individuals and recorded their gender (male or female) and political party affiliation (Republican, Democrat, or Independent). Here is the contingency table for the data:

            Republican   Democrat   Independent   Total
Male            60          120          70        250
Female          90           80          30        200
Total          150          200         100        450

Here are the general steps to perform the chi-square (χ2) test of independence:

1. Define the null and alternative hypotheses:

State the null hypothesis, which assumes no association between the two categorical variables, and the alternative hypothesis, which assumes a significant association between the variables.

The null hypothesis is that there is no significant association between gender and political party affiliation. The alternative hypothesis is that there is a significant association between gender and political party affiliation.

H0: There is no association between gender and political party affiliation

Ha: There is an association between gender and political party affiliation

2. Collect data:

Collect data on the two categorical variables of interest. Each variable should have at least two levels or categories. Here we have collected data on gender and political party affiliation from a sample of 450 individuals.

3. Create a contingency table:

Create a contingency table to display the frequencies of the two categorical variables. The table should have r rows and c columns, where r is the number of levels or categories for variable 1 and c is the number of levels or categories for variable 2.

We have created a contingency table above showing the frequencies of gender and political party affiliation.

4. Calculate the expected frequencies:

Calculate the expected frequencies for each cell of the contingency table based on the null hypothesis of no association between the variables.

We need to calculate the expected frequencies for each cell of the contingency table based on the null hypothesis. We do this by multiplying the row total and column total for each cell and dividing by the total sample size. For example, the expected frequency for the cell in the first row and first column is (250 x 150) / 450 ≈ 83.33.

            Republican   Democrat   Independent   Total
Male           83.33       111.11       55.56      250
Female         66.67        88.89       44.44      200
Total         150.00       200.00      100.00      450
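
If you prefer to let software do this arithmetic, the short Python sketch below (assuming NumPy is available) reproduces the expected counts for this table from its row and column totals.

    import numpy as np

    observed = np.array([[60, 120, 70],    # Male:   Republican, Democrat, Independent
                         [90,  80, 30]])   # Female: Republican, Democrat, Independent

    row_totals = observed.sum(axis=1)      # [250, 200]
    col_totals = observed.sum(axis=0)      # [150, 200, 100]
    n = observed.sum()                     # 450

    # Expected count in each cell = (row total x column total) / n
    expected = np.outer(row_totals, col_totals) / n
    print(np.round(expected, 2))           # approximately [[83.33, 111.11, 55.56], [66.67, 88.89, 44.44]]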

5. Calculate the chi-square test statistic:

Calculate the chi-square test statistic using the formula: χ2 = ∑ (Oij – Eij)² / Eij, where Oij is the observed frequency in cell i,j, and Eij is the expected frequency in cell i,j.

Applying this formula to the observed and expected frequencies gives the cell-by-cell contributions shown in the table below:

          Republican                    Democrat                         Independent                    Row total
Male      (60 - 83.33)²/83.33 ≈ 6.53    (120 - 111.11)²/111.11 ≈ 0.71    (70 - 55.56)²/55.56 ≈ 3.76      11.00
Female    (90 - 66.67)²/66.67 ≈ 8.17    (80 - 88.89)²/88.89 ≈ 0.89       (30 - 44.44)²/44.44 ≈ 4.69      13.75

Summing the values in the table, we get:

χ2 = 24.75

6. Determine the degrees of freedom and p-value:

Here, the degrees of freedom are calculated as (r – 1) x (c – 1), where r is the number of rows and c is the number of columns in the contingency table. In this case, we have (2-1) x (3-1) = 2 degrees of freedom.

Using a chi-square distribution table or statistical software, we find that the p-value associated with a chi-square statistic of 24.75 and 2 degrees of freedom is less than 0.001 (approximately 0.000004).
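
These hand calculations can be checked with statistical software. As a sketch, assuming Python with SciPy is available, the code below runs the same test on the observed table and reports the statistic, degrees of freedom, and p-value.

    from scipy.stats import chi2_contingency

    observed = [[60, 120, 70],   # Male:   Republican, Democrat, Independent
                [90,  80, 30]]   # Female: Republican, Democrat, Independent

    # correction=False turns off the Yates continuity correction (it is only
    # applied to 2 x 2 tables in any case, and this table is 2 x 3).
    chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

    print("Chi-square statistic:", round(chi2_stat, 2))   # about 24.75
    print("Degrees of freedom:", dof)                     # 2
    print("p-value:", p_value)                            # far below 0.05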

7. Make a decision and interpret the results:

If the p-value is less than the level of significance (typically 0.05), reject the null hypothesis and conclude that there is a significant association between the two categorical variables. If the p-value is greater than or equal to the level of significance, fail to reject the null hypothesis and conclude that there is not enough evidence to support the presence of an association.

Finally, we compare the p-value to our level of significance (α) to make a decision about whether to reject or fail to reject the null hypothesis. Let’s say we choose a significance level of 0.05. Since the p-value (less than 0.001) is below the significance level, we reject the null hypothesis.

8. Report the results:

Report the chi-square test statistic, degrees of freedom, p-value, and any conclusions drawn from the test.

This means that there is a statistically significant association between gender and political party affiliation. Specifically, in this sample males were more likely than expected under independence to identify as Democrats or Independents, while females were more likely than expected to identify as Republicans.

Overall, the chi-square test of independence is a useful tool for investigating the relationship between two categorical variables. By comparing the observed and expected frequencies, we can determine whether there is evidence to suggest that the variables are associated with one another.

In summary, the chi-square test of independence involves defining hypotheses, collecting data, creating a contingency table, calculating expected frequencies, calculating the chi-square test statistic, determining the degrees of freedom and p-value, making a decision, and reporting the results.

Related topics: Chi-Square Goodness of Fit Test; Inference for Two Independent Proportions