Range, Interquartile Range and Box Plot
Range
The range is a statistical measure that is calculated by subtracting the minimum value of a dataset from the maximum value. It is a simple measure of variability, but it is sensitive to outliers, as just one extreme value can greatly affect the range.
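A minimal Python sketch of this definition follows. The function name `value_range` and the sample values are hypothetical, chosen only to show how a single extreme value can change the range.

```python
def value_range(values):
    """Return the range: maximum value minus minimum value."""
    return max(values) - min(values)


# Hypothetical sample values, used only for illustration.
scores = [12, 14, 15, 15, 16, 18]
scores_with_outlier = scores + [60]  # the same data plus one extreme value

range_without = value_range(scores)             # 18 - 12
range_with = value_range(scores_with_outlier)   # 60 - 12

print("Range without the extreme value:", range_without)
print("Range with the extreme value:", range_with)
```

In this sketch, adding one value moves the range from 6 to 48, even though the other six values are unchanged.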
Range Example
Consider an example that compares two teams.
Both teams have the same measures of central tendency: mode = 14.1, median = 15 and mean = 15.
This shows that adequately describing a distribution sometimes requires more information than the measures of central tendency provide.
This is where measures of variability come into the picture. They include:
- Range
- Interquartile range
- Box plot, which gives a good indication of how the values in a distribution are spread out
The simplest measure of variability is the range. It is the difference between the highest and the lowest value.
For the example above, the ranges are:
Range (Team1) = 19.3 - 10.8 = 8.5
Range (Team2) = 27.7 - 0 = 27.7
Because the range takes only the two extreme values into account, it may not give a good picture of the variability. In that case, you can use another measure of variability, the interquartile range (IQR).
Interquartile Ranges & Outliers
The interquartile range (IQR) is a measure of statistical dispersion that is based on dividing a data set into quartiles. Specifically, it is the difference between the upper quartile (Q3) and the lower quartile (Q1) of a data set.
To calculate the IQR, one must first arrange the data in order from lowest to highest. Then, the median (Q2) of the data set is found, and the lower quartile (Q1) is the median of the lower half of the data set (i.e., the data points below the median), while the upper quartile (Q3) is the median of the upper half of the data set (i.e., the data points above the median). Finally, the IQR is calculated as the difference between Q3 and Q1.
The IQR is often used as a measure of variability or spread in a data set, and is considered a robust statistic since it is less sensitive to outliers or extreme values than the range or standard deviation. It is also commonly used in box plots to visualize the distribution of a data set.
Suppose you are comparing two groups and have already calculated the central tendency of your data (mean, median and mode) for both. It can happen that the mean, median and mode are the same for both groups, in which case a measure of variability is needed to tell the groups apart.
Interquartile Range (IQR)
The interquartile range gives another measure of variability. It is a better measure of dispersion than the range because it leaves out the extreme values. It is based on quartiles, the three cut points that divide the ordered distribution into four equal parts: 25% of the values lie at or below the 1st quartile (Q1), 50% lie at or below the 2nd quartile (Q2), and 75% lie at or below the 3rd quartile (Q3).
The 2nd quartile (Q2) divides the distribution into two equal parts of 50% each, so it is the same as the median.
The interquartile range is the distance between the third and the first quartile. In other words, IQR equals Q3 minus Q1:
IQR = Q3 - Q1
How to Calculate Interquartile Range (IQR)
Step 1: Order the values from low to high.
Step 2: Find the median, in other words Q2.
Step 3: Find Q1 by taking the median of the values to the left of Q2.
Step 4: Similarly, find Q3 by taking the median of the values to the right of Q2.
Step 5: Subtract Q1 from Q3 to get the IQR.
IQR Calculation Example
Consider the following worked example to get a clear idea.
Consider the following numbers: 1, 3, 4, 5, 5, 6, 7, 11. Q1 is the middle value in the first half of the data set. Since there are an even number of data points in the first half of the data set, the middle value is the average of the two middle values; that is, Q1 = (3 + 4)/2 or Q1 = 3.5. Q3 is the middle value in the second half of the data set. Again, since the second half of the data set has an even number of observations, the middle value is the average of the two middle values; that is, Q3 = (6 + 7)/2 or Q3 = 6.5. The interquartile range is Q3 minus Q1, so IQR = 6.5 – 3.5 = 3.
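The calculation above can be sketched in Python using only the standard library. The function name `quartiles` is hypothetical. The sketch assumes the median-of-halves method described in this article and, for data sets with an odd number of values, leaves the median itself out of both halves (one common convention). Statistical packages often use interpolation instead, so they may report slightly different quartiles for the same data.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumption: with an odd number of values, the middle value is
    excluded from both the lower and the upper half.
    """
    ordered = sorted(values)              # Step 1: order from low to high
    n = len(ordered)
    q2 = median(ordered)                  # Step 2: the median (Q2)
    lower_half = ordered[: n // 2]        # values to the left of Q2
    upper_half = ordered[(n + 1) // 2:]   # values to the right of Q2
    q1 = median(lower_half)               # Step 3: median of the lower half
    q3 = median(upper_half)               # Step 4: median of the upper half
    return q1, q2, q3


data = [1, 3, 4, 5, 5, 6, 7, 11]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1                             # Step 5: IQR = Q3 - Q1

print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "IQR =", iqr)
```

For the numbers in this example, the sketch reproduces Q1 = 3.5, Q3 = 6.5 and IQR = 3.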
Advantage of IQR
- The main advantage of the IQR is that it is largely unaffected by outliers, because it doesn’t take into account observations below Q1 or above Q3.
- It might still be useful to look for possible outliers in your study.
- As a rule of thumb, observations can be qualified as outliers when they lie more than 1.5 × IQR below the first quartile or more than 1.5 × IQR above the third quartile.
A value is an outlier if value < Q1 - 1.5 * IQR OR
value > Q3 + 1.5 * IQR
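As a rough sketch, the rule of thumb can be applied in Python as follows. The function names `iqr_fences` and `find_outliers` are hypothetical, the quartiles are computed with the median-of-halves method used in this article, and the second data set is a made-up variation of the earlier example with one larger value.

```python
from statistics import median


def iqr_fences(values, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])         # median of the lower half
    q3 = median(ordered[(n + 1) // 2:])    # median of the upper half
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that lie strictly outside the fences."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]


data = [1, 3, 4, 5, 5, 6, 7, 11]           # the numbers from the IQR example
low, high = iqr_fences(data)
outliers = find_outliers(data)

# Hypothetical variation: the largest value is replaced by 20.
data_with_extreme = [1, 3, 4, 5, 5, 6, 7, 20]
outliers_with_extreme = find_outliers(data_with_extreme)

print("Fences:", low, high)
print("Outliers in the example data:", outliers)
print("Outliers in the variation:", outliers_with_extreme)
```

With the example numbers, the fences are -1 and 11, so the value 11 sits exactly on the upper fence and is not flagged. In the variation, 20 lies above the fence and is flagged.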
What is Box Plot?
Box plots are graphical representations that are commonly used to display the distribution of a dataset and its summary statistics. Box plots display the median, quartiles, range, and outliers of a dataset. The central box represents the IQR, with the median shown as a line inside the box. The lower and upper whiskers represent the minimum and maximum values of the dataset that are not considered outliers, and any points beyond the whiskers are plotted as individual points, representing outliers. Box plots are useful for quickly visualizing the central tendency and variability of a dataset and identifying any extreme values.
So, a box plot is the graph mainly used to describe the center and variability of your data.
It is also useful for detecting outliers in the data.
Observe carefully how the IQR example above looks when it is plotted as a box plot.
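The quantities that a box plot draws can also be computed directly. The sketch below is a plain-Python illustration, not a plotting routine. The function name `box_plot_summary` and the dictionary keys are hypothetical, and it assumes the median-of-halves quartiles and the 1.5 × IQR rule described in this article.

```python
from statistics import median


def box_plot_summary(values, k=1.5):
    """Return the numbers a box plot displays for the given values."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q2 = median(ordered)
    q3 = median(ordered[(n + 1) // 2:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers end at the most extreme values that are not outliers.
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "lower_whisker": inside[0],
        "q1": q1,            # bottom edge of the box
        "median": q2,        # line inside the box
        "q3": q3,            # top edge of the box
        "upper_whisker": inside[-1],
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }


summary = box_plot_summary([1, 3, 4, 5, 5, 6, 7, 11])
for name, value in summary.items():
    print(name, "=", value)
```

For the numbers from the IQR example, the box runs from 3.5 to 6.5 with the median line at 5, the whiskers reach 1 and 11, and no points are drawn as outliers.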