Variance and Standard Deviation
Variance and standard deviation are statistical measures that are used to describe the amount of variability or spread in a dataset.
The “Range, Interquartile Range and Box Plot” section explained that the range, the interquartile range (IQR) and the box plot are useful ways to measure the variability of data.
Statisticians also rely heavily on two other measures of variability:
- Variance
- Standard Deviation
Variance measures how far a set of numbers is spread out from their average or mean value. It is calculated by taking the average of the squared differences between each number and the mean of the dataset. A higher variance indicates that the numbers are more spread out from their mean value.
Standard deviation is the square root of the variance and provides a measure of how much the data deviates from the mean. It is expressed in the same units as the data and is a more intuitive measure of the spread of the data since it is on the same scale.
Both variance and standard deviation are important measures in many areas of statistics, including hypothesis testing, quality control, and data analysis.
Why are variance and standard deviation good measures of variability?
Because they use every value of a variable, not just a few of them such as the extremes or the quartiles, to calculate the variability of your data.
Both variance and standard deviation come in two versions, one for a sample and one for a population. The formulas are given first; the difference between a sample and a population is discussed further below.
Variance
Here are the formulas for the sample and population variance and standard deviation. They differ only slightly, so compare them carefully.
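For reference, the standard textbook forms of these formulas can be written as below. The notation is an assumption based on common convention: μ and σ for the population mean and standard deviation, x̄ and s for their sample counterparts, N for the population size and n for the sample size.

```latex
% Population variance and standard deviation
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(X_i - \mu\right)^2,
\qquad
\sigma = \sqrt{\sigma^2}

% Sample variance and standard deviation
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{x}\right)^2,
\qquad
s = \sqrt{s^2}
```

The only difference between the two is the divisor: N for a population, n - 1 for a sample.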
Where
- X is an individual value
- N is the size of the population (n is the size of a sample)
- x̄ is the mean (conventionally written μ when it is the population mean)
How to calculate variance step by step
- Calculate the mean x̄.
- Subtract the mean from each observation: X - x̄.
- Square each of the resulting differences: (X - x̄)².
- Add these squared results together.
- Divide this total by the number of observations N (for a population) to get the variance σ². If you are calculating the sample variance s², divide by n - 1 instead.
- Take the positive square root of the variance to get the standard deviation (σ or s).
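The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `variance_steps` and the small dataset are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
import math


def variance_steps(values, sample=True):
    """Follow the listed steps and return (variance, standard deviation)."""
    n = len(values)
    mean = sum(values) / n                    # step 1: calculate the mean
    deviations = [x - mean for x in values]   # step 2: subtract the mean
    squared = [d ** 2 for d in deviations]    # step 3: square each difference
    total = sum(squared)                      # step 4: add the squares together
    divisor = n - 1 if sample else n          # step 5: n - 1 for a sample, n for a population
    variance = total / divisor
    return variance, math.sqrt(variance)      # step 6: positive square root


# Hypothetical data: mean is 5 and the squared differences sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var, pop_sd = variance_steps(data, sample=False)
samp_var, samp_sd = variance_steps(data, sample=True)

print("population:", pop_var, pop_sd)
print("sample:    ", samp_var, samp_sd)
```

With this data the population variance is 32 / 8 = 4 and the sample variance is 32 / 7, so the sample figure is slightly larger.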
Here,
N = 11
N - 1 = 10
Mean (x̄) = 15
Sum of squared differences = 639.74
Sample variance (s²) = 639.74 / 10 = 63.97
Population variance (σ²) = 639.74 / 11 = 58.16
s = √63.97 = 8.00
σ = √58.16 = 7.63
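The arithmetic in this example can be checked with a few lines of Python. The sketch assumes, as the divisions above imply, that 639.74 is the sum of the squared differences from the mean for the 11 observations; the variable names are illustrative.

```python
import math

sum_of_squares = 639.74  # sum of squared differences, taken from the example
n = 11                   # number of observations

sample_variance = sum_of_squares / (n - 1)   # divide by n - 1 for a sample
population_variance = sum_of_squares / n     # divide by n for a population

s = math.sqrt(sample_variance)               # sample standard deviation
sigma = math.sqrt(population_variance)       # population standard deviation

print(round(sample_variance, 2), round(s, 2))
print(round(population_variance, 2), round(sigma, 2))
```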
Intuition
- If the variance is high, your dataset has more variability. In other words, the values are spread further out around the mean.
- Standard deviation can be read as roughly the typical distance of an observation from the mean.
- The larger the standard deviation, the larger the variability of the data.
Properties of Variance
- It is always non-negative since each term in the variance sum is squared and therefore the result is either positive or zero.
- Variance always has squared units. For example, the variance of a set of weights measured in kilograms is given in kg squared. Because the variance is in squared units, we cannot compare it directly with the mean or with the data themselves.
Standard Deviation
The standard deviation is a measure of how spread out numbers are. Its symbol is σ (the Greek letter sigma) for the population standard deviation and s for the sample standard deviation. It is the square root of the variance.
Properties of Standard Deviation
- It is the square root of the mean of the squared differences between each value and the mean of the data set, which is why it is also called the root-mean-square deviation.
- The smallest value of the standard deviation is 0 since it cannot be negative.
- When the data values of a group are similar, the standard deviation is very low or close to zero. When the data values differ widely from each other, the standard deviation is high, or far from zero.
Population vs. Sample Variance and Standard Deviation
The primary task of inferential statistics (or estimating or forecasting) is drawing conclusions about a whole group while using only an incomplete sample of data.
In statistics, it is very important to distinguish between a population and a sample. A population is defined as all members (e.g. occurrences, prices, annual returns) of a specified group. The population is the whole group.
A sample is a part of a population that is used to describe the characteristics (e.g. mean or standard deviation) of the whole population. The size of a sample can be less than 1%, or 10%, or 60% of the population, but it is never the whole population. Because a sample and a population are not the same thing, their formulas differ slightly: the sample variance divides by n - 1 rather than n, which compensates for the fact that differences measured from the sample mean tend to understate the true spread of the population.
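Python's standard library `statistics` module exposes both versions, which makes the difference easy to see side by side. The sketch below uses a small hypothetical dataset and treats it first as a whole population, then as a sample.

```python
import statistics

# Hypothetical values, used only for illustration.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Treat the values as the whole population: divide by N.
population_var = statistics.pvariance(values)
population_sd = statistics.pstdev(values)

# Treat the same values as a sample: divide by n - 1.
sample_var = statistics.variance(values)
sample_sd = statistics.stdev(values)

print("population:", population_var, population_sd)
print("sample:    ", sample_var, sample_sd)
```

For the same numbers the sample figures always come out a little larger than the population figures, and the gap shrinks as the number of observations grows.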
A question may arise: why do we square the differences when calculating the variance?
To get rid of the negative signs, so that negative and positive differences do not cancel each other out when added together.
(+5) + (-5) = 0, whereas (+5)² + (-5)² = 50
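The cancellation can be seen in a short Python sketch. The two values are hypothetical, picked so that the differences from the mean are exactly -5 and +5.

```python
data = [10, 20]  # hypothetical values; their mean is 15
mean = sum(data) / len(data)

# Differences from the mean: one negative, one positive.
deviations = [x - mean for x in data]

sum_raw = sum(deviations)                       # negatives and positives cancel
sum_squared = sum(d ** 2 for d in deviations)   # squaring removes the signs

print(deviations, sum_raw, sum_squared)
```

The raw differences add up to zero even though the values clearly vary, while the squared differences give a total that reflects the spread.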