Statistics with R
Central Tendency and Spread in R Programming
How to Calculate Central Tendency and Spread in R Programming
The central tendency of a dataset refers to the central or typical value around which the data points tend to cluster. There are several measures of central tendency, including mean, median, and mode.
This section uses the mtcars dataset. It ships with base R in the datasets package, so no additional package needs to be installed. The dataset is normally available as soon as R starts, but you can load it explicitly into your workspace.
Use the code below to do that.
data(mtcars)
You can now access the data through the name ‘mtcars’. Explore it a little with functions such as names(), str(), summary(), and dim().
str(mtcars)
Output:
'data.frame': 32 obs. of 11 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num 160 160 108 258 360 ...
 $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num 16.5 17 18.6 19.4 17 ...
 $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
Output:
[1] 32 11
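The other exploration functions mentioned earlier work the same way. The following is a minimal sketch using only base R; the choice of three rows for head() is arbitrary.

```r
# mtcars is part of the 'datasets' package that comes with base R
names(mtcars)         # the 11 column names
head(mtcars, 3)       # the first three rows of the data frame
summary(mtcars$mpg)   # minimum, quartiles, median, mean, and maximum of mpg
```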
Meaning of a question mark (?) before a data object
Type ?mtcars in the R console to open the help page, which gives a detailed description of the mtcars dataset and its variables.
?mtcars
Measure Central Tendency in R
A measure of central tendency is a single value that describes a set of data by identifying the central position within it. The mean and median are the most commonly used measures of central tendency for numerical data.
Mean
The mean is the average value of a set of data points. In R, the mean() function can be used to calculate the mean.
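As a small sketch of what mean() does, the example below compares it with the sum divided by the count. The vector values are made up for illustration.

```r
x <- c(2, 4, 6, 8, 100)   # illustrative values, not taken from mtcars

manual_mean <- sum(x) / length(x)   # sum of the values divided by their count
manual_mean                         # 24
mean(x)                             # 24, the same result

# A missing value makes mean() return NA unless na.rm = TRUE is set
y <- c(x, NA)
mean(y)                  # NA
mean(y, na.rm = TRUE)    # 24
```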
Median
The median is the middle value in a set of data points when they are arranged in order. In R, the median() function can be used to calculate the median.
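The sketch below, on made-up vectors, shows how median() handles an odd and an even number of values, and why it is less sensitive to outliers than the mean.

```r
odd  <- c(7, 1, 5)       # sorted: 1 5 7
even <- c(7, 1, 5, 3)    # sorted: 1 3 5 7

median(odd)     # 5, the single middle value
median(even)    # 4, the average of the two middle values 3 and 5

# One extreme value moves the mean a lot but leaves the median alone
z <- c(2, 4, 6, 8, 100)
median(z)       # 6
mean(z)         # 24
```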
Mode
The mode is the value that occurs most frequently in a set of data points. In R, there is no built-in function to calculate the mode. However, you can create a function to calculate it.
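One possible way to write such a function is sketched here. The name get_mode is hypothetical, and returning every tied value (rather than only the first) is a design choice, not a standard.

```r
# Hypothetical helper: returns the most frequent value(s) of a vector
get_mode <- function(x) {
  x <- x[!is.na(x)]                  # drop missing values
  ux <- unique(x)                    # distinct values, in order of first appearance
  counts <- tabulate(match(x, ux))   # how often each distinct value occurs
  ux[counts == max(counts)]          # keep every value tied for the highest count
}

get_mode(c(1, 2, 2, 3, 3, 3))      # 3
get_mode(c("a", "b", "a", "b"))    # "a" "b" (a tie)
get_mode(mtcars$cyl)               # 8
```

Because the function works on unique values and counts, it also handles character vectors and factors.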
The dataset contains a variable called ‘mpg’ (miles per US gallon). To compute its mean and median, run the code below.
mean(mtcars$mpg)
median(mtcars$mpg)
Output:
> mean(mtcars$mpg)
[1] 20.09062
> median(mtcars$mpg)
[1] 19.2
Measures of Spread in R
The spread of a dataset refers to the amount of variability or dispersion among the data points.
Common measures of spread include the variance, standard deviation, interquartile range (IQR), minimum value, maximum value, and range.
Range
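The range is the difference between the maximum and minimum values in a set of data points. In R, the range() function returns the minimum and maximum values themselves rather than their difference; subtract one from the other, or use diff(range(x)), to get the range as a single number.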
Variance
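The variance measures how far the data points lie from the mean, computed from the squared deviations from the mean. In R, you can calculate the variance using the var() function, which returns the sample variance (the sum of squared deviations divided by n - 1).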
Standard deviation
The standard deviation is the square root of the variance and is also a measure of the amount of variability among the data points. In R, you can calculate the standard deviation using the sd() function.
Calculate Variance in R
var(mtcars$mpg)
Output:
[1] 36.3241
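To make the definition concrete, the sketch below recomputes the variance of mpg by hand and checks it against var(). The names x, n, and manual_var are illustrative.

```r
x <- mtcars$mpg
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

all.equal(manual_var, var(x))   # TRUE
```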
Calculate Standard deviation in R
sd(mtcars$mpg)
Output:
[1] 6.026948
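A quick check of the relationship between the two measures: the standard deviation should equal the square root of the variance. Unlike the variance, it is expressed in the same units as the data (here, miles per gallon).

```r
x <- mtcars$mpg

sd_from_var <- sqrt(var(x))       # square root of the sample variance
all.equal(sd_from_var, sd(x))     # TRUE
```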
Calculate Interquartile range (IQR) in R
IQR(mtcars$mpg)
Output:
[1] 7.375
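The IQR is the distance between the third and first quartiles. The sketch below derives it from quantile() with its default settings; the variable names are illustrative.

```r
# First and third quartiles of mpg
q <- quantile(mtcars$mpg, probs = c(0.25, 0.75))
q

iqr_manual <- unname(q[2] - q[1])           # third quartile minus first quartile
all.equal(iqr_manual, IQR(mtcars$mpg))      # TRUE
```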
The min(), max(), and range() functions in R
In R, you can use the min() and max() functions to find the minimum and maximum values of a vector or a set of values. The range() function returns both in a single vector of length two, with the minimum first.
min(mtcars$mpg)
max(mtcars$mpg)
range(mtcars$mpg)
Output:
> min(mtcars$mpg)
[1] 10.4
> max(mtcars$mpg)
[1] 33.9
> range(mtcars$mpg)
[1] 10.4 33.9
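Since range() returns the two endpoints, the range as a single number (maximum minus minimum) needs one more step. A minimal sketch:

```r
r <- range(mtcars$mpg)   # vector of length two: minimum, then maximum

span <- diff(r)          # maximum minus minimum
span                     # 23.5
max(mtcars$mpg) - min(mtcars$mpg)   # 23.5, the same value
```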
Categorical Variable
In statistics and data analysis, a categorical variable is a variable that can take on one of a limited, finite set of values, known as categories or levels. These categories can be either nominal or ordinal.
Categorical variables are often represented in R using factors, which are variables that have a fixed set of categories, or levels. Factors are useful in R for representing categorical variables because they can be ordered, labeled, and used in statistical models.
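In mtcars, categorical variables such as cyl and am are stored as plain numbers, so they need to be converted before R treats them as factors. The sketch below works on a copy; the name mt and the label strings are illustrative choices, while the coding 0 = automatic, 1 = manual comes from the ?mtcars help page.

```r
# Work on a copy so the built-in dataset is left untouched
mt <- mtcars

# Convert the numeric cylinder count to a factor
mt$cyl <- factor(mt$cyl)
levels(mt$cyl)        # "4" "6" "8"
is.factor(mt$cyl)     # TRUE

# Attach readable labels to the 0/1 transmission code
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))
table(mt$am)          # 19 automatic, 13 manual
```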
For categorical variables, counts and percentages can be used for summary.
table(mtcars$cyl)
table(mtcars$cyl)/nrow(mtcars)
Output:
> table(mtcars$cyl)

 4  6  8
11  7 14
>
> table(mtcars$cyl)/nrow(mtcars)

      4       6       8
0.34375 0.21875 0.43750
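Base R also provides prop.table(), which turns a table of counts into proportions without dividing by the number of rows yourself. A short sketch, with illustrative variable names:

```r
counts <- table(mtcars$cyl)    # counts per number of cylinders
props  <- prop.table(counts)   # each count divided by the total count

props          # proportions that sum to 1
100 * props    # the same values as percentages
```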
The unique() function in R
To see which distinct values occur in a column, use the unique() function. It returns the values themselves in order of first appearance; wrap it in length() to count them.
unique(mtcars$cyl)
Output:
[1] 6 4 8
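Two small variations that are often useful alongside unique(), shown as a sketch:

```r
length(unique(mtcars$cyl))   # 3, the number of distinct values
sort(unique(mtcars$cyl))     # 4 6 8, the distinct values in ascending order
```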
The table() function in R
In R, the table() function is used to create a frequency table of the counts of values in a vector or a data frame column. The output of the table() function is a table object that shows the number of occurrences of each unique value in the vector or column.
To get a two-way frequency table of the number of cylinders (rows) against the number of carburetors (columns), use the code below.
table(mtcars$cyl, mtcars$carb)
Output:
> table(mtcars$cyl, mtcars$carb)

    1 2 3 4 6 8
  4 5 6 0 0 0 0
  6 2 0 0 4 1 0
  8 0 4 3 6 0 1
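A two-way table is easier to read when its dimensions are named, and base R can add totals or row proportions to it. The following is a sketch; the dimension names cyl and carb and the object name tab are illustrative.

```r
# Naming the arguments labels the rows and columns in the printed table
tab <- table(cyl = mtcars$cyl, carb = mtcars$carb)
tab

addmargins(tab)               # adds row and column totals
prop.table(tab, margin = 1)   # proportions within each row (each cyl group)
```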