### Data Basics: Summary Statistics

R has built-in functions for a large number of summary statistics. A few of them are shown here with examples. This section uses the “SwimRecords” dataset, which is read directly from a URL in the code below.

SwimRecords
100 m Swimming World Records

##### Description

World records for men and women over time from 1905 through 2004.

##### Usage

data(SwimRecords)

##### Format

A data frame with 62 observations of the following variables.

time time (in seconds) of the world record

year Year in which the record was set

sex a factor with levels M and F

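`swim <- read.csv(url("http://bit.ly/2tWRogX"), stringsAsFactors = TRUE)`

Note: from R 4.0.0 onward, read.csv() reads text columns as character by default, so stringsAsFactors = TRUE is needed for sex to be stored as a factor, as described above.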

Check the class of an object (numeric, matrix, data frame, etc.) using the class() function.

# class of an object (numeric, matrix, data frame, etc.)
class(swim)

Output:

> class(swim)
[1] "data.frame"

You can print the swim data by typing the name of the variable where you stored it, i.e. swim, or open it in a viewer with View(swim).

swim
View(swim)

To print the first rows of the swim data, use the head() function. By default it returns the first 6 rows, but you can print more using the argument n.

# print first 20 rows of swim data
head(swim, n=20)

Output of the default call, which shows the first 6 rows:

> head(swim)
X year time sex
1 1 1905 65.8 M
2 2 1908 65.6 M
3 3 1910 62.8 M
4 4 1912 61.6 M
5 5 1918 61.4 M
6 6 1920 60.4 M

Similarly, tail() gives the last few rows of your data. By default it returns the last 6 rows, just as head() returns the first 6, but you can print more using the argument n.

# print last 20 rows of swim data
tail(swim, n=20)

Output of the default call, which shows the last 6 rows:

> tail(swim)
X year time sex
57 57 1980 54.79 F
58 58 1986 54.73 F
59 59 1992 54.48 F
60 60 1994 54.01 F
61 61 2000 53.77 F
62 62 2004 53.52 F
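As a small sketch of how the n argument behaves, the block below builds a stand-in data frame from six of the rows printed above, so it runs without downloading anything. The name demo and the variable names are assumptions made for illustration. A negative n is also shown: in head() it drops that many rows from the end instead of selecting rows from the start.

```r
# demo is a hypothetical stand-in for swim, built from rows printed above
demo <- data.frame(
  year = c(1905, 1908, 1910, 1994, 2000, 2004),
  time = c(65.8, 65.6, 62.8, 54.01, 53.77, 53.52),
  sex  = factor(c("M", "M", "M", "F", "F", "F"))
)

first_two    <- head(demo, n = 2)   # first 2 rows
last_two     <- tail(demo, n = 2)   # last 2 rows
all_but_last <- head(demo, n = -1)  # negative n: everything except the last row

print(first_two)
print(last_two)
print(nrow(all_but_last))
```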

Get the dimensions of the swim data using the dim() function. It returns the number of cases (rows) followed by the number of variables (columns).

# dimensions of swim data
dim(swim)

Output:

dim(swim)
[1] 62 4

To see which variables the data contains, use the names() function.

# list the variables in swim data
names(swim)

Output:

> names(swim)
[1] "X" "year" "time" "sex"

The swim data has 4 variables.

To inspect the structure of a data frame (the type and first few values of each variable), use the str() function.

# list the structure of swim data
str(swim)

Output:

> str(swim)
'data.frame': 62 obs. of 4 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ year: int 1905 1908 1910 1912 1918 1920 1922 1924 1934 1935 ...
$ time: num 65.8 65.6 62.8 61.6 61.4 60.4 58.6 57.4 56.8 56.6 ...
$ sex : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...

If you are working with a categorical variable and want to know which categories it contains, use the levels() function.

# list levels of the factor variable in swim data
levels(swim$sex)

Output:

> levels(swim$sex)
[1] "F" "M"
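One thing worth checking is that the column really is a factor, because levels() only works on factors. The sketch below uses two short hypothetical vectors (sex_chr and sex_fct are illustrative names, not part of the dataset) to show the difference, along with nlevels(), which counts the categories.

```r
# A plain character vector has no levels attribute
sex_chr <- c("M", "M", "F")
chr_levels <- levels(sex_chr)   # NULL

# Converting to a factor creates the levels (sorted alphabetically by default)
sex_fct <- factor(sex_chr)
fct_levels <- levels(sex_fct)   # "F" "M"
n_levels   <- nlevels(sex_fct)  # 2

print(chr_levels)
print(fct_levels)
print(n_levels)
```

If levels() returns NULL on a column you expected to be categorical, converting it with factor() is usually the fix.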

is.na() returns TRUE for every element that is missing and FALSE otherwise. You can apply it to a whole data frame or to a single column; wrap the result in any() to find out whether there are any missing values at all.

is.na(swim) # logical matrix: TRUE wherever swim has a missing value
is.na(swim$sex) # logical vector: TRUE wherever the sex column has a missing value
any(is.na(swim)) # a single TRUE if swim has at least one missing value
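To make the element-wise behaviour concrete, here is a minimal sketch on a tiny hypothetical data frame, demo_na, in which one value has been set to NA on purpose (the swim data itself is not modified). It also counts the missing values per column with colSums().

```r
# demo_na is hypothetical: the NA below is inserted deliberately for illustration
demo_na <- data.frame(
  year = c(1905, 1908, 1910),
  time = c(65.8, NA, 62.8)
)

na_flags   <- is.na(demo_na$time)      # one TRUE/FALSE per element
has_na     <- anyNA(demo_na$time)      # single TRUE/FALSE for the whole column
na_per_col <- colSums(is.na(demo_na))  # number of missing values in each column

print(na_flags)
print(has_na)
print(na_per_col)
```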

In R, the summary() function gives an overall summary of a data frame. For numeric columns it returns the minimum, 1st quartile, median, mean, 3rd quartile, and maximum; for factor columns it returns the count of each level.

# min, 1st quartile, median, mean, 3rd quartile, max
summary(swim)

Output:

>summary(swim)
X year time sex
Min. : 1.00 Min. :1905 Min. :47.84 F:31
1st Qu.:16.25 1st Qu.:1924 1st Qu.:53.64 M:31
Median :31.50 Median :1956 Median :56.88
Mean :31.50 Mean :1952 Mean :59.92
3rd Qu.:46.75 3rd Qu.:1976 3rd Qu.:65.20
Max. :62.00 Max. :2004 Max. :95.00
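summary() can also be applied to a single column, and quantile() returns whichever percentiles you ask for. The sketch below uses six of the record times printed earlier, stored in a hypothetical vector named time, so it runs on its own.

```r
# Six record times taken from the rows printed earlier (illustrative subset)
time <- c(65.8, 65.6, 62.8, 54.01, 53.77, 53.52)

time_summary <- summary(time)                           # min, quartiles, median, mean, max
quartiles    <- quantile(time, probs = c(0.25, 0.75))   # 25th and 75th percentiles only

print(time_summary)
print(quartiles)
```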

nrow() returns the number of rows in a data frame.

# To see how many cases there are in a data frame, use nrow():
nrow(swim)

Output:

>nrow(swim)
[1] 62

ncol() is very similar to nrow(): it returns the number of columns in a data frame.

# To see how many variables there are in a data frame, use ncol():
ncol(swim)

Output:

> ncol(swim)
[1] 4
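The three functions are related: dim() returns the same two numbers that nrow() and ncol() return separately. A quick sketch on a hypothetical stand-in data frame (demo, built from rows printed earlier) checks this.

```r
# demo is a hypothetical stand-in for swim, built from rows printed earlier
demo <- data.frame(
  year = c(1905, 1908, 1910, 1994, 2000, 2004),
  time = c(65.8, 65.6, 62.8, 54.01, 53.77, 53.52),
  sex  = factor(c("M", "M", "M", "F", "F", "F"))
)

dims <- dim(demo)                                    # rows first, then columns
same <- identical(dims, c(nrow(demo), ncol(demo)))   # TRUE

print(dims)
print(same)
```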

min() gives the minimum value of a column.

#Check min value in the “time” column of swim data
min(swim$time)

Output:

> min(swim$time)
[1] 47.84

Similarly max() function will return maximum value of a column.

#Check max value in the “time” column of swim data
max(swim$time)

Output:

> max(swim$time)
[1] 95

mean() and median() are two functions that are useful to calculate the mean and median of any column or vector.

#Calculate mean swimming time from swim data
mean(swim$time)

Output:

[1] 59.92419

#Calculate median swimming time from swim data
median(swim$time)

Output:

[1] 56.88
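Both functions return NA if the input contains a missing value, unless you pass na.rm = TRUE. The sketch below shows this on a short hypothetical vector, times, where an NA has been inserted on purpose.

```r
# times is hypothetical: the NA is inserted deliberately for illustration
times <- c(65.8, 65.6, NA, 54.01)

mean_with_na  <- mean(times)                  # NA, because one value is missing
mean_no_na    <- mean(times, na.rm = TRUE)    # mean of the three known values
median_no_na  <- median(times, na.rm = TRUE)  # median of the three known values

print(mean_with_na)
print(mean_no_na)
print(median_no_na)
```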

The table() function tabulates categorical variables. If you have a factor variable and want to know how many observations fall in each level, use table(). It returns the frequency of each level.

#table() gives the frequencies of male and female
table(swim$sex)

Output:

F M
31 31

You can also create a logical vector and pass it to table(), as in the example below.

## a logical vector is created and passed into table
table(swim$year > 1980)

Output:

FALSE TRUE
50 12

To count how many records were under 55 seconds, broken down by sex, pass both the logical vector and the factor to table().

table(swim$time < 55, swim$sex)

Output:

F  M
FALSE 25 14
TRUE 6 17
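Two-way tables are easier to read when the dimensions are labelled, and prop.table() turns counts into proportions. The sketch below does both on a hypothetical stand-in data frame (demo, built from rows printed earlier); the labels fast and sex are illustrative names.

```r
# demo is a hypothetical stand-in for swim, built from rows printed earlier
demo <- data.frame(
  year = c(1905, 1908, 1910, 1994, 2000, 2004),
  time = c(65.8, 65.6, 62.8, 54.01, 53.77, 53.52),
  sex  = factor(c("M", "M", "M", "F", "F", "F"))
)

# Naming the arguments labels the rows and columns of the table
counts <- table(fast = demo$time < 55, sex = demo$sex)

# margin = 2 gives proportions within each column (each sex)
props <- prop.table(counts, margin = 2)

print(counts)
print(props)
```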

The with() function evaluates an expression inside a data frame. So far, variables have been accessed with the dollar sign ($). With with(), you name the data frame once, and the expression can then refer to its variables directly.
For example:

sqrt(swim$year)
#OR
with( data=swim, sqrt(year)) #no need for the dollar sign or the data frame name

Output:

> sqrt(swim$year)
[1] 43.64631 43.68066 43.70355 43.72642 43.79498 43.81780 43.84062 43.86342 43.97727 43.98863 44.00000 44.09082 44.12482 44.13615
[15] 44.21538 44.23799 44.28318 44.31704 44.35087 44.36215 44.38468 44.40721 44.44097 44.45222 44.50843 44.55334 44.56456 44.58699
[29] 44.65423 44.72136 44.72136 43.68066 43.70355 43.71499 43.72642 43.76071 43.81780 43.85202 43.86342 43.88622 43.92038 43.93177
[43] 43.94315 43.96590 43.97727 44.00000 44.22669 44.24929 44.27189 44.29447 44.31704 44.40721 44.41846 44.42972 44.45222 44.47471
[57] 44.49719 44.56456 44.63183 44.65423 44.72136 44.76606
> #OR
> with( data=swim, sqrt(year)) #no need for the dollar sign or the data frame name
[1] 43.64631 43.68066 43.70355 43.72642 43.79498 43.81780 43.84062 43.86342 43.97727 43.98863 44.00000 44.09082 44.12482 44.13615
[15] 44.21538 44.23799 44.28318 44.31704 44.35087 44.36215 44.38468 44.40721 44.44097 44.45222 44.50843 44.55334 44.56456 44.58699
[29] 44.65423 44.72136 44.72136 43.68066 43.70355 43.71499 43.72642 43.76071 43.81780 43.85202 43.86342 43.88622 43.92038 43.93177
[43] 43.94315 43.96590 43.97727 44.00000 44.22669 44.24929 44.27189 44.29447 44.31704 44.40721 44.41846 44.42972 44.45222 44.47471
[57] 44.49719 44.56456 44.63183 44.65423 44.72136 44.76606
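with() is most useful when an expression involves several variables from the same data frame. As a rough sketch, the block below computes the mean time for each sex and the range of times on a hypothetical stand-in data frame (demo, built from rows printed earlier).

```r
# demo is a hypothetical stand-in for swim, built from rows printed earlier
demo <- data.frame(
  year = c(1905, 1908, 1910, 1994, 2000, 2004),
  time = c(65.8, 65.6, 62.8, 54.01, 53.77, 53.52),
  sex  = factor(c("M", "M", "M", "F", "F", "F"))
)

# Both time and sex are looked up inside demo, with no dollar signs needed
mean_by_sex <- with(demo, tapply(time, sex, mean))
time_range  <- with(demo, max(time) - min(time))

print(mean_by_sex)
print(time_range)
```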

Sometimes you will compute a new quantity from the existing variables and want to treat this as a new variable. Adding a new variable to a data frame can be done with the transform() function. For instance, here is how to create a new variable in swim that holds the time converted from seconds to units of minutes:

swim<- transform( swim, minutes = time/60 )

The new variable, minutes, appears just like the old ones:

names(swim)

Output:

`[1] "X"       "year"    "time"    "sex"     "minutes"`
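Note that transform() returns a new data frame; the original only changes if you assign the result back to the same name, as done above. The sketch below illustrates this on a small hypothetical data frame (demo) and shows an equivalent way to add the column with the dollar sign.

```r
# demo is a hypothetical two-row example using times printed earlier
demo <- data.frame(time = c(65.8, 54.01))

# transform() leaves demo untouched and returns a copy with the new column
with_minutes <- transform(demo, minutes = time / 60)
unchanged    <- identical(names(demo), "time")  # TRUE: demo has no minutes column yet

# Equivalent alternative: assign the new column directly
demo$minutes <- demo$time / 60

print(names(with_minutes))
print(unchanged)
print(demo)
```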

You could also, if you want, redefine an existing variable, for instance:

swim = transform( swim, time=time/60 )

After the transformation, check the first few rows of the data using the head() function.

Output: