tapply in R

tapply() applies a function to each cell of a ragged array, that is, to each (non-empty) group of values given by a unique combination of the levels of certain factors. Put simply, tapply() splits a vector into subsets according to one or more factor variables and applies a function to each subset.

To make this concrete, imagine you have the heights of 1000 people (500 males and 500 females) and you want to know the average height of males and of females in this sample. To solve this, you group the heights by gender, giving the heights of the 500 males and the heights of the 500 females, and then calculate the average of each group.
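The two steps described above (group, then average) can be sketched with split() and sapply(), and compared with a single tapply() call. The data here are simulated; the means, standard deviations, and seed are assumptions chosen only for illustration.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

# Hypothetical sample: 500 male and 500 female heights
height <- c(rnorm(500, mean = 175, sd = 7), rnorm(500, mean = 162, sd = 6))
gender <- gl(2, 500, labels = c("Male", "Female"))

# Step 1: break the heights into one vector per gender
groups <- split(height, gender)

# Step 2: compute the mean of each group
by_hand <- sapply(groups, mean)

# tapply() performs both steps in a single call
with_tapply <- tapply(height, gender, mean)

# Both approaches give the same group means
all.equal(as.vector(by_hand), as.vector(with_tapply))
```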

To open the help file, type the following:

?tapply

To see the arguments of the tapply() function, type str(tapply) in the console.

str(tapply)

Output:

function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)

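  1.  X is the vector of values to be split into groups
  2.  INDEX is a factor or a list of factors, each the same length as X (anything else is coerced to a factor)
  3.  FUN is the function to be applied to each group
  4.  ... contains other arguments to be passed to FUN
  5.  default is the value used to fill cells for factor combinations that contain no data (NA unless you change it)
  6.  simplify controls the shape of the result: if TRUE (the default) and FUN returns a single value for every group, the result is an ordinary array; if FALSE, the result is always an array of mode list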
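As a rough illustration of the ..., simplify, and default arguments, here is a minimal sketch on made-up data. The vectors score, group, and g2 are hypothetical and exist only for this example.

```r
# Hypothetical toy data: one missing score in group "b"
score <- c(10, 12, NA, 8, 9, 11)
group <- factor(c("a", "a", "b", "b", "c", "c"))

# Without na.rm, the mean of any group containing NA is NA
plain <- tapply(score, group, mean)

# Extra arguments after FUN are forwarded to FUN through ...
no_na <- tapply(score, group, mean, na.rm = TRUE)

# simplify = FALSE keeps the result as a list-mode array
as_list <- tapply(score, group, mean, na.rm = TRUE, simplify = FALSE)

# With two factors, combinations that never occur are filled with `default`
g2 <- factor(c("x", "y", "x", "x", "y", "y"))
counts <- tapply(score, list(group, g2), length, default = 0)
counts
```

Group "b" never occurs together with level "y", so that cell of counts holds the default value instead of NA.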

Example:

 

x <- runif(20, min = 155, max = 180)          # simulate 20 random heights
y <- gl(2, 10, labels = c("Male", "Female"))  # factor: 10 "Male" followed by 10 "Female"
tapply(x, y, mean)                            # mean height for each level of y

Output (your numbers will differ, because runif() draws new random values on each run):

    Male   Female
168.4516 163.8848

Example 2:

R ships with several built-in datasets. Here we use the mtcars dataset; you can read its help file by typing ?mtcars. We want the average mpg for each combination of transmission type (am, where 0 is automatic and 1 is manual) and number of cylinders (cyl). In other words, we want mpg averaged within groups defined by two factors at once.

?mtcars
data(mtcars)
str(mtcars)

Output:

'data.frame': 32 obs. of 11 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num 160 160 108 258 360 ...
 $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num 16.5 17 18.6 19.4 17 ...
 $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num 4 4 1 1 2 1 4 2 2 4 ...

 

tapply(mtcars$mpg, list(mtcars$cyl, mtcars$am), mean)

Output:

       0        1
4 22.900 28.07500
6 19.125 20.56667
8 15.050 15.40000
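In the table above, rows are the cyl values and columns are the am values. One possible way to make such a table easier to read is to label the factor levels and name the elements of the INDEX list. The variable names am_label and mpg_means below are just illustrative choices, and the labels assume the coding given in ?mtcars (0 = automatic, 1 = manual).

```r
# Assumption: labels follow ?mtcars, where am is 0 = automatic, 1 = manual
am_label <- factor(mtcars$am, levels = c(0, 1),
                   labels = c("automatic", "manual"))

# Naming the list elements gives the result named dimensions (cyl and am)
mpg_means <- tapply(mtcars$mpg, list(cyl = mtcars$cyl, am = am_label), mean)

# Round only for display; mpg_means itself keeps full precision
round(mpg_means, 2)
```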

Example 3:

The next example uses the iris dataset. Check the structure of the dataset using str(iris). We want to calculate the mean Sepal Length for each Species.

?iris
data(iris)
str(iris)

Output:

'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

 

tapply(iris$Sepal.Length, iris$Species, mean)

Output:

    setosa versicolor  virginica
     5.006      5.936      6.588

Similarly, you can calculate the mean of the Petal Length for each Species.

tapply(iris$Petal.Length, iris$Species, mean)

Output:

    setosa versicolor  virginica
     1.462      4.260      5.552
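Rather than repeating the call for each measurement, one option is to loop over the numeric columns with sapply(). This is only a sketch of one approach; the names measurements and species_means are illustrative.

```r
# The first four columns of iris are the numeric measurements
measurements <- iris[, 1:4]

# For each column, compute the per-species mean with tapply()
species_means <- sapply(measurements,
                        function(col) tapply(col, iris$Species, mean))

# Result: one row per species, one column per measurement
species_means
```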

sapply

split