dplyr Package – group_by()

The group_by() function sets up a data frame so that later operations are carried out within strata defined by one or more variables. On its own it does not change the data; it only records how you want the rows grouped. The usual workflow combines two steps: group_by() conceptually splits the data frame into pieces defined by a variable or set of variables, and summarise() (also spelled summarize()) then applies a summary function to each piece and returns one row per group.
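The split-then-summarise idea can be sketched on a tiny data frame. The scores data frame and its team and score columns are hypothetical, invented only for this illustration; the base R version is shown purely for comparison.

```r
library(dplyr)

# Hypothetical toy data: two teams with three scores each
scores <- data.frame(
  team  = c("a", "a", "a", "b", "b", "b"),
  score = c(1, 2, 3, 10, 20, 30)
)

# Split-apply-combine by hand with base R
pieces <- split(scores$score, scores$team)  # split the scores by team
manual <- sapply(pieces, mean)              # apply mean() and combine

# The same computation with dplyr
grouped  <- group_by(scores, team)                         # record the grouping
by_dplyr <- summarise(grouped, mean_score = mean(score))   # one row per team

manual    # named vector: a = 2, b = 20
by_dplyr  # tibble with one row per team
```

Both approaches give the same group means; the dplyr version keeps the grouping variable as a column in the result.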

For the examples in this section we will use mtcars, a data set built into R. You can load it explicitly with the data("mtcars") command. To view the help file for the mtcars data, type ?mtcars. Don't forget to load the dplyr package.

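library(dplyr)     # provides group_by() and summarise()
library(datasets)  # normally attached by default; this makes mtcars available
# Optionally, copy mtcars into your workspace:
data("mtcars")

?mtcars            # open the help page for the data set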

You can see some basic characteristics of the data set with the dim(), str() and names() functions.

dim(mtcars)
str(mtcars)
names(mtcars)

Output:

> dim(mtcars)
[1] 32 11
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num 160 160 108 258 360 ...
 $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num 16.5 17 18.6 19.4 17 ...
 $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num 4 4 1 1 2 1 4 2 2 4 ...

> names(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"

Example 1:

Now we can group the data frame by the cyl variable and compute the mean displacement and mean horsepower within each group.

# Store the grouped data under its own name so it is not confused with the cyl column
by_cyl <- group_by(mtcars, cyl)
summarise(by_cyl, mean(disp), mean(hp))

Output:

> summarise(by_cyl, mean(disp), mean(hp))
# A tibble: 3 x 3
    cyl `mean(disp)` `mean(hp)`
  <dbl>        <dbl>      <dbl>
1     4     105.1364   82.63636
2     6     183.3143  122.28571
3     8     353.1000  209.21429
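The backticked column names in the output above are awkward to work with. As a small sketch, the summaries can be given explicit names; mean_disp, mean_hp and n_cars are names chosen here for illustration, not part of the original example.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# Name each summary so the result has clean column names
res <- summarise(by_cyl,
  mean_disp = mean(disp),  # mean displacement per cylinder group
  mean_hp   = mean(hp),    # mean horsepower per cylinder group
  n_cars    = n()          # number of cars in each group
)

res
names(res)
```

Named columns can then be referred to directly, for example res$mean_hp.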

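Example 2:

You can also group by more than one variable. Here we group by vs and am and count the rows in each combination.

by_vs_am <- group_by(mtcars, vs, am)
summarise(by_vs_am, n = n())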

Output:

> summarise(by_vs_am, n = n())
Source: local data frame [4 x 3]
Groups: vs [?]

# A tibble: 4 x 3
     vs    am     n
  <dbl> <dbl> <int>
1     0     0    12
2     0     1     6
3     1     0     7
4     1     1     7
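The "Groups: vs" line in the output indicates that the result is still grouped by vs: summarise() removes only the last grouping variable. A minimal sketch of how to inspect and drop the remaining grouping follows; counts and flat are names chosen here for illustration. Recent dplyr versions may also print a message about the grouping of the output, which does not affect the result.

```r
library(dplyr)

by_vs_am <- group_by(mtcars, vs, am)
counts   <- summarise(by_vs_am, n = n())

group_vars(counts)       # grouping variable(s) still attached to the result
flat <- ungroup(counts)  # remove the remaining grouping
group_vars(flat)         # no grouping variables left
sum(flat$n)              # the group counts add up to the rows in mtcars
```

Calling ungroup() before further steps avoids later operations being applied per vs group by accident.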
