Data Frames and the dplyr Package
The data frame is a key data structure in statistics and in R, and the dplyr package is very helpful for managing data frames. It was developed by Hadley Wickham of RStudio as an optimized and distilled version of his plyr package. dplyr does not provide any “new” functionality to R, in the sense that everything it does could already be done with base R, but it greatly simplifies that existing functionality. Filtering, reordering, and collapsing data can be tedious operations in base R, where the syntax is often not very intuitive. The dplyr package is designed to mitigate many of these problems and to provide a highly optimized set of routines specifically for dealing with data frames.
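To make the comparison with base R concrete, here is a minimal sketch that performs the same task both ways. It assumes dplyr is already installed and uses the built-in mtcars data set; the choice of columns and the threshold of 25 are illustrative only and do not come from the text above.

```r
library(dplyr)

# Base R: keep cars with mpg above 25, keep two columns, sort by mpg (descending)
base_res <- mtcars[mtcars$mpg > 25, c("mpg", "cyl")]
base_res <- base_res[order(-base_res$mpg), ]

# dplyr: the same three steps written as named verbs
dplyr_res <- mtcars %>%
  filter(mpg > 25) %>%
  select(mpg, cyl) %>%
  arrange(desc(mpg))
```

Both versions should contain the same values; the dplyr version simply names each step, which tends to be easier to read as the number of steps grows.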
Some of the key functions provided by the dplyr package are:
- select: returns a subset of the columns of a data frame.
- filter: extracts a subset of rows from a data frame based on logical conditions.
- arrange: reorders the rows of a data frame.
- rename: renames variables in a data frame.
- mutate: adds new variables/columns or transforms existing variables.
- summarise / summarize: generates summary statistics of different variables in the data frame.
- %>%: the “pipe” operator, used to connect multiple verb actions together into a pipeline.
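The verbs listed above can be sketched together in one short pipeline. This is only an illustration, assuming dplyr is installed; it uses the built-in mtcars data set, and the names cars, weight, weight_lb, and overall are made up for the example.

```r
library(dplyr)

cars <- mtcars %>%
  select(mpg, cyl, wt) %>%               # keep three columns
  filter(cyl == 4) %>%                   # keep only four-cylinder cars
  rename(weight = wt) %>%                # syntax is new_name = old_name
  mutate(weight_lb = weight * 1000) %>%  # wt is recorded in units of 1000 lbs
  arrange(desc(mpg))                     # sort by mpg, highest first

# Collapse the result to a single row of summary statistics
overall <- cars %>%
  summarise(mean_mpg = mean(mpg), max_weight_lb = max(weight_lb))
```

Each verb takes a data frame as its first argument and returns a data frame, which is what lets the pipe pass the result of one step directly into the next.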
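Installing the dplyr package:
The dplyr package can be installed from CRAN with install.packages(). To install the development version from GitHub instead, use the install_github() function from the devtools package.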
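The original installation commands are not shown, so the following is a hedged sketch of what they typically look like. The repository path tidyverse/dplyr is an assumption about where the package is currently hosted, and the GitHub route assumes the devtools package is available.

```r
# Released version from CRAN
install.packages("dplyr")

# Development version from GitHub (assumes devtools is installed;
# the repository path is an assumption, not taken from the text)
# install.packages("devtools")
devtools::install_github("tidyverse/dplyr")
```

Only one of the two routes is needed. The CRAN release is usually the safer choice unless a feature from the development version is required.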
After installing the package it is important that you load it into your R session with the library() function.
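A minimal sketch of loading the package, assuming it has already been installed:

```r
# Attach dplyr so its functions can be called without the dplyr:: prefix
library(dplyr)

# Optional: check which version was loaded
packageVersion("dplyr")
```

The library() call needs to be repeated in every new R session, whereas installation only has to be done once.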
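Loading the package prints a startup message similar to the following:
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
    filter, lag
The following objects are masked from ‘package:base’:
    intersect, setdiff, setequal, union
This message is not an error. It means that dplyr defines functions with the same names as functions in the stats and base packages, and that the dplyr versions are used by default while the package is loaded.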
In the following section, the key functions of the dplyr package are discussed one by one.