dplyr Package – %>%
The pipe operator %>% is very handy for stringing together multiple dplyr functions in a sequence of operations. A pipe takes the output of one function and passes it as the first argument of the next function. Without pipes, applying more than one function buries the sequence in nested calls that have to be read from the inside out, e.g. third(second(first(x))). This nesting is not a natural way to think about a sequence of operations. The %>% operator lets you write the same operations left to right, e.g. first(x) %>% second %>% third.
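To make the nested-versus-piped comparison concrete, here is a minimal sketch using base R functions on a made-up numeric vector. The vector x and the choice of sqrt(), mean() and round() are illustrative assumptions, not part of the examples that follow.

```r
library(dplyr)  # dplyr re-exports the %>% operator from magrittr

# Hypothetical input vector, chosen only for illustration
x <- c(1, 4, 9, 16)

# Nested form: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: read left to right; each result becomes the
# first argument of the next call
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)  # TRUE
```

Extra arguments, such as the 1 passed to round(), go after the piped value, which always fills the first argument slot.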
For the examples in this section we will use swiss, a data set built into R. First load the data set with the data("swiss") command. To open the help file for the swiss data, type ?swiss. Don't forget to load the dplyr package.
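library(dplyr)
# The swiss data set lives in the datasets package
library(datasets)
# OR load it explicitly into the workspace
data("swiss")
# Open the help page for the data set
?swiss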
You can see some basic characteristics of the data set with the dim(), str() and names() functions.
dim(swiss)
str(swiss)
names(swiss)
Output:
> dim(swiss)
[1] 47 6
> str(swiss)
'data.frame': 47 obs. of 6 variables:
$ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
$ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
$ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
$ Education : int 12 9 5 7 15 7 7 8 7 13 ...
$ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
$ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
> names(swiss)
[1] "Fertility" "Agriculture" "Examination" "Education" "Catholic" "Infant.Mortality"
Example 1: Select columns using select()
In this example, we pipe the swiss data frame into select(), which keeps two columns (Examination and Education), and then pipe the resulting data frame into head(), which returns its first six rows.
swiss %>%
select(Examination, Education) %>%
head
Output:
Examination Education
Courtelary 15 12
Delemont 6 9
Franches-Mnt 5 5
Moutier 12 7
Neuveville 17 15
Porrentruy 9 7
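A pipe is only another way of writing a function call, so the same result can be obtained with nested calls. The sketch below assumes dplyr is installed and reads the data directly as datasets::swiss. The variable names piped and nested, and the choice of n = 3, are illustrative.

```r
library(dplyr)

# Piped form: select two columns, then keep the first three rows
piped <- datasets::swiss %>%
  select(Examination, Education) %>%
  head(n = 3)

# Equivalent nested form of the same two calls
nested <- head(select(datasets::swiss, Examination, Education), n = 3)

identical(piped, nested)
dim(piped)
```

Writing head(n = 3) rather than a bare head shows how additional arguments are supplied: the piped data frame fills the first argument and everything else is passed as usual.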
Example 2: Arrange or re-order rows using arrange()
Here we select three columns from the swiss data, sort the rows by Examination (breaking ties by Education) with arrange(), and then use filter() to keep only the rows where Examination is greater than or equal to 15 and Education is greater than 10.
swiss %>%
select(Agriculture, Examination, Education) %>%
arrange(Examination, Education) %>%
filter(Examination >= 15 & Education > 10)
Output:
Agriculture Examination Education
1 17.0 15 12
2 45.2 16 13
3 46.6 16 29
4 43.5 17 15
5 60.7 19 12
6 62.0 21 12
7 50.9 22 12
8 16.7 22 13
9 27.7 22 29
10 26.8 25 19
11 38.4 26 12
12 19.4 26 28
13 7.7 29 11
14 15.2 31 20
15 17.6 35 32
16 1.2 37 53
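The order of the arrange() and filter() steps does not change the final result here, and filtering first means fewer rows need sorting. The following sketch, which reads the data as datasets::swiss and uses illustrative variable names, compares both orders and also sorts in descending order with desc().

```r
library(dplyr)

# Same pipeline as above: sort first, then filter
sorted_then_filtered <- datasets::swiss %>%
  select(Agriculture, Examination, Education) %>%
  arrange(Examination, Education) %>%
  filter(Examination >= 15 & Education > 10)

# Filter first, then sort; comma-separated conditions in
# filter() are combined with AND, just like &
filtered_then_sorted <- datasets::swiss %>%
  select(Agriculture, Examination, Education) %>%
  filter(Examination >= 15, Education > 10) %>%
  arrange(Examination, Education)

# Both orders keep the same rows in the same sequence
identical(sorted_then_filtered$Agriculture, filtered_then_sorted$Agriculture)

# desc() reverses the sort direction for one column
top_exam <- filtered_then_sorted %>%
  arrange(desc(Examination))
head(top_exam, 3)
```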
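Example 3: Create new columns using mutate()
We can use the mutate() function to add new columns to a data frame. Here we create a new column called Examination_Education, the product of Examination and Education. Note that the result is assigned back to swiss, so the data frame in the current session is overwritten with the extended version.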
swiss <- swiss %>%
mutate(Examination_Education = Examination * Education)
head(swiss)[, c("Examination", "Education", "Examination_Education")]
Output:
Examination Education Examination_Education
1 15 12 180
2 6 9 54
3 5 5 25
4 12 7 84
5 17 15 255
6 9 7 63
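mutate() can create several columns in one call, and a later column may refer to one created earlier in the same call. The sketch below is one possible variation: it stores the result under a new name (swiss_extra, a hypothetical name) so the original data is left untouched, and it adds a second, purely illustrative column Exam_Edu_Root.

```r
library(dplyr)

# Store the result under a new name instead of overwriting swiss
swiss_extra <- datasets::swiss %>%
  mutate(
    Examination_Education = Examination * Education,
    # Refers to the column created on the previous line
    Exam_Edu_Root = sqrt(Examination_Education)
  )

# Show only the columns involved in the calculation
head(
  swiss_extra[, c("Examination", "Education",
                  "Examination_Education", "Exam_Edu_Root")],
  3
)
```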
Example 4: Create summaries of the data frame using summarise()
The summarise() function collapses a data frame into summary statistics, such as the mean of a column. For example, to compute the average Examination value, apply the mean() function to the Examination column and name the result Mean_Exam. Other summary functions you could use include sd(), min(), max(), median(), sum(), n() (returns the number of rows in the current group), first() (returns the first value in a vector), last() (returns the last value in a vector) and n_distinct() (returns the number of distinct values in a vector).
swiss %>%
summarise(Mean_Exam = mean(Examination), Max_Edu = max(Education), Min_Agri = min(Agriculture))
Output:
Mean_Exam Max_Edu Min_Agri
1 16.48936 53 1.2
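Several of the other summary functions listed above can be combined in a single summarise() call. The following is a minimal sketch that reads the data as datasets::swiss. The output column names (N, First_Exam and so on) are illustrative choices.

```r
library(dplyr)

exam_summary <- datasets::swiss %>%
  summarise(
    N            = n(),                   # number of rows
    First_Exam   = first(Examination),    # value in the first row
    Last_Exam    = last(Examination),     # value in the last row
    Median_Exam  = median(Examination),
    SD_Exam      = sd(Examination),
    Distinct_Edu = n_distinct(Education)  # number of distinct Education values
  )

exam_summary
```

The result is a data frame with one row and one column per summary, so it can be piped into further dplyr steps like any other data frame.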