dplyr Package – filter()
Filter rows with filter():
The filter() function is used to extract subsets of rows from a data frame. This function is similar to the existing subset() function in R but is quite a bit faster in my experience.
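As a rough illustration of how filter() relates to subset(), the sketch below applies both to a small made-up data frame. The scores data frame and its column names are hypothetical and exist only for this example.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
scores <- data.frame(
  id    = 1:5,
  score = c(42, 87, 55, 91, 60)
)

# dplyr: keep the rows where score is above 58
high_dplyr <- filter(scores, score > 58)

# Base R equivalent
high_base <- subset(scores, score > 58)

# Both approaches select the same rows
print(high_dplyr$id)
print(high_base$id)
```

Both calls take the data frame first and a logical condition second, so switching from one to the other usually requires little more than changing the function name.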
For the examples in this section we will use iris, a data set built into R. Load it with the data("iris") command. To view the help file for iris, type ?iris. Don't forget to load the dplyr package.
library(dplyr)
library(datasets)
# OR
data("iris")
# Open the help page for the data set
?iris
You can see some basic characteristics of the dataset with the dim() and summary() functions.
dim(iris)
summary(iris)
Output:
> dim(iris)
[1] 150   5
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
Example:
Suppose you want to extract the rows of the iris data frame where Sepal.Length is greater than the mean Sepal.Length (i.e. more than 5.843).
x <- filter(iris, Sepal.Length > 5.843)
head(x)
summary(x)
Output:
> head(x)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          7.0         3.2          4.7         1.4 versicolor
2          6.4         3.2          4.5         1.5 versicolor
3          6.9         3.1          4.9         1.5 versicolor
4          6.5         2.8          4.6         1.5 versicolor
5          6.3         3.3          4.7         1.6 versicolor
6          6.6         2.9          4.6         1.3 versicolor
> summary(x)
  Sepal.Length   Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :5.90   Min.   :2.200   Min.   :4.000   Min.   :1.000   setosa    : 0
 1st Qu.:6.20   1st Qu.:2.800   1st Qu.:4.700   1st Qu.:1.500   versicolor:26
 Median :6.45   Median :3.000   Median :5.100   Median :1.800   virginica :44
 Mean   :6.58   Mean   :2.970   Mean   :5.239   Mean   :1.811
 3rd Qu.:6.80   3rd Qu.:3.175   3rd Qu.:5.700   3rd Qu.:2.100
 Max.   :7.90   Max.   :3.800   Max.   :6.900   Max.   :2.500
The Species counts in the summary show that filtering returns only 70 of the 150 rows. Of these 70 rows, 44 belong to virginica and 26 to versicolor; no setosa rows pass the filter.
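The threshold does not have to be typed by hand. As a minimal sketch, the mean can be computed inside the filter() call itself, and the row and species counts can be checked directly with nrow() and table(). The variable name above_mean is chosen here only for illustration.

```r
library(dplyr)
data("iris")

# Compute the threshold inside filter() instead of hard-coding 5.843
above_mean <- filter(iris, Sepal.Length > mean(Sepal.Length))

# Number of rows kept by the filter
nrow(above_mean)

# Number of rows per species after filtering
table(above_mean$Species)
```

Computing the mean in place keeps the filter correct if the data changes, whereas a hard-coded value would have to be updated by hand.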
Example 2:
You can also apply multiple conditions, as in the following example.
y <- filter(iris, Sepal.Length > 5.843 & Petal.Length > 5.1)
summary(y$Sepal.Length)
summary(y$Petal.Length)
summary(y$Species)
Output:
> summary(y$Sepal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  6.100   6.400   6.700   6.862   7.200   7.900
> summary(y$Petal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  5.200   5.525   5.700   5.826   6.075   6.900
> summary(y$Species)
    setosa versicolor  virginica
         0          0         34
After applying both conditions, filter() returns only 34 rows. All 34 belong to the virginica species.
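filter() also accepts several conditions as separate, comma-separated arguments, which it combines with a logical AND. The sketch below compares that form with the explicit & version and shows | for cases where a row needs to satisfy only one condition. The variable names y_comma, y_and and y_or are illustrative only.

```r
library(dplyr)
data("iris")

# Conditions separated by commas are combined with AND
y_comma <- filter(iris, Sepal.Length > 5.843, Petal.Length > 5.1)

# The same filter written with an explicit &
y_and <- filter(iris, Sepal.Length > 5.843 & Petal.Length > 5.1)

# Use | when a row only needs to satisfy one of the conditions
y_or <- filter(iris, Sepal.Length > 5.843 | Petal.Length > 5.1)

nrow(y_comma)
nrow(y_and)
nrow(y_or)
```

The comma form and the & form select the same rows, so the choice between them is mostly a matter of readability. The | form can only return the same number of rows or more.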