split function in R
The split() function takes a vector or other object and splits it into groups determined by a factor or a list of factors. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and then apply a function over those subsets.
You can open the help file by typing ?split in the R console.
?split
The arguments of split() can be shown by just typing split in your R console.
split
Output:
function (x, f, drop = FALSE, ...)
Here,
- x is a vector (or list) or data frame
- f is a factor (or an object that can be coerced to one) or a list of factors
- drop indicates whether empty factor levels (levels that do not occur in the data) should be dropped
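To make the f and drop arguments concrete, here is a minimal sketch. The objects vals, grp, and size are made up for this illustration and are not used elsewhere in this tutorial.

```r
# Hypothetical toy data, invented for this illustration
vals <- c(10, 20, 30, 40)
grp  <- factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))  # level "c" never occurs

# Default (drop = FALSE): the unused level "c" is kept as an empty group
keep_all <- split(vals, grp)
names(keep_all)
length(keep_all$c)

# drop = TRUE: levels that do not occur are left out of the result
dropped <- split(vals, grp, drop = TRUE)
names(dropped)

# f can also be a list of factors; the groups are the combinations of their levels
size <- factor(c("small", "large", "small", "large"))
by_both <- split(vals, list(grp, size), drop = TRUE)
names(by_both)
```

When f is a list of factors, the group names join the level names with a dot, for example a.small.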
Example:
Here we will simulate some data and split it according to a factor variable. Note that the gl() function is used to “generate levels” in a factor variable.
set.seed(1)
x <- runif(20, min = 155, max = 180)  # simulate 20 random heights
y <- gl(2, 10, labels = c("Male", "Female"))  # factor with 2 levels, each repeated 10 times
s <- split(x, y)
s
lapply(s, mean)
Output:
> s
$Male
[1] 161.6377 164.3031 169.3213 177.7052 160.0420 177.4597 178.6169 171.5199 170.7279 156.5447

$Female
[1] 160.1494 159.4139 172.1756 164.6026 174.2460 167.4425 172.9405 179.7977 164.5009 174.4361

> lapply(s, mean)
$Male
[1] 168.7878

$Female
[1] 168.9705
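The split-then-apply pattern shown above has shorter equivalents in base R. As a sketch that recreates the same x and y: sapply() simplifies the result to a named vector, and tapply() splits and applies in a single call. The names means_sapply and means_tapply are chosen here for illustration.

```r
set.seed(1)
x <- runif(20, min = 155, max = 180)          # same simulated heights as above
y <- gl(2, 10, labels = c("Male", "Female"))

# sapply() returns a named numeric vector instead of a list
means_sapply <- sapply(split(x, y), mean)

# tapply() splits x by y and applies mean() in one step
means_tapply <- tapply(x, y, mean)

means_sapply
means_tapply
```

Both calls give the same group means as lapply(s, mean). Only the shape of the returned object differs.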
Split a Data Frame:
Here we will use a dataset called airquality. To get the help file just type ?airquality. Check the structure of the data set using str(airquality).
?airquality
library(datasets)
str(airquality)
Output:
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA …
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 …
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 …
$ Temp : int 67 72 74 62 56 66 65 59 61 69 …
$ Month : int 5 5 5 5 5 5 5 5 5 5 …
$ Day : int 1 2 3 4 5 6 7 8 9 10 …
You can split the airquality data frame by the Month variable using the following code.
mydata <- split(airquality, airquality$Month)
str(mydata)
Output:
List of 5
$ 5:'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 41 36 12 18 NA 28 23 19 8 NA …
..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 …
..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 …
..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 …
..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 …
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 …
$ 6:'data.frame': 30 obs. of 6 variables:
..$ Ozone : int [1:30] NA NA NA NA NA NA 29 NA 71 39 …
..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 …
..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 …
..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 …
..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 …
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 …
$ 7:'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 135 49 32 NA 64 40 77 97 97 85 …
..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 …
..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 …
..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 …
..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 …
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 …
$ 8:'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 39 9 16 78 35 66 122 89 110 NA …
..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 …
..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 …
..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 …
..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 …
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 …
$ 9:'data.frame': 30 obs. of 6 variables:
..$ Ozone : int [1:30] 96 78 73 91 47 32 20 23 21 24 …
..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 …
..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 …
..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 …
..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 …
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 …
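The result is a list whose element names are the month values converted to character strings. A short sketch of how one sub-data frame might be pulled out by name (the variable may is introduced here only for illustration):

```r
library(datasets)
mydata <- split(airquality, airquality$Month)

names(mydata)             # the month values as character strings
may <- mydata[["5"]]      # sub-data frame for month 5; note the quotes
nrow(may)
sapply(mydata, nrow)      # number of rows in each month's data frame
```

Note that mydata[[5]] without quotes selects the fifth element of the list, which is month 9, not month 5.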
Then, you can take the column means for Ozone, Solar.R, and Wind for each sub-data frame using the following code.
sapply(mydata, function(x) {colMeans(x[, c("Ozone", "Solar.R", "Wind")])})
Output:
               5         6          7        8        9
Ozone         NA        NA         NA       NA       NA
Solar.R       NA 190.16667 216.483871       NA 167.4333
Wind    11.62258  10.26667   8.941935 8.793548  10.1800
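The NA entries appear because colMeans() returns NA for any column that contains missing values. One way around this, sketched below, is to pass na.rm = TRUE so that missing values are removed before each mean is computed. The name monthly_means is chosen here for illustration.

```r
library(datasets)
mydata <- split(airquality, airquality$Month)

# na.rm = TRUE drops missing values within each column before averaging
monthly_means <- sapply(mydata, function(x) {
  colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)
})
monthly_means
```

The Wind row is unchanged because that column has no missing values, while the Ozone and Solar.R rows now hold means computed from the observed values only.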