I have been using these 2 methods interchangeably to subset data from a data frame in R.
Method 1
subset_df <- df[which(df$age > 5), ]
Method 2
subset_df <- subset(df, age > 5)
I have 2 questions about these.
1. Which one is faster, considering I have very large data?
2. The post "Subsetting data frames in R" suggests that there is in fact a difference between the 2 methods above: one of them handles NA values correctly. Which one is safe to use, then?
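The NA behaviour in question can be reproduced with a small example. This is only a sketch: the data frame is made up for illustration, and plain logical indexing is included as a third variant for comparison.

```r
# Made-up data with one missing age (an assumption for illustration)
df <- data.frame(id = 1:5, age = c(3, 7, NA, 10, 5))

# Plain logical indexing: the NA in the condition produces an all-NA row
plain <- df[df$age > 5, ]

# Method 1: which() returns only the positions that are TRUE, so NA is dropped
method1 <- df[which(df$age > 5), ]

# Method 2: subset() treats an NA condition as FALSE, so the row is dropped
method2 <- subset(df, age > 5)

print(plain)
print(method1)
print(method2)
```

On this data, both methods from the question drop the row with the missing age, while plain logical indexing keeps it as a row of NAs.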
The question asks for a faster way to subset rows of a data frame. The fastest way is with data.table.
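A comparison along these lines could be sketched as follows. It assumes the data.table and microbenchmark packages are installed; the data size, the column names, and the number of repetitions are arbitrary choices for illustration, and timings will vary by machine.

```r
library(data.table)
library(microbenchmark)

set.seed(1)
n  <- 1e5                                   # arbitrary size, chosen for illustration
df <- data.frame(id = seq_len(n),
                 age = sample(0:10, n, replace = TRUE))
dt <- as.data.table(df)                     # data.table copy of the same data

# Time the three approaches on identical data
timings <- microbenchmark(
  which  = df[which(df$age > 5), ],
  subset = subset(df, age > 5),
  dt     = dt[age > 5],
  times  = 10
)
print(timings)
```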
So in this simple case data.table is a little more than twice as fast as which(...), and more than 6 times faster than subset(...).
I rewrote the code, adding:
the subsetting operator [[;
filter from the "dplyr" package;
a function that uses standard evaluation.
The best results were for data.table and for df %>% filter(age > 5). So a plain data.frame with dplyr can also be a good option.
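The three added variants might look like the sketch below. It assumes dplyr is installed; the data frame and the helper name subset_gt are hypothetical and serve only as an illustration.

```r
library(dplyr)

# Made-up data with one missing age (an assumption for illustration)
df <- data.frame(id = 1:6, age = c(3, 7, NA, 10, 5, 8))

# Variant 1: extract the column with [[ instead of $
by_bracket <- df[which(df[["age"]] > 5), ]

# Variant 2: dplyr::filter keeps only rows where the condition is TRUE
by_filter <- df %>% filter(age > 5)

# Variant 3: a hypothetical helper using standard evaluation,
# i.e. the column is passed as a string rather than a bare name
subset_gt <- function(data, column, threshold) {
  data[which(data[[column]] > threshold), , drop = FALSE]
}
by_function <- subset_gt(df, "age", 5)
```

All three variants drop the row whose age is NA, which matches the behaviour of which(...) and subset(...).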