I am looking for ways to speed up my code, and I am looking into the apply/*ply methods as well as data.table. Unfortunately, I am running into problems.

Here is a small sample of data:
ids1   <- c(1, 1, 1, 1, 2, 2, 2, 2)
ids2   <- c(1, 2, 3, 4, 1, 2, 3, 4)
chars1 <- c("aa", " bb ", "__cc__", "dd ", "__ee", NA, NA, "n/a")
chars2 <- c("vv", "_ ww_", " xx ", "yy__", " zz", NA, "n/a", "n/a")
data <- data.frame(col1 = ids1, col2 = ids2,
                   col3 = chars1, col4 = chars2,
                   stringsAsFactors = FALSE)
Here is a solution using loops:
library("plyr")
cols_to_fix <- c("col3", "col4")
for (i in seq_along(cols_to_fix)) {
  data[, cols_to_fix[i]] <- gsub("_", "", data[, cols_to_fix[i]])    # drop underscores
  data[, cols_to_fix[i]] <- gsub(" ", "", data[, cols_to_fix[i]])    # drop spaces
  data[, cols_to_fix[i]] <- ifelse(data[, cols_to_fix[i]] == "n/a",  # "n/a" string -> NA
                                   NA, data[, cols_to_fix[i]])
}
I initially looked at ddply, but some methods I want to use only take vectors, so I cannot figure out how to use ddply across just certain columns, one by one. Also, I have been looking at laply, but I want to return the original data.frame with the changes. Can anyone help me? Thank you.
Based on the suggestions from earlier, here is what I tried to use from the plyr package.
Option 1:

data[, cols_to_fix] <- aaply(data[, cols_to_fix], 2, function(x) {
  x <- gsub("_", "", x, perl = TRUE)
  x <- gsub(" ", "", x, perl = TRUE)
  x <- ifelse(x == "n/a", NA, x)
}, .progress = "text", .drop = FALSE)

Option 2:

data[, cols_to_fix] <- alply(data[, cols_to_fix], 2, function(x) {
  x <- gsub("_", "", x, perl = TRUE)
  x <- gsub(" ", "", x, perl = TRUE)
  x <- ifelse(x == "n/a", NA, x)
}, .progress = "text")

Option 3:

data[, cols_to_fix] <- adply(data[, cols_to_fix], 2, function(x) {
  x <- gsub("_", "", x, perl = TRUE)
  x <- gsub(" ", "", x, perl = TRUE)
  x <- ifelse(x == "n/a", NA, x)
}, .progress = "text")
None of these gives me the correct answer. apply works great, but my data is very large, and the progress bars from the plyr package would be very nice to have. Thanks again.
No need for loops (for or *ply); the whole cleanup can be done in one vectorized pass, as sketched below.
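A minimal sketch of the matrix idea; the combined regex and exact steps are assumptions:

# Coerce the target columns to a character matrix and clean everything
# in one vectorized pass; assigning into tmp[] keeps the matrix shape.
tmp <- as.matrix(data[cols_to_fix])
tmp[] <- gsub("_| ", "", tmp, perl = TRUE)  # drop underscores and spaces
tmp[tmp == "n/a"] <- NA                     # literal "n/a" becomes missing
data[cols_to_fix] <- tmp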
Benchmarks
I only benchmark Arun's data.table solution and my matrix solution. I assume that many columns need to be fixed.
Benchmark code:
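A sketch of what the benchmark might look like, using the microbenchmark package and a scaled-up copy of the sample data (both assumptions):

library(data.table)
library(microbenchmark)

# Assumed setup: replicate the sample rows to get a larger test set.
big <- data[rep(seq_len(nrow(data)), 1e5), ]

microbenchmark(
  matrix = {
    tmp <- as.matrix(big[cols_to_fix])
    tmp[] <- gsub("_| ", "", tmp, perl = TRUE)
    tmp[tmp == "n/a"] <- NA
    res <- big
    res[cols_to_fix] <- tmp
  },
  data.table = {
    dt <- as.data.table(big)
    for (j in cols_to_fix) {
      x <- gsub("_| ", "", dt[[j]], perl = TRUE)
      set(dt, j = j, value = ifelse(x == "n/a", NA_character_, x))
    }
  },
  times = 3L
)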
Benchmark results: data.table is faster only by a factor of three. This advantage could probably be made even smaller if we also changed the data structure (as the data.table solution does) and kept the result a matrix.
Here's a benchmark of all the different answers:
First, all the answers as separate functions (hedged reconstructions are sketched after this list):

1) Arun's
2) Martin's
3) Roland's
4) BrodieG's
5) Josilber's
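Sketches of the five approaches as functions; which answer used which exact regex and idiom is an assumption:

library(data.table)
library(plyr)

# 1) Arun's: data.table::set, updating each column by reference
arun <- function(dt, cols) {
  for (j in cols) {
    x <- gsub("_| ", "", dt[[j]], perl = TRUE)
    set(dt, j = j, value = ifelse(x == "n/a", NA_character_, x))
  }
  dt
}

# 2) Martin's: data.table :=, also updating by reference
martin <- function(dt, cols) {
  dt[, (cols) := lapply(.SD, function(x) {
    x <- gsub("_| ", "", x)
    ifelse(x == "n/a", NA_character_, x)
  }), .SDcols = cols]
  dt
}

# 3) Roland's: one vectorized pass over a character matrix
roland <- function(df, cols) {
  tmp <- as.matrix(df[cols])
  tmp[] <- gsub("_| ", "", tmp, perl = TRUE)
  tmp[tmp == "n/a"] <- NA
  df[cols] <- tmp
  df
}

# 4) BrodieG's: base apply over the columns
brodieg <- function(df, cols) {
  df[, cols] <- apply(df[, cols], 2, function(x) {
    x <- gsub("[_ ]", "", x, perl = TRUE)
    ifelse(x == "n/a", NA, x)
  })
  df
}

# 5) Josilber's: the same cleanup via plyr, column by column
josilber <- function(df, cols) {
  df[cols] <- llply(df[cols], function(x) {
    x <- gsub("_| ", "", x)
    ifelse(x == "n/a", NA, x)
  })
  df
}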
Second, a benchmarking function:
We'll run this function 3 times and take the minimum of the runs (to remove cache effects) as the runtime:
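A sketch of such a function (names and details assumed):

# Time one function on fresh copies of the input and keep the minimum.
bench <- function(fun, df, cols, reps = 3L) {
  min(vapply(seq_len(reps), function(i) {
    input <- if (is.data.table(df)) copy(df) else df  # fresh copy each run
    system.time(fun(input, cols))[["elapsed"]]
  }, numeric(1)))
}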
Third, on (slightly) big data with just 2 columns to fix (as in the OP's example):
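A sketch of the scaled-up data (the size of ~1e6 rows is an assumption):

# Build a big version of the OP's sample by repeating its rows.
big_data <- data[rep(seq_len(nrow(data)), 125000), ]  # 8 rows x 125000 = 1e6 rows
rownames(big_data) <- NULL

bench(roland,  big_data, cols_to_fix)
bench(brodieg, big_data, cols_to_fix)
bench(arun,    as.data.table(big_data), cols_to_fix)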
Here's a data.table solution using set; a sketch follows.

Note: using PCRE (perl = TRUE) gives a nice speed-up, especially on bigger vectors.
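A hedged sketch of the set-based approach (the exact regexes are assumptions):

library(data.table)
dt <- as.data.table(data)

# Update each column by reference with set(); the table is never copied.
for (j in cols_to_fix) {
  x <- gsub("_| ", "", dt[[j]], perl = TRUE)  # PCRE via perl = TRUE
  set(dt, j = j, value = ifelse(x == "n/a", NA_character_, x))
}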
The apply version is the way to go (sketched below). It looks like @josilber came up with the same answer, but this one is slightly different (note the regexp).

More importantly, you generally want to use ddply and data.table when you want to do split-apply-combine analysis. In this case, all your data belongs to the same group (there aren't any subgroups you're doing anything different with), so you might as well use apply.
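A sketch of the apply version; the single combined regex is an assumption standing in for the "slightly different" pattern mentioned above:

# Clean both columns in one pass: apply over the 2nd dimension (columns).
data[, cols_to_fix] <- apply(data[, cols_to_fix], 2, function(x) {
  x <- gsub("[_ ]", "", x, perl = TRUE)  # drop underscores and spaces together
  ifelse(x == "n/a", NA, x)              # the string "n/a" becomes a real NA
})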
The 2 at the center of the apply statement means we want to subset the input by the 2nd dimension, and pass the result (in this case vectors, each representing a column from your data frame in cols_to_fix) to the function that does the work. apply then re-assembles the result, and we assign it back to the columns in cols_to_fix. If we had used 1 instead, apply
would have passed the rows in our data frame to the function. Here is the result:
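(Reproduced by running the cleanup on the sample data above:)

  col1 col2 col3 col4
1    1    1   aa   vv
2    1    2   bb   ww
3    1    3   cc   xx
4    1    4   dd   yy
5    2    1   ee   zz
6    2    2 <NA> <NA>
7    2    3 <NA> <NA>
8    2    4 <NA> <NA>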
If you do have sub-groups, then I recommend you use data.table. Once you get used to the syntax, it's hard to beat for convenience and speed. It will also do efficient joins across data sets.

I think you can do this with regular old apply, which will call your cleanup function on each column (margin = 2):
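A sketch along those lines; the helper name clean_col is hypothetical:

# A named cleanup function applied to each column (MARGIN = 2).
clean_col <- function(x) {
  x <- gsub("_| ", "", x)    # remove underscores and spaces
  ifelse(x == "n/a", NA, x)  # turn the string "n/a" into NA
}
data[, cols_to_fix] <- apply(data[, cols_to_fix], 2, clean_col)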
Edit: it sounds like you're requiring the use of the plyr package. I'm not an expert in plyr, but this seemed to work:
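One plausible plyr route (an assumption, not necessarily this answer's original code) is llply, reusing the clean_col helper from above; it also gives the progress bar the OP asked about:

library(plyr)
# llply maps over the columns as a list; .progress = "text" shows a bar.
data[cols_to_fix] <- llply(data[cols_to_fix], clean_col, .progress = "text")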
Here is a data.table solution; it should be faster if your table is large. The concept of := is an "update" of the columns: I believe that, because of this, you aren't internally copying the table again, as a "normal" data.frame solution would.
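A hedged sketch of the := approach (details assumed):

library(data.table)
dt <- as.data.table(data)

# := replaces the columns in place, so the table is not copied.
dt[, (cols_to_fix) := lapply(.SD, function(x) {
  x <- gsub("_| ", "", x)
  ifelse(x == "n/a", NA_character_, x)  # type-stable NA for character columns
}), .SDcols = cols_to_fix]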