New dataframe column as function (digest) of anoth

2019-07-13 08:24发布

问题:

I want to create a new computed column (the digest of the text of another column). For you to reproduce I create a df as reproducible example:

df <- data.frame(name = replicate(1000, paste(sample(LETTERS, 20, replace=TRUE), collapse="")),stringsAsFactors=FALSE)

> head(df,3)
              name
1 ZKBOZVFKNJBRSDWTUEYR
2 RQPHUECABPQZLKZPTFLG
3 FTBVBEQTRLLUGUVHDKAY

Now I want a 2nd column with the digest of the 'name' col for each row This works very well but it is slow (each md5 is different and it is the corresponding digest of the name column):

> df$md5 <- sapply(df$name, digest)   
> head(df, 3)
              name                              md5
1 ZKBOZVFKNJBRSDWTUEYR b8d93a9fe6cefb7a856e79f54bac01f2
2 RQPHUECABPQZLKZPTFLG 52f6acbd939df27e92232904ce094053
3 FTBVBEQTRLLUGUVHDKAY a401a8bc18f0cb367435b77afd353078

But this (using dplyr) does not work and I don't see why: the md5 is the same for each row! In fact it is the digest of the complete df$name, including all the rows. Please, can someone explain to me?

> df <- mutate(df, md5=digest(name))
> head(df, 3)
                  name                              md5
1 ZKBOZVFKNJBRSDWTUEYR 10aa31791d0b9288e819763d9a41efd8
2 RQPHUECABPQZLKZPTFLG 10aa31791d0b9288e819763d9a41efd8
3 FTBVBEQTRLLUGUVHDKAY 10aa31791d0b9288e819763d9a41efd8

Again if I go the data table way, it seems that does not work using the standard way for new variables:

> dt <- data.table(df)
> dt[, md5:=digest(name)]  
> head(dt,3)
                   name                              md5
1: ZKBOZVFKNJBRSDWTUEYR 10aa31791d0b9288e819763d9a41efd8
2: RQPHUECABPQZLKZPTFLG 10aa31791d0b9288e819763d9a41efd8
3: FTBVBEQTRLLUGUVHDKAY 10aa31791d0b9288e819763d9a41efd8

If I force to group then it works again (but slow):

> dt[,md5:=digest(name), by=name]   
> head(dt, 3)
                   name                              md5
1: ZKBOZVFKNJBRSDWTUEYR b8d93a9fe6cefb7a856e79f54bac01f2
2: RQPHUECABPQZLKZPTFLG 52f6acbd939df27e92232904ce094053
3: FTBVBEQTRLLUGUVHDKAY a401a8bc18f0cb367435b77afd353078

I have also tested tapply and works (creating a factor but my real data as millions of rows and it is very slow).

Then, first, can someone explain to me why the dplyr mutate is not taking the value of each row to compute the digest and why the same think happens with data table notation (unless I group)?

and second, is there a faster way do calculate this digest for all the rows?

回答1:

Considering you have a very large dataset, it's better to test the different approaches on a somewhat larger dataset (for this example I use 100000 rows, bigger datasets take ages on my system):

df <- data.frame(name = replicate(1e5, paste(sample(LETTERS, 20, replace=TRUE), collapse="")), stringsAsFactors=FALSE)

First, let's consider several approaches available:

# base R
df$md5 <- sapply(df$name, digest)

# data.table (grouping by name, based on the assumption that all names are unique)
dt[, md5:=digest(name), name]

# data.table with a unique identifier for each row
dt[,indx:=.I][, md5:=digest(name), indx]

# dplyr (grouping by name, based on the assumption that all names are unique)
df %>% group_by(name) %>% mutate(md5=digest(name))

# dplyr with rowwise (from the other answer)
df %>% rowwise() %>% mutate(md5=digest(name))

Second, test which appraoch is the fastest:

library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
          baseR = df$md5 <- sapply(df$name, digest),
          dtbl1 = dt[, md5:=digest(name), name],
          dtbl2 = dt[,indx:=.I][, md5:=digest(name), indx],
          dplyr = df %>% group_by(name) %>% mutate(md5=digest(name)),
          rowwi = df %>% rowwise() %>% mutate(md5=digest(name)))

which gives:

   test elapsed relative
2 dtbl1  77.878    1.000
3 dtbl2  78.343    1.006
1 baseR  81.399    1.045
5 rowwi 118.799    1.525
4 dplyr 129.748    1.666

So, sticking to a base R solution isn't such a bad choice at all. I suspect that the reason why it's slow on your real dataset is probably the digest function and not some misbehavior of a certain package/function.



回答2:

The reason that you are getting the same md5 value is because the digest function is not a vectorized function. To fix this, place rowwise before mutate as in:

df <- data.frame(name = replicate(1000, paste(sample(LETTERS, 20, replace=TRUE), collapse="")),stringsAsFactors=FALSE)
ptm <- proc.time()
df %>% rowwise() %>% mutate(md5=digest(name)) %>% print(n=3)

1  SSYNAIPPMBNICTXCTZMH cf06eaeab2a4b1b3f0fb964e91867702
2  XAFNBFYOXSDIFSSCGKKX 28cb7f90ac14f4a2ee5743a1dce91ac7
3  TMWBHOHWVDSRUPBGKYGS a248a7eb31657555b2bf8b463b7e3ce3
..                  ...                              ...

proc.time() - ptm
user  system elapsed 
0.09    0.00    0.09 

As for the speed, you can see it only took 1/10 of a second on my desktop.