I want to add a computed column holding the digest of the text in another column. Here is a reproducible example (using the digest, dplyr and data.table packages):
library(digest)
library(dplyr)
library(data.table)
df <- data.frame(name = replicate(1000, paste(sample(LETTERS, 20, replace = TRUE), collapse = "")), stringsAsFactors = FALSE)
> head(df,3)
name
1 ZKBOZVFKNJBRSDWTUEYR
2 RQPHUECABPQZLKZPTFLG
3 FTBVBEQTRLLUGUVHDKAY
Now I want a second column with the digest of the 'name' column for each row. The following works (each md5 is different and is the digest of that row's name), but it is slow:
> df$md5 <- sapply(df$name, digest)
> head(df, 3)
name md5
1 ZKBOZVFKNJBRSDWTUEYR b8d93a9fe6cefb7a856e79f54bac01f2
2 RQPHUECABPQZLKZPTFLG 52f6acbd939df27e92232904ce094053
3 FTBVBEQTRLLUGUVHDKAY a401a8bc18f0cb367435b77afd353078
But this (using dplyr) does not work, and I don't see why: the md5 is the same for every row! In fact, it is the digest of the complete df$name vector, covering all the rows. Can someone please explain this to me?
> df <- mutate(df, md5=digest(name))
> head(df, 3)
name md5
1 ZKBOZVFKNJBRSDWTUEYR 10aa31791d0b9288e819763d9a41efd8
2 RQPHUECABPQZLKZPTFLG 10aa31791d0b9288e819763d9a41efd8
3 FTBVBEQTRLLUGUVHDKAY 10aa31791d0b9288e819763d9a41efd8
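A small check of that observation, as a minimal sketch: it assumes the digest package is installed and uses a throwaway vector `x` (not my real data). It suggests that digest() treats the whole vector as one object and returns a single hash, whereas looping over the elements gives one hash per element.

```r
library(digest)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
# Throwaway example vector: 5 random 20-letter strings
x <- replicate(5, paste(sample(LETTERS, 20, replace = TRUE), collapse = ""))

whole   <- digest(x)  # one call on the whole character vector
per_row <- vapply(x, digest, character(1), USE.NAMES = FALSE)  # one call per element

length(whole)    # 1
length(per_row)  # 5
```

If that reading is right, the single value returned for the whole column is simply recycled to every row when it is assigned.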
The same happens with data.table: the standard := syntax for creating a new column gives the same value for every row:
> dt <- data.table(df)
> dt[, md5:=digest(name)]
> head(dt,3)
name md5
1: ZKBOZVFKNJBRSDWTUEYR 10aa31791d0b9288e819763d9a41efd8
2: RQPHUECABPQZLKZPTFLG 10aa31791d0b9288e819763d9a41efd8
3: FTBVBEQTRLLUGUVHDKAY 10aa31791d0b9288e819763d9a41efd8
If I force grouping by name, it works again (but it is slow):
> dt[,md5:=digest(name), by=name]
> head(dt, 3)
name md5
1: ZKBOZVFKNJBRSDWTUEYR b8d93a9fe6cefb7a856e79f54bac01f2
2: RQPHUECABPQZLKZPTFLG 52f6acbd939df27e92232904ce094053
3: FTBVBEQTRLLUGUVHDKAY a401a8bc18f0cb367435b77afd353078
I have also tested tapply, and it works (after creating a factor), but my real data has millions of rows and it is very slow.
So, first: can someone explain why dplyr's mutate does not use each row's value to compute the digest, and why the same thing happens with the data.table notation (unless I group)?
Second: is there a faster way to calculate this digest for all the rows?
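For reference, one candidate I could imagine, as a sketch only: `digest_each` is a hypothetical helper name (it is not part of the digest package). It hashes each distinct string once and maps the results back, so it would only help if names repeat, which is an assumption about the real data; in the random example above every name is unique.

```r
library(digest)

# Hypothetical helper (not part of digest): one hash per element of x.
digest_each <- function(x, algo = "md5") {
  u <- unique(x)  # hash each distinct string only once
  h <- vapply(u, digest, character(1), algo = algo, USE.NAMES = FALSE)
  h[match(x, u)]  # map the hashes back to the original positions
}

x <- c("AAA", "BBB", "AAA")
res <- digest_each(x)
length(res)  # 3
```

Because the helper returns one value per input element, it could presumably be used directly as `mutate(df, md5 = digest_each(name))` or `dt[, md5 := digest_each(name)]` without grouping.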