Recently I have been doing all my data manipulation with dplyr, and it is an excellent tool for that. However, I am unable to melt or cast a data frame using dplyr. Is there any way to do that? Right now I am using reshape2 for this purpose.
I want a dplyr solution for:
require(reshape2)
data(iris)
dat <- melt(iris,id.vars="Species")
The successor to reshape2 is tidyr. The equivalents of melt() and dcast() are gather() and spread(), respectively. The equivalent of your code would then be
library(tidyr)
data(iris)
dat <- gather(iris, variable, value, -Species)
If you have magrittr imported you can use the pipe operator like in dplyr, i.e. write
dat <- iris %>% gather(variable, value, -Species)
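For readers on a newer tidyr: as far as I know, from tidyr 1.0.0 onwards gather() and spread() are superseded by pivot_longer() and pivot_wider(), although they still work. A minimal sketch of the same reshape with both interfaces, assuming tidyr >= 1.0.0 is installed (the object names long_gather and long_pivot are just for illustration):

```r
library(tidyr)

# Older interface, as used in this answer
long_gather <- gather(iris, variable, value, -Species)

# Newer interface (assumes tidyr >= 1.0.0); the new column names are passed as strings
long_pivot <- pivot_longer(iris, cols = -Species,
                           names_to = "variable", values_to = "value")

# Both results hold the same measurements, but the row order differs:
# gather() stacks column by column, pivot_longer() goes row by row.
dim(long_gather)
dim(long_pivot)
```

Note that pivot_longer() returns a tibble rather than a plain data.frame, so printing looks slightly different.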
Note that you need to specify the variable and value names explicitly, unlike in melt(). I find the syntax of gather() quite convenient, because you can just specify the columns you want to be converted to long format, or specify the ones you want to remain in the new data frame by prefixing them with '-' (just like for Species above), which is a bit faster to type than in melt(). However, I've noticed that on my machine at least, tidyr can be noticeably slower than reshape2.
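To illustrate the column selection described above, here is a small sketch of three ways to write the same gather() call on iris. The object names (by_name, by_range, by_exclusion) are made up for this example.

```r
library(tidyr)

# Name the columns to gather one by one ...
by_name <- gather(iris, variable, value,
                  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)

# ... or as a range of adjacent columns ...
by_range <- gather(iris, variable, value, Sepal.Length:Petal.Width)

# ... or name the column to keep as an identifier with a leading '-'
by_exclusion <- gather(iris, variable, value, -Species)

# All three should give the same long data frame
identical(by_name, by_exclusion)
identical(by_range, by_exclusion)
```

The exclusion form is usually the shortest when there are only a few identifier columns and many measurement columns.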
Edit: In reply to @hadley's comment below, I'm posting some timing info comparing the two functions on my PC.
library(microbenchmark)
microbenchmark(
melt = melt(iris,id.vars="Species"),
gather = gather(iris, variable, value, -Species)
)
# Unit: microseconds
# expr min lq median uq max neval
# melt 278.829 290.7420 295.797 320.5730 389.626 100
# gather 536.974 552.2515 567.395 683.2515 1488.229 100
set.seed(1)
iris1 <- iris[sample(1:nrow(iris), 1e6, replace = TRUE), ]  # 1e6 rows resampled from iris
system.time(melt(iris1,id.vars="Species"))
# user system elapsed
# 0.012 0.024 0.036
system.time(gather(iris1, variable, value, -Species))
# user system elapsed
# 0.364 0.024 0.387
sessionInfo()
# R version 3.1.1 (2014-07-10)
# Platform: x86_64-pc-linux-gnu (64-bit)
#
# locale:
# [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
# [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
# [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
# [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C
# [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] reshape2_1.4 microbenchmark_1.3-0 magrittr_1.0.1
# [4] tidyr_0.1
#
# loaded via a namespace (and not attached):
# [1] assertthat_0.1 dplyr_0.2 parallel_3.1.1 plyr_1.8.1 Rcpp_0.11.2
# [6] stringr_0.6.2 tools_3.1.1
In addition, casting can be done with tidyr::spread().
Here is an example:
library(reshape2)
library(tidyr)
library(dplyr)
# example data : `mini_iris`
(mini_iris <- iris[c(1, 51, 101), ])
# melt
(melted1 <- mini_iris %>% melt(id.vars = "Species")) # on reshape2
(melted2 <- mini_iris %>% gather(variable, value, -Species)) # on tidyr
# cast
melted1 %>% dcast(Species ~ variable, value.var = "value") # on reshape2
melted2 %>% spread(variable, value) # on tidyr
To add to the answers above, using @Lovetoken's mini_iris example (this is too complex for a comment), here is an explanation for newcomers who do not understand what is meant by melting and casting.
library(reshape2)
library(tidyr)
library(dplyr)
# example data : `mini_iris`
mini_iris <- iris[c(1, 51, 101), ]
# mini_iris
#     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 1            5.1         3.5          1.4         0.2     setosa
# 51           7.0         3.2          4.7         1.4 versicolor
# 101          6.3         3.3          6.0         2.5  virginica
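Melting takes the data frame and expands it into a long list of values: every measurement column is stacked into a single value column, and a variable column records which original column each value came from. The long format is not storage-efficient, but it can be useful if you need to combine sets of data. Think of an ice cube melting on a tabletop and spreading out.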
melted1 <- mini_iris %>% melt(id.vars = "Species")
nrow(melted1)
# [1] 12
head(melted1)
# Species variable value
# 1 setosa Sepal.Length 5.1
# 2 versicolor Sepal.Length 7.0
# 3 virginica Sepal.Length 6.3
# 4 setosa Sepal.Width 3.5
# 5 versicolor Sepal.Width 3.2
# 6 virginica Sepal.Width 3.3
You can see how the data has now been broken into many rows of values. The column names are now text within the variable column.
Casting reassembles the long data back into a wide data.frame (or data.table).
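As a rough sketch of that round trip using tidyr only (the names long and wide are just for illustration), melting and then casting mini_iris gets you back to one row per species:

```r
library(tidyr)

mini_iris <- iris[c(1, 51, 101), ]

# Melt: wide -> long (one row per Species/measurement pair)
long <- gather(mini_iris, variable, value, -Species)

# Cast: long -> wide (one column per distinct value of `variable`)
wide <- spread(long, variable, value)

# The column order may differ from the original, so compare by name
names(wide)
wide[, names(mini_iris)]
```

Because gather() stores the variable names as character strings, spread() sorts the new columns alphabetically, which is why the columns are reordered by name at the end.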