Must one melt a data frame prior to having it cast? From ?melt:
data molten data frame, see melt.
In other words, is it absolutely necessary to have a data frame molten prior to any acast or dcast operation?
Consider the following:
library("reshape2")
library("MASS")
xb <- dcast(Cars93, Manufacturer ~ Type, mean, value.var="Price")
m.Cars93 <- melt(Cars93, id.vars=c("Manufacturer", "Type"), measure.vars="Price")
xc <- dcast(m.Cars93, Manufacturer ~ Type, mean, value.var="value")
Then:
> identical(xb, xc)
[1] TRUE
So in this case the melt operation seems to have been redundant.
What are the general guiding rules in these cases? How do you decide when a data frame needs to be molten prior to a *cast operation?
Whether or not you need to melt your dataset depends on what form you want the final data to be in and how that relates to what you currently have.
The way I generally think of it is:
- For the LHS of the formula, I should have one or more columns that will become my "id" rows. These will remain as separate columns in the final output.
- For the RHS of the formula, I should have one or more columns whose values will become the new columns across which I will be "spreading" my values. When this is more than one column, dcast will create new columns based on the combination of the values.
- I must have just one column that would feed the values to fill in the resulting "grid" created by these rows and columns.
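As a rough sketch of how these three points map onto the Cars93 call from the question, the roles can be written out explicitly. The variable names `id_cols`, `spread_cols`, `value_col` and `needs_aggregation` are made up for illustration and are not part of reshape2. The check at the end is only a heuristic for whether several values would land in the same cell, which is the situation where an aggregation function such as mean is needed.

```r
library(MASS)  # provides the Cars93 data set

# Roles in dcast(Cars93, Manufacturer ~ Type, mean, value.var = "Price")
id_cols     <- "Manufacturer"  # LHS: stays as a column in the output
spread_cols <- "Type"          # RHS: its values become new column names
value_col   <- "Price"         # the single column that fills the grid

# If any LHS/RHS combination occurs more than once, several values
# compete for one cell, so an aggregation function is required.
needs_aggregation <- anyDuplicated(Cars93[c(id_cols, spread_cols)]) > 0
needs_aggregation
```

Because all three roles are already filled by existing columns, no melt is needed for that call.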
To illustrate with a small example, consider this tiny dataset:
mydf <- data.frame(
A = c("A", "A", "B", "B", "B"),
B = c("a", "b", "a", "b", "c"),
C = c(1, 1, 2, 2, 3),
D = c(1, 2, 3, 4, 5),
E = c(6, 7, 8, 9, 10)
)
Imagine that our possible value variables are columns "D" or "E", but we are only interested in the values from "E". Imagine also that our primary "id" is column "A", and we want to spread the values out according to column "B". Column "C" is irrelevant at this point.
With that scenario, we would not need to melt the data first. We could simply do:
library(reshape2)
dcast(mydf, A ~ B, value.var = "E")
# A a b c
# 1 A 6 7 NA
# 2 B 8 9 10
Compare what happens when you do the following, keeping in mind my three points above:
dcast(mydf, A ~ C, value.var = "E")
dcast(mydf, A ~ B + C, value.var = "E")
dcast(mydf, A + B ~ C, value.var = "E")
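One thing worth noticing in the first of these calls: columns "A" and "C" together do not identify a single row, so more than one value of "E" falls into the same cell. When no aggregation function is given, dcast falls back to length and prints a message saying so. The sketch below makes the aggregation explicit instead. The object names `counts` and `means` are just for illustration, and mean is only one possible choice of aggregate.

```r
library(reshape2)

mydf <- data.frame(
  A = c("A", "A", "B", "B", "B"),
  B = c("a", "b", "a", "b", "c"),
  C = c(1, 1, 2, 2, 3),
  D = c(1, 2, 3, 4, 5),
  E = c(6, 7, 8, 9, 10)
)

# Count how many values of E fall into each A/C cell
counts <- dcast(mydf, A ~ C, value.var = "E", fun.aggregate = length)
counts

# Choose an aggregate deliberately rather than relying on the default
means <- dcast(mydf, A ~ C, value.var = "E", fun.aggregate = mean)
means
```

Any cell in `counts` with a value greater than 1 is a cell where the formula alone does not pin down a single value.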
When is melt required?
Now, let's make one small adjustment to the scenario: We want to spread out the values from both columns "D" and "E" with no actual aggregation taking place. With this change, we need to melt the data first so that the relevant values that need to be spread out are in a single column (point 3 above).
dfL <- melt(mydf, measure.vars = c("D", "E"))
dcast(dfL, A ~ B + variable, value.var = "value")
# A a_D a_E b_D b_E c_D c_E
# 1 A 1 6 2 7 NA NA
# 2 B 3 8 4 9 5 10
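To see why the melt step satisfies point 3, it can help to look at the molten data itself. The sketch below rebuilds the same mydf and dfL as above and inspects the result. Columns not listed in measure.vars are treated as id variables here, which is an assumption worth checking on your own data by passing id.vars explicitly.

```r
library(reshape2)

mydf <- data.frame(
  A = c("A", "A", "B", "B", "B"),
  B = c("a", "b", "a", "b", "c"),
  C = c(1, 1, 2, 2, 3),
  D = c(1, 2, 3, 4, 5),
  E = c(6, 7, 8, 9, 10)
)

dfL <- melt(mydf, measure.vars = c("D", "E"))

# D and E are stacked into one "value" column, and "variable"
# records which original column each value came from.
names(dfL)
nrow(dfL)  # one row per original row per measured variable
```

With "variable" available as an ordinary column, it can go on the RHS of the formula like any other, which is what A ~ B + variable does.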