Must one `melt` a dataframe before having it `cast`?

Posted 2019-04-13 06:41

Must one melt a data frame prior to having it cast? From ?dcast:

data    molten data frame, see melt.

In other words, is it absolutely necessary to have a data frame molten prior to any acast or dcast operation?

Consider the following:

library("reshape2")
library("MASS")

xb <- dcast(Cars93, Manufacturer ~ Type, mean, value.var="Price")
m.Cars93 <- melt(Cars93, id.vars=c("Manufacturer", "Type"), measure.vars="Price")
xc <- dcast(m.Cars93, Manufacturer ~ Type, mean, value.var="value")

Then:

> identical(xb, xc)
[1] TRUE

So in this case the melt operation seems to have been redundant.

What are the general guiding rules in these cases? How do you decide when a data frame needs to be molten prior to a *cast operation?

1 answer
迷人小祖宗
#2 · 2019-04-13 07:03

Whether or not you need to melt your dataset depends on what form you want the final data to be in and how that relates to what you currently have.

The way I generally think of it is:

  1. For the LHS of the formula, I should have one or more columns that will become my "id" rows. These will remain as separate columns in the final output.
  2. For the RHS of the formula, I should have one or more columns whose values become the new columns across which I will be "spreading" my values. When there is more than one such column, dcast will create new columns based on the combinations of their values.
  3. I must have exactly one column that supplies the values to fill in the resulting "grid" created by these rows and columns.

To illustrate with a small example, consider this tiny dataset:

mydf <- data.frame(
  A = c("A", "A", "B", "B", "B"),
  B = c("a", "b", "a", "b", "c"),
  C = c(1, 1, 2, 2, 3),
  D = c(1, 2, 3, 4, 5),
  E = c(6, 7, 8, 9, 10)
)

Imagine that our possible value variables are columns "D" or "E", but we are only interested in the values from "E". Imagine also that our primary "id" is column "A", and we want to spread the values out according to column "B". Column "C" is irrelevant at this point.

With that scenario, we would not need to melt the data first. We could simply do:

library(reshape2)
dcast(mydf, A ~ B, value.var = "E")
#   A a b  c
# 1 A 6 7 NA
# 2 B 8 9 10

Compare what happens when you do the following, keeping in mind my three points above:

dcast(mydf, A ~ C, value.var = "E")
dcast(mydf, A ~ B + C, value.var = "E")
dcast(mydf, A + B ~ C, value.var = "E")
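As a rough sketch of what those three calls should give (the names res1 to res3 are labels introduced here for illustration, and mydf is rebuilt so the block runs on its own):

```r
library(reshape2)

mydf <- data.frame(
  A = c("A", "A", "B", "B", "B"),
  B = c("a", "b", "a", "b", "c"),
  C = c(1, 1, 2, 2, 3),
  D = c(1, 2, 3, 4, 5),
  E = c(6, 7, 8, 9, 10)
)

# A ~ C: some A/C pairs cover more than one row (e.g. A = "A", C = 1),
# so a cell no longer maps to a single value. With no aggregation
# function supplied, dcast falls back to length() and returns counts.
res1 <- dcast(mydf, A ~ C, value.var = "E")

# A ~ B + C: one new column per observed B/C combination (a_1, a_2, ...),
# and each cell holds at most one value of E.
res2 <- dcast(mydf, A ~ B + C, value.var = "E")

# A + B ~ C: A and B together form the id rows, so all five original
# rows stay separate and C alone defines the new columns.
res3 <- dcast(mydf, A + B ~ C, value.var = "E")
```

The first call is the one to watch: it still returns a table, but the table holds counts rather than the values of "E".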

When is melt required?

Now, let's make one small adjustment to the scenario: We want to spread out the values from both columns "D" and "E" with no actual aggregation taking place. With this change, we need to melt the data first so that the relevant values that need to be spread out are in a single column (point 3 above).

dfL <- melt(mydf, measure.vars = c("D", "E"))
dcast(dfL, A ~ B + variable, value.var = "value")
#   A a_D a_E b_D b_E c_D c_E
# 1 A   1   6   2   7  NA  NA
# 2 B   3   8   4   9   5  10
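
To see why melting satisfies point 3 here, it can help to inspect the molten data frame before casting it. This is a minimal sketch that rebuilds mydf so it runs on its own; the name wide is introduced here only for illustration:

```r
library(reshape2)

mydf <- data.frame(
  A = c("A", "A", "B", "B", "B"),
  B = c("a", "b", "a", "b", "c"),
  C = c(1, 1, 2, 2, 3),
  D = c(1, 2, 3, 4, 5),
  E = c(6, 7, 8, 9, 10)
)

# Columns not listed in measure.vars (A, B, C) are treated as id variables.
dfL <- melt(mydf, measure.vars = c("D", "E"))

names(dfL)  # "A" "B" "C" "variable" "value"
nrow(dfL)   # 10: each of the 5 original rows appears once for D and once for E

# "value" is now the single column feeding the grid, and "variable"
# records whether a value came from D or E, so it can go on the RHS.
wide <- dcast(dfL, A ~ B + variable, value.var = "value")
```

After melting, "D" and "E" are no longer two competing value columns: they are levels of "variable", which is what lets them appear in the formula alongside "B".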