I'm using reshape in R to compute aggregate statistics over columns of a data.frame. Here's my data.frame:
> df
  a a b b ID
1 1 1 1 1  1
2 2 3 2 3  2
3 3 5 3 5  3
which is just a little test data.frame to try and understand the reshape package. I melt, and then cast, to try and find the mean of the a's and the b's:
> melt(df, id = "ID") -> df.m
> cast(df.m, ID ~ variable, fun = mean)
  ID a b
1  1 1 1
2  2 2 2
3  3 3 3
Argh! What? Was hoping the mean of c(2,3) was 2.5 and so on. What's going on? Here's a thing:
> df.m
   ID variable value
1   1        a     1
2   2        a     2
3   3        a     3
4   1        a     1
5   2        a     2
6   3        a     3
7   1        b     1
8   2        b     2
9   3        b     3
10  1        b     1
11  2        b     2
12  3        b     3
What's going on? Where did both my 5s go? Do I have a very basic misunderstanding going on here? If so: what is it?
I updated my answer here to fix this: R: aggregate columns of a data.frame
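Apparently, if your data frame doesn't have unique column names, it won't melt properly. You can see this in df.m: the rows that should hold the second a column (1, 3, 5) repeat the values of the first a column (1, 2, 3) instead, and the same happens for b.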
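As a minimal sketch of the repair, assuming the data frame from the question was built with duplicate names (the check.names = FALSE call below is my reconstruction, not the asker's code), base R's make.unique() can disambiguate the names before melting:

```r
# Reconstruct the question's data frame. check.names = FALSE is needed
# to keep the duplicate names; by default data.frame() would rename them.
df <- data.frame(a = c(1, 2, 3), a = c(1, 3, 5),
                 b = c(1, 2, 3), b = c(1, 3, 5),
                 ID = 1:3, check.names = FALSE)

# anyDuplicated() returns 0 when all names are unique
had_duplicates <- anyDuplicated(names(df)) > 0

# make.unique() appends .1, .2, ... to the repeated names
names(df) <- make.unique(names(df))
names(df)
# [1] "a"    "a.1"  "b"    "b.1"  "ID"
```

Any naming scheme works as long as each column name is distinct and the suffix is easy to strip off again later.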
Edit: Instead of having column names of a a a b b, apparently you need to have unique column names for melt() to work properly. Minimally a.1 a.2 a.3 b.1 b.2, or something. After using melt(), your options to get sensible levels for variable are either to use gsub() on the levels of variable to eliminate the disambiguating values, or to use colsplit() to create two new columns. For the dummy names I just gave, that would look like:
This is not a valid data frame because the columns do not have unique names.
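The gsub() route described above can be sketched as follows. This is an illustration under assumptions, not the original answer's code: it assumes the reshape package (not reshape2) is installed, and it uses the question's values with the hypothetical column names a.1, a.2, b.1, b.2.

```r
library(reshape)  # assumption: the original reshape package is installed

# The question's data, with unique (hypothetical) column names
df <- data.frame(a.1 = c(1, 2, 3), a.2 = c(1, 3, 5),
                 b.1 = c(1, 2, 3), b.2 = c(1, 3, 5),
                 ID  = 1:3)

df.m <- melt(df, id.vars = "ID")

# Strip the ".1" / ".2" suffix so that both a columns share the level "a"
# and both b columns share the level "b"
df.m$variable <- factor(gsub("\\.[0-9]+$", "", df.m$variable))

# Average the two values per ID and variable
res <- cast(df.m, ID ~ variable, fun.aggregate = mean)
res
```

With unique names, every original column survives the melt, so the mean for ID 2 comes out as 2.5 and the mean for ID 3 as 4, which is what the question was hoping for.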