I want to reshape my dataframe from long to wide format and I loose some data that I'd like to keep. For the following example:
df <- data.frame(Par1 = unlist(strsplit("AABBCCC","")),
Par2 = unlist(strsplit("DDEEFFF","")),
ParD = unlist(strsplit("foo,bar,baz,qux,bla,xyz,meh",",")),
Type = unlist(strsplit("pre,post,pre,post,pre,post,post",",")),
Val = c(10,20,30,40,50,60,70))
# Par1 Par2 ParD Type Val
# 1 A D foo pre 10
# 2 A D bar post 20
# 3 B E baz pre 30
# 4 B E qux post 40
# 5 C F bla pre 50
# 6 C F xyz post 60
# 7 C F meh post 70
dfw <- dcast(df,
formula = Par1 + Par2 ~ Type,
value.var = "Val",
fun.aggregate = mean)
# Par1 Par2 post pre
# 1 A D 20 10
# 2 B E 40 30
# 3 C F 65 50
this is almost what I need but I would like to have
- some field keeping data from
ParD
field (for example, as single merged string), - number of observations used for aggregation.
i.e. I would like the resulting data.frame to be as follows:
# Par1 Par2 post pre Num.pre Num.post ParD
# 1 A D 20 10 1 1 foo_bar
# 2 B E 40 30 1 1 baz_qux
# 3 C F 65 50 1 2 bla_xyz_meh
I would be grateful for any ideas. For example, I tried to solve the second task by writing in dcast: fun.aggregate=function(x) c(Val=mean(x),Num=length(x))
- but this causes an error.
One Step solution combining
reshape::cast
withplyr::ddply
NOTE that the
dcast
function inreshape2
does not allow multiple aggregate functions to be passed, while thecast
function inreshape
does.Trying to wrap different aggregation expressions into a self-contained function (expressions should yield atomic values)...
Solution in 2 steps using
ddply
( i am not happy with , but I get the result)You could do a merge of two dcasts and an aggregate, here all wrapped into one large expression mostly to avoid having intermediate objects hanging around afterwards:
I'll post but agstudy's puts me to shame:
Yielding:
What a great opprotunity to benchmark! Below are some runs of the
plyr
method (as suggested by @agstudy) compared with thedata.table
method (as suggested by @Arun) using different sample sizes (N = 900, 2700, 10800)Summary:
The
data.table
method outperforms theplyr
method by a factor of 7.5RESULTS
Update : Added results for baseMethod1
Update 2: Added keying the DT as part of the metric
Adding the indexing step to the benchmark for fairness as per @MatthewDowle s comment.
However, presumably, if data.table is used, it will be in place of the data.frame and hence the indexing will occur once and not simply for this procedure
Update 3: Benchmarking @MD's edits to @Arun's original answer