Reshape in R without aggregation (for example MTur

2020-04-10 11:42发布

问题:

Ordinarily, I'd use a pretty basic long-to-wide reshape for this, but it seems to be dropping my aggregation variables. The setup is I had a job on mechanical Turk that I performed in triplicate---I want MTurk1, Mturk2, MTurk3's answers to be their own variables in the data frame but uniquely id'd by a field I input with the job, so that I can compare them against each other with a function later.

Current Format:

> head(mturk)
  AssignmentStatus     Input.id  Input.State  Answer.Q1thing
1         Approved       134231           NY         Myguess
2         Approved       134231           NY         Myguess
3         Approved       134231           NY        BadGuess
4        Submitted       134812           CA         Another
5         Approved       134812           CA         Another
6         Approved       134812           CA         Another

I'd like this to become

Input.id   Input.State Answer.Q1thing.1 Answer.Q1thing.2 Answer.Q1thing.3  AssignmentStatus.1 AssignmentStatus.2  AssignmentStatus.3
134231              NY          Myguess          Myguess         BadGuess          Approved             Approved            Approved
134812              CA          Another          Another          Another         Submitted             Approved            Approved

or ideally, if there's a variable that can redo column names in the operation....

Id               State          Answer1          Answer2          Answer3          Status1               Status2             Status3
134231              NY          Myguess          Myguess         BadGuess          Approved             Approved            Approved
134812              CA          Another          Another          Another         Submitted             Approved            Approved

dat <- reshape(mturk, timevar="Answer.Q1thing", idvar=c("Input.id", "Input.state"), direction="wide")

This seems to be failing because most reshape long-to-wide functions expect that the variable that becomes wide itself a categorical text-field---that is, this is not a long-to-wide reshape operation because I don't want a variable named "MyGuess" "BadGuess" and "Another", but I want a generic "Answer.X" variable containing these values. I'm not trying to aggregate in any way, such as mean or sum, just list the value in a new place.

So, two directions for this question:

  1. Does this kind of operation have another name? Is this an unfold, unpivot, uncast or something?
  2. How to do this in R?

回答1:

If your data is in a data.table it's a one-liner can be done as follows:

library(data.table)    
mturk.dt <- as.data.table(mturk)

mturk.dt[, as.list(
         rbind(c(Answer.Q1thing, AssignmentStatus))
         )
        , by=list(Id=Input.id, State=Input.State)]

Note that the by argument handles the name-changing too!


If you want to properly name the other columns, use setnames after the fact or, more dynamically, using setattr within the j=.. argument as follows:

After the Fact:

## Assuming 'res' is the reshaped data.table form above:
## Change the names of the six V1, V2.. columns 
setnames(res, paste0("V", 1:6), c(paste0("Answer", 1:3), paste0("Status", 1:3)))

Dynamically, in j=..

## Use `as.data.table` instead of `as.list`, to preserve new names
mturk.dt[, as.data.table(
         rbind(c(
              setattr(Answer.Q1thing,   "names", paste0("Answer", seq(Answer.Q1thing  )))
            , setattr(AssignmentStatus, "names", paste0("Status", seq(AssignmentStatus)))
            ))
         )
        , by=list(Id=Input.id, State=Input.State)]

       Id State Answer1 Answer2  Answer3  Status1  Status2  Status3
1: 134231    NY Myguess Myguess BadGuess Approved Approved Approved
2: 134812    CA Myguess Myguess BadGuess Approved Approved Approved


回答2:

Using plyr:

res = ddply(dat,.(Input.id,Input.State),
            function(x)unlist(as.character(x$Answer.Q1thing)))
setNames(res,c('Id','State','Answer1','Answer2','Answer3'))
  Id State Answer1 Answer2  Answer3
1 134231    NY Myguess Myguess BadGuess
2 134812    CA Another Another  Another

EDIT

In case you have fewer than 3 answers:

res = ddply(dat,.(Input.id,Input.State),
            function(x)
              {
              xx= unlist(as.character(x$Answer.Q1thing))
              if(length(xx)==3)xx
              else c(xx,rep(NA,3-length(xx)))
            })


回答3:

That was super helpful, @Ricardo and @agstudy. I realized my reshape wasn't working only because it demanded a unique, categorical "timevar". I think in most cases, you do have a categorical label/factor like this that makes it easy, but it is not much harder to count them, and make the count into a label.

I had a second problem was that my number of answers were not consistent; you both gave good help for that, but I was also able to just generate a counter and then implemented my original long-to-wide.

Where the count was < 3, that is where an Input.id only had 2 Answers, I got NAs for this, which is what I wanted.

So altogether:

mturk$idx <- with(mturk, ave(Input.id, Input.id, FUN=seq_along)) # weird!
dat <- reshape(mturk, timevar="idx", idvar=c("Input.id", "Input.state"), direction="wide")

I used the syntax for counting sequences within a group that I found here. This was a little idiosyncratic in the use of the ave() function, but seems to crop up in a couple other answers. Tried rtl, too, but had no luck. Using ave(x,x,seq_along) seems to mostly be a hack to avoid sorting. It's odd to use this work-around for sequences in groups because clearly both count() and rtl() are effectively creating this sequencing under the hood in a temp variable.

I like the way data.table allows this sequencing better.



回答4:

From data.table v1.9.5+, dcast can handle multiple value.var columns, i.e., we can cast multiple columns simultaneously. We can simply do:

dt[, id := seq_len(.N), by=Input.id]
dcast(dt, Input.id + Input.State ~ id, 
        value.var=c("AssignmentStatus", "Answer.Q1thing"))
#    Input.id Input.State 1_AssignmentStatus 2_AssignmentStatus 3_AssignmentStatus
# 1:   134231          NY           Approved           Approved           Approved
# 2:   134812          CA          Submitted           Approved           Approved
#    1_Answer.Q1thing 2_Answer.Q1thing 3_Answer.Q1thing
# 1:          Myguess          Myguess         BadGuess
# 2:          Another          Another          Another

Or everything together in one line:

dcast(dt, Input.id + Input.State ~ dt[, seq_len(.N), by=Input.id]$V1, 
                 value.var=c("AssignmentStatus", "Answer.Q1thing"))