Aggregating factor level counts - by factor

2019-07-24 05:17发布

问题:

I have been trying to make a table displaying the counts of factor levels by another factor. For this, I looked on dozens of pages, questions... trying to use functions in some packages (dplyr, reshape) to get the job done, without any success in using them correctly.

That's what I got:

# my data:
var1 <- c("red","blue","red","blue","red","red","red","red","red","red","red","red","blue","red","blue")
var2 <- c("0","1","0","0","0","0","0","0","0","0","1","0","0","0","0")
var3 <- c("2","2","1","1","1","3","1","2","1","1","3","1","1","2","1")
var4 <- c("0","1","0","0","0","0","1","0","1","1","0","1","0","1","1")
mydata <- data.frame(var1,var2,var3,var4)
head(mydata)

Attempt n+1: displays only the total counts of factors by another factor.

t(aggregate(. ~ var1, mydata, sum))

      [,1]   [,2] 
var1 "blue" "red"
var2 " 5"   "12" 
var3 " 5"   "18" 
var4 " 6"   "16" 

Attempt n+2: it's the correct format but I couldn't get it to work on more than one factor.

library(dplyr)
data1 <- ddply(mydata, c("var1", "var3"), summarise,
            N    = length(var1))
library(reshape)
df1 <- cast(data1, var1 ~ var3, sum)
df1 <- t(df1)
df1

   blue red
1    3   6
2    1   3
3    0   2

What I would like is:

        blue red
var2.0    3  10
var2.1    1   1
var3.1    3   6
var3.2    1   3
var3.3    0   2
var4.0    2   6
var4.1    2   5

How can I get this format? So many thanks in advance,

回答1:

We can melt the dataset by 'var1' and then use table

library(reshape2)
tbl <- table(transform(melt(mydata, id.var="var1"),
        varN = paste(variable, value, sep="."))[c(4,1)])
names(dimnames(tbl)) <- NULL
tbl 
#
#         blue red
#  var2.0    3  10
#  var2.1    1   1
#  var3.1    3   6
#  var3.2    1   3
#  var3.3    0   2
#  var4.0    2   6
#  var4.1    2   5

Or using dplyr/tidyr, we convert the dataset from 'wide' to 'long' format with gather, then unite the columns ('var', 'val') to create 'varV', get the frequency (tally) after grouping by 'var1' and 'varV', and then spread to 'wide' format.

library(dplyr)
library(tidyr)
gather(mydata, var, val, -var1) %>% 
           unite(varV,var, val, sep=".") %>%
           group_by(var1, varV) %>% 
           tally() %>% 
           spread(var1, n, fill = 0)
#    varV  blue   red
#   <chr> <dbl> <dbl>
#1 var2.0     3    10
#2 var2.1     1     1
#3 var3.1     3     6
#4 var3.2     1     3
#5 var3.3     0     2
#6 var4.0     2     6
#7 var4.1     2     5