Sum columns based on multiple other factor columns

2019-08-06 05:49发布

I have the following dataframe:

df<-structure(list(totprivland = c(175L, 50L, 100L, 14L, 4L, 240L, 
10L, 20L, 20L, 58L), ncushr8d1 = c(0L, 0L, 0L, 0L, 0L, 30L, 5L, 
0L, 0L, 50L), ncu_CENREG1 = structure(c(4L, 4L, 4L, 4L, 1L, 3L, 
3L, 3L, 4L, 4L), .Label = c("Northeast", "Midwest", "South", 
"West"), class = "factor"), ncushr8d2 = c(75L, 50L, 100L, 14L, 
2L, 30L, 5L, 20L, 20L, 8L), ncu_CENREG2 = structure(c(4L, 4L, 
4L, 4L, 1L, 2L, 1L, 4L, 3L, 4L), .Label = c("Northeast", "Midwest", 
"South", "West"), class = "factor"), ncushr8d3 = c(100L, NA, 
NA, NA, 2L, 180L, 0L, NA, NA, NA), ncu_CENREG3 = structure(c(4L, 
NA, NA, NA, 1L, 1L, 3L, NA, NA, NA), .Label = c("Northeast", 
"Midwest", "South", "West"), class = "factor"), ncushr8d4 = c(NA, 
NA, NA, NA, 0L, NA, NA, NA, NA, NA), ncu_CENREG4 = structure(c(NA, 
NA, NA, NA, 1L, NA, NA, NA, NA, NA), .Label = c("Northeast", 
"Midwest", "South", "West"), class = "factor")), .Names = c("totprivland", 
"ncushr8d1", "ncu_CENREG1", "ncushr8d2", "ncu_CENREG2", "ncushr8d3", 
"ncu_CENREG3", "ncushr8d4", "ncu_CENREG4"), row.names = c(27404L, 
27525L, 27576L, 27822L, 28099L, 28238L, 28306L, 28312L, 28348L, 
28379L), class = "data.frame")

=======

Which is the dput for the following basic idea:

Total    VariableA  LocationA    VariableB     LocationB
30            20       East          10         East
20            20       South         NA         West
115           15       East         100         South
100           50       West          50         West 
35            10       East          25         South  

The total (or totprivland in dput example) is a sum of the variables (ncushr8d1, ncushr8d2, ncushr8d3, and ncushr8d4) and each variable has a corresponding factor location variable (ncu_CENREG1, etc.). There are 6 additional variables and locations in this same pattern. The location variable is oftentimes the same value for multiple numeric variables (e.g. multiple 'East' location values like the first row of the example).

I'd like to get a sum of the values for each row by the common location factor, creating a new column for each location's sum. It would look something like this, with the ability to ignore NA values:

Total    VariableA  LocationA    VariableB     LocationB   TotalWest  TotalEast TotalSouth
30            20       East          10         East          0          30          0
20            20       South         NA         NA            0           0         20
115           15       East         100         South         0          15        100
100           50       West          50         West        100           0          0 
35            10       East          25         South         0          10         25

I've looked into aggregate and splitting but can't seem to figure out how to get them to work across so many columns. I'm also considering a lengthy "if" statement that would rotate through all 8 variables and their corresponding locations, but feel that there must be a better solution for this. The observations are weighted for use in the survey package and I'd like to avoid duplicating the observations and making them "long" with the reshape package, although maybe I can re-combine them later. Any suggestions appreciated!

Much thanks, Luke

标签: r sum aggregate
1条回答
成全新的幸福
2楼-- · 2019-08-06 06:46

First, I'd convert the data frame to a long form in which you have 3 columns: value, location, case. case should indicate from which case (e.g. row) the data came from. order doesn't matter. so your data frame will look something like:

Value    Loc    Case
20       East   1
20       South  2
...
10       East   1

and so forth... one way to do that is to stack your values and locations, and then manually (and easily) add case numbers. suppose your original dataframe is called df, and has values in columns 2,4 and locations in columns 3,5

v.col = stack(df[,c(2,4)])[,1]
v.loc = stack(df[,c(3,5)])[,1]
v.case = rep(1:nrow(df),2)
long.data = data.frame(v.col,v.loc,v.case)    # this is not actually needed, but just so you can view it

now use tapply to create the columns you need

s = tapply(v.col,list(v.case,v.loc),sum,na.rm=T)
new.df = cbind(df,s)

You'll probably need to adjust NA to 0 or something, but this should be easy.

There are also probably easier ways to do this using the plyr/reshape packages, but I am not expert on these.

Hope this helps

查看更多
登录 后发表回答