I have the following dataframe:
df<-structure(list(totprivland = c(175L, 50L, 100L, 14L, 4L, 240L,
10L, 20L, 20L, 58L), ncushr8d1 = c(0L, 0L, 0L, 0L, 0L, 30L, 5L,
0L, 0L, 50L), ncu_CENREG1 = structure(c(4L, 4L, 4L, 4L, 1L, 3L,
3L, 3L, 4L, 4L), .Label = c("Northeast", "Midwest", "South",
"West"), class = "factor"), ncushr8d2 = c(75L, 50L, 100L, 14L,
2L, 30L, 5L, 20L, 20L, 8L), ncu_CENREG2 = structure(c(4L, 4L,
4L, 4L, 1L, 2L, 1L, 4L, 3L, 4L), .Label = c("Northeast", "Midwest",
"South", "West"), class = "factor"), ncushr8d3 = c(100L, NA,
NA, NA, 2L, 180L, 0L, NA, NA, NA), ncu_CENREG3 = structure(c(4L,
NA, NA, NA, 1L, 1L, 3L, NA, NA, NA), .Label = c("Northeast",
"Midwest", "South", "West"), class = "factor"), ncushr8d4 = c(NA,
NA, NA, NA, 0L, NA, NA, NA, NA, NA), ncu_CENREG4 = structure(c(NA,
NA, NA, NA, 1L, NA, NA, NA, NA, NA), .Label = c("Northeast",
"Midwest", "South", "West"), class = "factor")), .Names = c("totprivland",
"ncushr8d1", "ncu_CENREG1", "ncushr8d2", "ncu_CENREG2", "ncushr8d3",
"ncu_CENREG3", "ncushr8d4", "ncu_CENREG4"), row.names = c(27404L,
27525L, 27576L, 27822L, 28099L, 28238L, 28306L, 28312L, 28348L,
28379L), class = "data.frame")
=======
Which is the dput
for the following basic idea:
Total VariableA LocationA VariableB LocationB
30 20 East 10 East
20 20 South NA West
115 15 East 100 South
100 50 West 50 West
35 10 East 25 South
The total (or totprivland in dput example) is a sum of the variables (ncushr8d1, ncushr8d2, ncushr8d3, and ncushr8d4) and each variable has a corresponding factor location variable (ncu_CENREG1, etc.). There are 6 additional variables and locations in this same pattern. The location variable is oftentimes the same value for multiple numeric variables (e.g. multiple 'East' location values like the first row of the example).
I'd like to get a sum of the values for each row by the common location factor, creating a new column for each location's sum. It would look something like this, with the ability to ignore NA values:
Total VariableA LocationA VariableB LocationB TotalWest TotalEast TotalSouth
30 20 East 10 East 0 30 0
20 20 South NA NA 0 0 20
115 15 East 100 South 0 15 100
100 50 West 50 West 100 0 0
35 10 East 25 South 0 10 25
I've looked into aggregate and splitting but can't seem to figure out how to get them to work across so many columns. I'm also considering a lengthy "if" statement that would rotate through all 8 variables and their corresponding locations, but feel that there must be a better solution for this. The observations are weighted for use in the survey package and I'd like to avoid duplicating the observations and making them "long" with the reshape package, although maybe I can re-combine them later. Any suggestions appreciated!
Much thanks, Luke
First, I'd convert the data frame to a long form in which you have 3 columns: value, location, case. case should indicate from which case (e.g. row) the data came from. order doesn't matter. so your data frame will look something like:
and so forth... one way to do that is to stack your values and locations, and then manually (and easily) add case numbers. suppose your original dataframe is called df, and has values in columns 2,4 and locations in columns 3,5
now use tapply to create the columns you need
You'll probably need to adjust NA to 0 or something, but this should be easy.
There are also probably easier ways to do this using the plyr/reshape packages, but I am not expert on these.
Hope this helps