Applying calculation per groups within R dataframe

2019-04-12 20:43发布

问题:

I have data like that:

object category country
495647 1        RUS  
477462 2        GER  
431567 3        USA  
449136 1        RUS  
367260 1        USA  
495649 1        RUS  
477461 2        GER  
431562 3        USA  
449133 2        RUS  
367264 2        USA  
...

where one object appears in various (category, country) pairs and countries share a single list of categories.

I'd like to add another column to that, which would be a category weight per country - the number of objects appearing in a category for a category, normalized to sum up to 1 within a country (summation only over unique (category, country) pairs).

I could do something like:

aggregate(df$object, list(df$category, df$country), length)

and then calculate the weight from there, but what's a more efficient and elegant way of doing that directly on the original data.

Desired example output:

object category country weight
495647 1        RUS     .75
477462 2        GER     .5 
431567 3        USA     .5 
449136 1        RUS     .75
367260 1        USA     .25
495649 1        RUS     .75
477461 3        GER     .5
431562 3        USA     .5
449133 2        RUS     .25
367264 2        USA     .25
...

The above would sum up to one within country for unique (category, country) pairs.

回答1:

Responding specifically with the final sentence in mind: "What's a more efficient and elegant way of doing that directly on the original data.", it just so happens that data.table has a new feature for this.

install.packages("data.table", repos="http://R-Forge.R-project.org")
# Needs version 1.8.1 from R-Forge.  Soon to be released to CRAN.

With your data in DT :

> DT[, countcat:=.N, by=list(country,category)]     # add 'countcat' column
    category country countcat
 1:        1     RUS        3
 2:        2     GER        1
 3:        3     USA        2
 4:        1     RUS        3
 5:        1     USA        1
 6:        1     RUS        3
 7:        3     GER        1
 8:        3     USA        2
 9:        2     RUS        1
10:        2     USA        1

> DT[, weight:=countcat/.N, by=country]     # add 'weight' column
    category country countcat weight
 1:        1     RUS        3   0.75
 2:        2     GER        1   0.50
 3:        3     USA        2   0.50
 4:        1     RUS        3   0.75
 5:        1     USA        1   0.25
 6:        1     RUS        3   0.75
 7:        3     GER        1   0.50
 8:        3     USA        2   0.50
 9:        2     RUS        1   0.25
10:        2     USA        1   0.25

:= adds a column by reference to the data and is an 'old' feature. The new feature is that it now works by group. .N is a symbol that holds the number of rows in each group.

These operations are memory efficient and should scale to large data; e.g., 1e8, 1e9 rows.

If you don't wish to include the intermediate column countcat, just remove it afterwards. Again, this is an efficient operation which works instantly regardless of the size of the table (by moving pointers internally).

> DT[,countcat:=NULL]     # remove 'countcat' column
    category country weight
 1:        1     RUS   0.75
 2:        2     GER   0.50
 3:        3     USA   0.50
 4:        1     RUS   0.75
 5:        1     USA   0.25
 6:        1     RUS   0.75
 7:        3     GER   0.50
 8:        3     USA   0.50
 9:        2     RUS   0.25
10:        2     USA   0.25
> 


回答2:

I actually asked a similar question some time ago. data.table is really nice for this, especially now that := by group is implemented, and a self join is not necessary anymore - as illustrated above. the best solution from base R is ave(). tapply() can also be used.

This is similar to the solution above, using ave(). However, I highly recommend you look at data.table.

df$count <- ave(x = df$object, df$country, df$category, FUN = length)
df$weight <- ave(x = df$count, df$country, FUN = function(x) x/length(x))


回答3:

I don't see a readable way to do it in one line. But it can be quite compact.

# Use table to get the counts.
counts <- table(df[,2:3])
# Normalize the table
weights <- t(t(counts)/colSums(counts))
# Use 'matrix' selection by names.
df$weight <- weights[as.matrix(df[,2:3])]