Applying a calculation per group within an R data frame

I have data like this:

object category country
495647 1        RUS  
477462 2        GER  
431567 3        USA  
449136 1        RUS  
367260 1        USA  
495649 1        RUS  
477461 3        GER  
431562 3        USA  
449133 2        RUS  
367264 2        USA  
...

where one object can appear in several (category, country) pairs, and all countries share a single list of categories.

I'd like to add another column: a category weight per country, i.e. the number of objects appearing in a category for a given country, normalized so that the weights sum to 1 within each country (summing only over unique (category, country) pairs).

I could do something like:

aggregate(df$object, list(df$category, df$country), length)

and then calculate the weight from there, but what's a more efficient and elegant way of doing that directly on the original data?
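For reference, the aggregate() route mentioned above could be finished roughly as follows. This is only a sketch: the data frame is rebuilt from the ten sample rows as they appear in the desired output, and the names counts, n and out are made up for illustration.

```r
# Sample rows, taken from the desired output in the question.
df <- data.frame(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA"),
  stringsAsFactors = FALSE
)

# One row per (category, country) pair with its object count.
counts <- aggregate(object ~ category + country, data = df, FUN = length)
names(counts)[names(counts) == "object"] <- "n"

# Divide each pair count by the total count of its country.
counts$weight <- counts$n / ave(counts$n, counts$country, FUN = sum)

# Attach the weights to the original rows (note: merge() reorders rows).
out <- merge(df, counts[, c("category", "country", "weight")],
             by = c("category", "country"))
```

This works, but it needs an intermediate table and a merge, which is the overhead the question is trying to avoid.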

Desired example output:

object category country weight
495647 1        RUS     .75
477462 2        GER     .5 
431567 3        USA     .5 
449136 1        RUS     .75
367260 1        USA     .25
495649 1        RUS     .75
477461 3        GER     .5
431562 3        USA     .5
449133 2        RUS     .25
367264 2        USA     .25
...

The above would sum up to one within country for unique (category, country) pairs.

Tags: r aggregation data.table

3 Answers

三岁会撩人

Reply #2 · 2019-04-12 21:20

Responding specifically to the final sentence, "what's a more efficient and elegant way of doing that directly on the original data": it just so happens that data.table has a new feature for this.

install.packages("data.table", repos="http://R-Forge.R-project.org")
# Needs version 1.8.1 from R-Forge.  Soon to be released to CRAN.

With your data in DT:

> DT[, countcat:=.N, by=list(country,category)]     # add 'countcat' column
    category country countcat
 1:        1     RUS        3
 2:        2     GER        1
 3:        3     USA        2
 4:        1     RUS        3
 5:        1     USA        1
 6:        1     RUS        3
 7:        3     GER        1
 8:        3     USA        2
 9:        2     RUS        1
10:        2     USA        1

> DT[, weight:=countcat/.N, by=country]     # add 'weight' column
    category country countcat weight
 1:        1     RUS        3   0.75
 2:        2     GER        1   0.50
 3:        3     USA        2   0.50
 4:        1     RUS        3   0.75
 5:        1     USA        1   0.25
 6:        1     RUS        3   0.75
 7:        3     GER        1   0.50
 8:        3     USA        2   0.50
 9:        2     RUS        1   0.25
10:        2     USA        1   0.25

:= adds a column by reference to the data and is an 'old' feature. The new feature is that it now works by group. .N is a symbol that holds the number of rows in each group.

These operations are memory efficient and should scale to large data; e.g., 1e8, 1e9 rows.

If you don't wish to include the intermediate column countcat, just remove it afterwards. Again, this is an efficient operation which works instantly regardless of the size of the table (by moving pointers internally).

> DT[,countcat:=NULL]     # remove 'countcat' column
    category country weight
 1:        1     RUS   0.75
 2:        2     GER   0.50
 3:        3     USA   0.50
 4:        1     RUS   0.75
 5:        1     USA   0.25
 6:        1     RUS   0.75
 7:        3     GER   0.50
 8:        3     USA   0.50
 9:        2     RUS   0.25
10:        2     USA   0.25
>
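The three steps above can also be chained into a single expression. The following is a sketch rather than part of the original answer: it assumes a current CRAN release of data.table (where `.()` is an alias for `list()`), and rebuilds DT from the ten sample rows; the name pairs is made up for the check at the end.

```r
library(data.table)

# Sample rows, matching the output shown above (object column included).
DT <- data.table(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA")
)

# Count per (country, category), divide by rows per country,
# then drop the helper column; every step modifies DT by reference.
DT[, countcat := .N, by = .(country, category)][
   , weight := countcat / .N, by = country][
   , countcat := NULL]

# Sanity check: weights of the unique pairs sum to 1 within each country.
pairs <- unique(DT[, .(category, country, weight)])
pairs[, .(total = sum(weight)), by = country]
```

Chaining is purely a matter of taste here; the separate statements above do the same work.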


可以哭但决不认输i

Reply #3 · 2019-04-12 21:24

I actually asked a similar question some time ago. data.table is really nice for this, especially now that := by group is implemented and a self join is no longer necessary, as illustrated above. The best solution from base R is ave(). tapply() can also be used.

This is similar to the solution above, using ave(). However, I highly recommend you look at data.table.

df$count <- ave(x = df$object, df$country, df$category, FUN = length)
df$weight <- ave(x = df$count, df$country, FUN = function(x) x/length(x))
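If the intermediate count column isn't wanted, the two ave() calls can be combined, and tapply() can be used to check the result. A sketch, assuming df holds the ten sample rows from the question's desired output; pairs and totals are names made up for the check.

```r
# Sample rows, taken from the desired output in the question.
df <- data.frame(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA"),
  stringsAsFactors = FALSE
)

# Rows per (country, category) divided by rows per country.
df$weight <- ave(df$object, df$country, df$category, FUN = length) /
             ave(df$object, df$country, FUN = length)

# Check: over unique (category, country) pairs, weights sum to 1 per country.
pairs  <- unique(df[, c("category", "country", "weight")])
totals <- tapply(pairs$weight, pairs$country, sum)
```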


SAY GOODBYE

Reply #4 · 2019-04-12 21:36

I don't see a readable way to do it in one line. But it can be quite compact.

# Use table to get the counts.
counts <- table(df[,2:3])
# Normalize the table
weights <- t(t(counts)/colSums(counts))
# Use 'matrix' selection by names.
df$weight <- weights[as.matrix(df[,2:3])]
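A variant of the same idea, as a sketch: prop.table() does the column-wise normalization directly, and building the index with cbind() avoids relying on as.matrix(), which formats numeric columns of a mixed data frame and may pad multi-digit category labels with spaces. The sample rows come from the question's desired output; idx is a name made up here.

```r
# Sample rows, taken from the desired output in the question.
df <- data.frame(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA"),
  stringsAsFactors = FALSE
)

# Contingency table: categories in rows, countries in columns.
counts <- table(df$category, df$country)

# Normalize each country column so that it sums to 1.
weights <- prop.table(counts, margin = 2)

# Look up each row's weight by (category, country) dimnames.
idx <- cbind(as.character(df$category), df$country)
df$weight <- as.numeric(weights[idx])
```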
