I have data like that:
object category country
495647 1 RUS
477462 2 GER
431567 3 USA
449136 1 RUS
367260 1 USA
495649 1 RUS
477461 2 GER
431562 3 USA
449133 2 RUS
367264 2 USA
...
where one object appears in various (category, country)
pairs and countries share a single list of categories.
I'd like to add another column to that, which would be a category weight per country - the number of objects appearing in a category for a category, normalized to sum up to 1 within a country (summation only over unique (category, country)
pairs).
I could do something like:
aggregate(df$object, list(df$category, df$country), length)
and then calculate the weight from there, but what's a more efficient and elegant way of doing that directly on the original data.
Desired example output:
object category country weight
495647 1 RUS .75
477462 2 GER .5
431567 3 USA .5
449136 1 RUS .75
367260 1 USA .25
495649 1 RUS .75
477461 3 GER .5
431562 3 USA .5
449133 2 RUS .25
367264 2 USA .25
...
The above would sum up to one within country for unique (category, country)
pairs.
Responding specifically with the final sentence in mind: "What's a more efficient and elegant way of doing that directly on the original data.", it just so happens that
data.table
has a new feature for this.With your data in
DT
::=
adds a column by reference to the data and is an 'old' feature. The new feature is that it now works by group..N
is a symbol that holds the number of rows in each group.These operations are memory efficient and should scale to large data; e.g.,
1e8
,1e9
rows.If you don't wish to include the intermediate column
countcat
, just remove it afterwards. Again, this is an efficient operation which works instantly regardless of the size of the table (by moving pointers internally).I actually asked a similar question some time ago. data.table is really nice for this, especially now that := by group is implemented, and a self join is not necessary anymore - as illustrated above. the best solution from base R is
ave()
.tapply()
can also be used.This is similar to the solution above, using
ave()
. However, I highly recommend you look at data.table.I don't see a readable way to do it in one line. But it can be quite compact.