Counting the result of a left join using dplyr

2019-08-07 01:39发布

问题:

What is the proper way to count the result of a left outer join using dplyr?

Consider the two data frames:

a <- data.frame( id=c( 1, 2, 3, 4 ) )
b <- data.frame( id=c( 1, 1, 3, 3, 3, 4 ), ref_id=c( 'a', 'b', 'c', 'd', 'e', 'f' ) )

a specifies four different IDs. b specifies six records that reference IDs in a. If I want to see how many times each ID is referenced, I might try this:

a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=n() )
Source: local data frame [4 x 2]

     id  refs
  (dbl) (int)
1     1     2
2     2     1
3     3     3
4     4     1

However, the result is misleading because it indicates that ID 2 was referenced once when in reality, it was never referenced (in the intermediate data frame, ref_id was NA for ID 2). I would like to avoid introducing a separate library such as sqldf.

回答1:

With data.table, you can do

library(data.table)
setDT(a); setDT(b)

b[a, .N, on="id", by=.EACHI]


   id N
1:  1 2
2:  2 0
3:  3 3
4:  4 1

Here, the syntax is x[i, j, on, by=.EACHI].

  • .EACHI refers to each row of i=a.
  • j=.N uses a special variable for the number of rows.


回答2:

There are already some good answers but since the question asks not to use packages here is one. We perform a left join on a and b and append a refs column which is TRUE if ref_id is not NA. Then use aggregate to sum over the refs column:

m <- transform(merge(a, b, all.x = TRUE), refs = !is.na(ref_id))
aggregate(refs ~ id, m, sum)

giving:

  id refs
1  1    2
2  2    0
3  3    3
4  4    1


回答3:

I'm having a hard time deciding if this is a hack or the proper way to count references, but this returns the expected result:

a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=sum( !is.na( ref_id ) ) )
Source: local data frame [4 x 2]

     id  refs
  (dbl) (int)
1     1     2
2     2     0
3     3     3
4     4     1


标签: r dplyr