What is the proper way to count the result of a left outer join using dplyr?
Consider the two data frames:
a <- data.frame( id=c( 1, 2, 3, 4 ) )
b <- data.frame( id=c( 1, 1, 3, 3, 3, 4 ), ref_id=c( 'a', 'b', 'c', 'd', 'e', 'f' ) )
a
specifies four different IDs. b
specifies six records that reference IDs in a
. If I want to see how many times each ID is referenced, I might try this:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=n() )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 1
3 3 3
4 4 1
However, the result is misleading because it indicates that ID 2
was referenced once when in reality, it was never referenced (in the intermediate data frame, ref_id was NA
for ID 2). I would like to avoid introducing a separate library such as sqldf
.
With data.table, you can do
library(data.table)
setDT(a); setDT(b)
b[a, .N, on="id", by=.EACHI]
id N
1: 1 2
2: 2 0
3: 3 3
4: 4 1
Here, the syntax is x[i, j, on, by=.EACHI]
.
.EACHI
refers to each row of i=a
.
j=.N
uses a special variable for the number of rows.
There are already some good answers but since the question asks not to use packages here is one. We perform a left join on a
and b
and append a refs
column which is TRUE if ref_id
is not NA. Then use aggregate
to sum over the refs
column:
m <- transform(merge(a, b, all.x = TRUE), refs = !is.na(ref_id))
aggregate(refs ~ id, m, sum)
giving:
id refs
1 1 2
2 2 0
3 3 3
4 4 1
I'm having a hard time deciding if this is a hack or the proper way to count references, but this returns the expected result:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=sum( !is.na( ref_id ) ) )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 0
3 3 3
4 4 1