Left join using data.table

2019-01-02 21:20发布

Suppose I have two data.table's:

A:

  A  B
1: 1 12
2: 2 13
3: 3 14
4: 4 15

B:

   A  B
1: 2 13
2: 3 14

and I have the following code:

merge_test = merge(dataA, dataB, by="A", all.data=TRUE)

I get:

   A B.x B.y
1: 2  13  13
2: 3  14  14

However, I want all the rows in dataA in the final merged table. Is there a way to do this?

2条回答
不流泪的眼
2楼-- · 2019-01-02 21:32

If you want to add the b values of B to A, then it's best to join A with B and update A by reference as follows:

A[B, on = 'a', bb := i.b]

which gives:

> A
   a  b bb
1: 1 12 NA
2: 2 13 13
3: 3 14 14
4: 4 15 NA

This is a better appraoch than using B[A, on='a'] because the latter just prints the result to the console. When you want to get the results back into A, you need to use A <- B[A, on='a'] which will give you the same result.

The reason why A[B, on = 'a', bb := i.b] is better than A <- B[A, on = 'a'] is memory efficiency. With A[B, on = 'a', bb := i.b] the location of A in memory stays the same:

> address(A)
[1] "0x102afa5d0"
> A[B, on = 'a', bb := i.b]
> address(A)
[1] "0x102afa5d0"

While on the other hand with A <- B[A, on = 'a'], a new object is created and saved in memory as A and hence has another location in memory:

> address(A)
[1] "0x102abae50"
> A <- B[A, on = 'a']
> address(A)
[1] "0x102aa7e30"

Using merge (merge.data.table) results in a similar change in memory location:

> address(A)
[1] "0x111897e00"
> A <- merge(A, B, by = 'a', all.x = TRUE)
> address(A)
[1] "0x1118ab000"

For memory efficiency it is thus better to use an 'update-by-reference-join' syntax:

A[B, on = 'a', bb := i.b] 

Although this doesn't make a noticable difference with small datasets like these, it does make a difference on large datasets for which data.table was designed.

Probably also worth mentioning is that the order of A stays the same.


Used data:

A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)
查看更多
柔情千种
3楼-- · 2019-01-02 21:49

You can try this:

# used data
# set the key in 'B' to the column which you use to join
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14, key = 'a') 

B[A]
查看更多
登录 后发表回答