Use of match within i of data.table

Posted 2019-05-07 20:27

Question:

The %in% operator is a wrapper for the match function returning "a vector of the same length as x". For instance:

match(c("a", "b", "c"), c("a", "a"), nomatch = 0) > 0
## [1]  TRUE FALSE FALSE
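
To make the equivalence concrete, here is a minimal base R sketch. The name in_sketch is hypothetical and only used for illustration; the point is that the result always has one element per element of x, however often a value repeats in the lookup vector.

```r
x <- c("a", "b", "c")
table <- c("a", "a")

# Hand-written equivalent of %in% (in_sketch is a hypothetical name).
in_sketch <- function(x, table) match(x, table, nomatch = 0L) > 0L

via_match <- in_sketch(x, table)
via_in <- x %in% table

# Both are logical vectors with one element per element of x,
# regardless of how many times a value repeats in table.
identical(via_match, via_in)
length(via_in) == length(x)
```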

When used within i of data.table, however

(dt1 <- data.table(v1 = c("a", "b", "c"), v2 = "dt1"))
   v1  v2
1:  a dt1
2:  b dt1
3:  c dt1
(dt2 <- data.table(v1 = c("a", "a"), v2 = "dt2"))
   v1  v2
1:  a dt2
2:  a dt2
dt1[v1 %in% dt2$v1]
   v1  v2
1:  a dt1
2:  a dt1

duplicates are obtained. Should the expected behaviour of %in% within i of data.table not give the same result as

dt1[dt1$v1 %in% dt2$v1]  
   v1  v2
1:  a dt1

i.e. without duplicates?

Answer 1:

This was a bug in the automatic indexing of data.table versions before 1.9.5; it is fixed in version 1.9.5 and later.
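
One way to picture how such duplicates can arise is the difference between a logical filter and a join-style lookup. The sketch below uses only base R data frames and is an assumption about the general mechanism, not a description of data.table's internals: a logical filter yields one flag per row of the table, while a lookup yields one row index per lookup value, so a repeated lookup value repeats the row.

```r
# Base R stand-ins for dt1 and dt2$v1 (no data.table needed).
df1 <- data.frame(v1 = c("a", "b", "c"), v2 = "dt1", stringsAsFactors = FALSE)
lookup <- c("a", "a")

# Logical filter: one TRUE/FALSE per row of df1, so each row appears at most once.
keep <- df1$v1 %in% lookup
filtered <- df1[keep, ]

# Join-style lookup: one row index per element of lookup,
# so the repeated "a" selects row 1 twice.
idx <- match(lookup, df1$v1, nomatch = 0)
joined <- df1[idx, ]

# De-duplicating the lookup values first makes the two approaches agree.
idx_unique <- match(unique(lookup), df1$v1, nomatch = 0)
joined_unique <- df1[idx_unique, ]

nrow(filtered)
nrow(joined)
nrow(joined_unique)
```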

I can think of 3 possible workarounds:

Disable the auto indexing and use base R %in% as in

options(datatable.auto.index = FALSE)
dt1[v1 %in% dt2$v1]
##    v1  v2
## 1:  a dt1

Use the built-in %chin% operator, which is both more efficient and free of this bug (it works only for comparisons between character vectors)
```
dt1[v1 %chin% dt2$v1]
##    v1  v2
## 1:  a dt1
```

Install the development version from GitHub (close all your R sessions first and reopen just one)

library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)
library(data.table)
dt1 <- data.table(v1 = c("a", "b", "c"), v2 = "dt1")
dt2 <- data.table(v1 = c("a", "a"), v2 = "dt2")
dt1[v1 %in% dt2$v1]
##    v1  v2
## 1:  a dt1
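
To decide whether any of these workarounds is needed at all, the version can be compared against 1.9.5, the threshold stated above. This is a small sketch in base R; needs_workaround is a hypothetical helper name, and the commented-out last line assumes data.table is installed.

```r
# Hypothetical helper: TRUE when a data.table version string predates the fix.
needs_workaround <- function(version) {
  package_version(version) < package_version("1.9.5")
}

needs_workaround("1.9.4")
needs_workaround("1.9.6")

# For the installed package (assumes data.table is installed):
# needs_workaround(as.character(packageVersion("data.table")))
```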
