Select columns of data.table based on regex

Posted 2020-02-23 06:53

How can I select columns of a data.table based on a regex? Consider a simple example as follows:

library(data.table)
mydt <- data.table(foo=c(1,2), bar=c(2,3), baz=c(3,4))

Is there a way to select the columns bar and baz from the data.table based on a regex? I know that the following solution works, but if the table were much bigger and I wanted to choose more variables, this would easily get cumbersome.

mydt[, .(bar, baz)]

I would like to have something like matches() in dplyr::select() but only by reference.
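For reference, this is what I mean (a sketch assuming dplyr is available; note that select() returns a copy rather than working by reference):

```r
library(data.table)
library(dplyr)

mydt <- data.table(foo = c(1, 2), bar = c(2, 3), baz = c(3, 4))

# dplyr's matches() helper does regex-based column selection,
# but the result is a copy, not a by-reference operation
select(mydt, matches("bar|baz"))
```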

5 Answers
家丑人穷心不美
Answer #2 · 2020-02-23 06:58

There is also a subset method for "data.table", so you can always use something like the following:

subset(mydt, select = grep("bar|baz", names(mydt)))
#    bar baz
# 1:   2   3
# 2:   3   4

It turns out that creating a startswith type of function for "data.table" is not very straightforward.
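That said, a simple startswith-style helper (a hypothetical name, not part of data.table) can be built on grep with an anchored pattern:

```r
library(data.table)

mydt <- data.table(foo = c(1, 2), bar = c(2, 3), baz = c(3, 4))

# hypothetical helper: keep only columns whose names start with `prefix`
startswith_cols <- function(dt, prefix) {
  cols <- grep(paste0("^", prefix), names(dt), value = TRUE)
  dt[, cols, with = FALSE]
}

startswith_cols(mydt, "ba")  # selects bar and baz
```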

何必那么认真
Answer #3 · 2020-02-23 07:08

Since data.table v1.12.0 (Jan 2019) you can do:

mydt[, .SD, .SDcols = patterns("bar|baz")]

From the official documentation ?data.table:

[...] you can filter columns to include in .SD according to regular expressions via .SDcols=patterns(regex1, regex2, ...). The included columns will be the intersection of the columns identified by each pattern; pattern unions can easily be specified with | in a regex. You can also invert a pattern as usual with .SDcols = !patterns(...).
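For instance, on the example table from the question (requires data.table >= 1.12.0):

```r
library(data.table)
mydt <- data.table(foo = c(1, 2), bar = c(2, 3), baz = c(3, 4))

# multiple patterns are intersected: starts with "ba" AND ends with "r"
mydt[, .SD, .SDcols = patterns("^ba", "r$")]   # bar only

# inversion: every column except those matching "foo"
mydt[, .SD, .SDcols = !patterns("foo")]        # bar and baz
```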

干净又极端
Answer #4 · 2020-02-23 07:15

You can also try %like% from the data.table package, which is a "convenience function for calling regexpr". It also makes the code more readable ;)

In this case, answering your question:

mydt[, .SD, .SDcols = names(mydt) %like% "bar|baz"]

As %like% returns a logical vector, we can use the following to get every column except those whose names contain "foo":

mydt[, .SD, .SDcols = ! names(mydt) %like% "foo"]

where ! negates the logical vector.

Lonely孤独者°
Answer #5 · 2020-02-23 07:18

David's answer will work. But if your regex is long and you would rather it be done first, try:

cols <- grep("<regex pattern>", names(mydt), value = TRUE)
mydt[, cols, with = FALSE]

It just depends on your preferences and needs. You can also assign the subsetted table to a chosen variable if you need the original intact.
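For instance, assigning the result to a new name leaves the original table untouched, since column subsetting returns a new data.table:

```r
library(data.table)
mydt <- data.table(foo = c(1, 2), bar = c(2, 3), baz = c(3, 4))

cols <- grep("bar|baz", names(mydt), value = TRUE)
sub_dt <- mydt[, cols, with = FALSE]  # new table holding only bar and baz

names(sub_dt)  # "bar" "baz"
names(mydt)    # original still has "foo" "bar" "baz"
```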

姐就是有狂的资本
Answer #6 · 2020-02-23 07:21

UPDATE: I updated the comparison with @sindri_baldur's answer, using version 1.12.6. According to the results, patterns() is a handy shortcut, but if performance matters, one should stick with the .. or with = FALSE solutions (see below).


Apparently, there is a new way of achieving this from version 1.10.2 onwards.

library(data.table)
cols <- grep("bar|baz", names(mydt), value = TRUE)
mydt[, ..cols]

It seems to work the fastest out of the posted solutions.

# Creating a large data.table with 100k rows, 32 columns
n <- 100000
foo_cols <- paste0("foo", 1:30)
big_dt <- data.table(bar = rnorm(n), baz = rnorm(n))
big_dt[, (foo_cols) := rnorm(n)]

# Methods
subsetting <- function(dt) {
    subset(dt, select = grep("bar|baz", names(dt)))
}

usingSD <- function(dt) {
    dt[, .SD, .SDcols = names(dt) %like% "bar|baz"]
}

usingWith <- function(dt) {
    cols <- grep("bar|baz", names(dt), value = TRUE)
    dt[, cols, with = FALSE]
}

usingDotDot <- function(dt) {
    cols <- grep("bar|baz", names(dt), value = TRUE)
    dt[, ..cols]
}

usingPatterns <- function(dt) {
  dt[, .SD, .SDcols = patterns("bar|baz")]
}

# Benchmark
library(microbenchmark)
microbenchmark(
    subsetting(big_dt), usingSD(big_dt), usingWith(big_dt),
    usingDotDot(big_dt), usingPatterns(big_dt),
    times = 5000
)

#Unit: microseconds
#                  expr  min   lq  mean median    uq    max neval
#    subsetting(big_dt)  430  759  1672   1309  1563  82934  5000
#       usingSD(big_dt)  547  951  1872   1461  1797  60357  5000
#     usingWith(big_dt)  278  496  1331   1112  1304  62656  5000
#   usingDotDot(big_dt)  289  483  1392   1117  1344  55878  5000
# usingPatterns(big_dt)  596 1019  1984   1518  1913 120331  5000