Can I programmatically update the type of a set of

Posted 2019-06-23 23:05

I would like to modify a set of columns inside a data.table to be factors. If I knew the names of the columns in advance, I think this would be straightforward.

library(data.table)
dt1  <- data.table(a = 1:4, b = rep(c('a','b'), 2), c = rep(c(0,1), 2))
dt1[,class(b)]
dt1[,b:=factor(b)]
dt1[,class(b)]

But I don't, and instead have a list of the variable names

vars.factors  <- c('b','c')

I can apply the factor function to them without a problem ...

lapply(vars.factors, function(x) dt1[,class(get(x))])
lapply(vars.factors, function(x) dt1[,factor(get(x))])

But I don't know how to re-assign or update the original column in the data table.

This fails ...

  lapply(vars.factors, function(x) dt1[,x:=factor(get(x))])
  # Error in get(x) : invalid first argument 

As does this ...

  lapply(vars.factors, function(x) dt1[,get(x):=factor(get(x))])
  # Error in get(x) : object 'b' not found 

NB. I tried the answer proposed here without any luck.

2 Answers
老娘就宠你
#2 · 2019-06-23 23:37

Using data frame:

> df1 = data.frame(dt1)
> # lapply keeps each column as a factor; sapply would collapse them into a character matrix
> df1[vars.factors] = lapply(df1[vars.factors], factor)
> dt1 = data.table(df1)

> dt1
   a b c
1: 1 a 0
2: 2 b 1
3: 3 a 0
4: 4 b 1

> str(dt1)
Classes ‘data.table’ and 'data.frame':  4 obs. of  3 variables:
 $ a: int  1 2 3 4
 $ b: Factor w/ 2 levels "a","b": 1 2 1 2
 $ c: Factor w/ 2 levels "0","1": 1 2 1 2
 - attr(*, ".internal.selfref")=<externalptr> 
时光不老,我们不散
#3 · 2019-06-23 23:53

Yes, this is fairly straightforward:

dt1[, (vars.factors) := lapply(.SD, as.factor), .SDcols=vars.factors]

On the LHS (of := in j), we specify the names of the columns. If a column already exists, it is updated; otherwise a new column is created. On the RHS, we loop over all the columns in .SD (which stands for Subset of Data), and the .SDcols argument specifies which columns .SD should contain.
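As a quick check, here is a minimal sketch that rebuilds the question's table (with the recycling written out explicitly) and compares the column classes before and after the update. The names before and after are only illustrative.

```r
library(data.table)

# Same toy table as in the question; rep(..., 2) makes the recycling explicit
dt1 <- data.table(a = 1:4, b = rep(c('a','b'), 2), c = rep(c(0,1), 2))
vars.factors <- c('b','c')

# Column classes before the update
before <- sapply(dt1, class)

# Convert the named columns by reference; column a is not touched
dt1[, (vars.factors) := lapply(.SD, as.factor), .SDcols = vars.factors]

# Column classes after the update
after <- sapply(dt1, class)

print(before)
print(after)
```

Because := updates by reference, there is no need to assign the result back to dt1.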

Following up on comment:

Note that we need to wrap the LHS in () so that it is evaluated and the column names stored in the vars.factors variable are used. This is because we allow the syntax

DT[, col := value]

when there's only one column to assign, by specifying the column name as a symbol (without quotes), purely for convenience. This creates a column named col and assigns value to it.

To tell these two cases apart, we need the (). Wrapping the LHS in () is enough to signal that we want the values stored in the variable rather than a column with the variable's name.
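The difference between the two forms can be seen in a small sketch. The table and the names dt_sym, dt_par and vars are hypothetical, chosen only for this illustration.

```r
library(data.table)

base_dt <- data.table(b = c('a','b','a','b'))
vars <- 'b'   # hypothetical variable holding a column name

# Bare symbol on the LHS: adds a new column literally named "vars"
dt_sym <- copy(base_dt)
dt_sym[, vars := factor(b)]
names(dt_sym)      # "b" "vars"
class(dt_sym$b)    # "character", column b is unchanged

# Parenthesised LHS: vars is evaluated, so column b itself is replaced
dt_par <- copy(base_dt)
dt_par[, (vars) := factor(b)]
names(dt_par)      # "b"
class(dt_par$b)    # "factor"
```

copy() is used so that each form starts from an untouched table, since := modifies its table by reference.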
