Convert columns of arbitrary class to the class of

2020-03-01 09:04发布

问题:

Question:

I'm working in R. I want the shared columns of 2 data.tables (shared meaning same column name) to have matching classes. I'm struggling with a way to generically convert an object of unknown class to the unknown class of another object.


More context:

I know how to set the class of a column in a data.table, and I know about the as function. Also, this question isn't entirely data.table specific, but it comes up often when I use data.tables. Further, assume that the desired coercion is possible.

I have 2 data.tables. They share some column names, and those columns are intended to represent the same information. For the column names shared by table A and table B, I want the classes of A to match those in B (or other way around).


Example data.tables:

A <- structure(list(year = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), stratum = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L)), .Names = c("year", "stratum"), row.names = c(NA, -45L), class = c("data.table", "data.frame"))

B <- structure(list(year = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), stratum = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L), bt = c(-9.95187702337873, -9.48946944434626, -9.74178662514147, -5.36167545158338, -4.76405522202426, -5.41964239804882, -0.0807951335119085, 0.520481719699774, 0.0393874225863578, 5.40557402913123, 5.47927931969583, 5.37228402911139, 9.82774396910091, 9.89629694010177, 9.98105260936272, -9.82469892896284, -9.42530210357904, -9.66171049964775, -5.17540952901709, -4.81859082470115, -5.3577146169737, -0.0685310909609001, 0.441383303157166, -0.0105897444321987, 5.24205882775199, 5.65773605162835, 5.40217185632441, 9.90299445851434, 9.78883672575814, 9.98747998379124, -9.69843398105195, -9.31530717395811, -9.77406601252698, -4.83080164375344, -4.89056304189872, -5.3904000267275, -0.121508487954861, 0.493798577602088, -0.118550709142654, 5.23654772583187, 5.87760447006892, 5.22478092346285, 9.90949768116403, 9.85433376398086, 9.91619307289277), yr = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)), .Names = c("year", "stratum", "bt", "yr"), row.names = c(NA, -45L), class = c("data.table", "data.frame"), sorted = c("year", "stratum"))

Here's what they look like:

> A  
    year stratum
 1:    1       1
 2:    1       2
 3:    1       3
 4:    1       4

> B
    year stratum          bt yr
 1:    1       1 -9.95187702  1
 2:    1       2 -9.48946944  1
 3:    1       3 -9.74178663  1
 4:    1       4 -5.36167545  1

Here are the classes:

> sapply(A, class)
     year   stratum 
"integer" "integer"

> sapply(B, class)
     year   stratum        bt        yr 
"numeric" "integer" "numeric" "numeric"

Manually, I can accomplish the desired task through the following:

A[,year:=as.numeric(year)]

This is easy when there's only 1 column to change, you know that column ahead of time, and you know the desired class ahead of time. If desired, it's also pretty easy to to convert arbitrary columns to a given class. I also know how to convert arbitrary columns to any given class.


My Failed Attempt:

(EDIT: This actually works; see my answer)

s2c <- function (x, type = "list") 
{
    as.call(lapply(c(type, x), as.symbol))
}

# In this case, I can assume all columns of A can be found in B
# I am also able to assume that the desired conversion is possible
B.class <- sapply(B[,eval(s2c(names(A)))], class) 
for(col in names(A)){
    set(A, j=col, value=as(A[[col]], B.class[col]))
}

But this still returns the year column as "integer", not "numeric":

> sapply(A, class)
     year   stratum 
"integer" "integer" 

The problem in the above example is that class(as(1L, "numeric")) still returns "integer". On the other hand, class(as.numeric(1L)) returns "numeric"; however, I don't know ahead of time that need as.numeric is needed.


Question, Restated:

How do I make the column classes match, when neither columns nor the to/from classes are known ahead of time?


Additional Thoughts:

In a way, the question is mostly about arbitrary class matching. I run into this issue often with data.table because it's very vocal about class matching. E.g., I run into similar problems when needed to insert NA of the appropriate type (NA_real_ vs NA_character_, etc), depending on the class of the column (see related question/ issue in This Question).

Again, this question can be seen as a general issue of converting between arbitrary classes that aren't known in advance. In the past, I've written functions using switch to do something like switch(class(x), double = as.numeric(...), character = as.character(...), ..., but that seems a big ugly. The only reason I'm bringing this up in the context of data.table is because it's where I most often encounter the need for this type of functionality.

回答1:

Not very elegant but you may 'build' the as.* call like this:

for (x in colnames(A)) { A[,x] <- eval( call( paste0("as.", class(B[,x])), A[,x]) )}


回答2:

This is one very crude way to ensure common classes:

library(magrittr)

cols = intersect(names(A), names(B))
r    = rbindlist(list(A = A, B = B[,cols,with=FALSE]), idcol = TRUE)
r[, (cols) := lapply(.SD, . %>% as.character %>% type.convert), .SDcols=cols]
B[, (cols) := r[.id=="B", cols, with=FALSE]]
A[, (cols) := r[.id=="A", cols, with=FALSE]]

sapply(A, class); sapply(B, class)
#      year   stratum 
# "integer" "integer" 
#      year   stratum        yr 
# "integer" "integer" "numeric" 

I don't like this solution:

  • I routinely use all-integer codes for IDs (like "00001", "02995"), and this would coerce those to actual integers, which is bad.
  • Who knows what this will do to fancy classes like Date or factor? This won't matter so much if you do this col-classes normalization as soon as you read data in, I suppose.

Data:

# slightly tweaked from OP
A <- setDT(structure(list(year = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), stratum = c(1L, 2L, 
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L, 10L, 11L, 12L, 13L, 14L, 15L)), .Names = c("year", "stratum"), row.names = 
c(NA, -45L), class = c("data.frame")))

B <- setDT(structure(list(year = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, 3, 3, 3), stratum = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 
14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L), yr = c(1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)), .Names = c("year", "stratum", 
"yr"), row.names = c(NA, -45L), class = c("data.frame")))

Comment. If you have something against magrittr, use function(x) type.convert(as.character(x)) in place of the . %>% bit.



回答3:

Based on the discussion in this question, and comments in this answer, I'm thinking I may have had it right, and just landed on an odd exception.

Note that the class doesn't change, but the technicality is that it doesn't matter (for my particular use-case that prompted the question). Below I show my "failed approach", but by following through to the merge, and the classes of the columns in the merged data.table, we can see why the approach works: integers will just get promoted.

s2c <- function (x, type = "list") 
{
    as.call(lapply(c(type, x), as.symbol))
}

# In this case, I can assume all columns of A can be found in B
# I am also able to assume that the desired conversion is possible
B.class <- sapply(B[,eval(s2c(names(A)))], class)
for(col in names(A)){
    set(A, j=col, value=as(A[[col]], B.class[col]))
}

# Below here is new from what I tried in question
AB <- data.table:::merge.data.table(A, B, all=T, by=c("stratum","year"))

sapply(AB, class)
  stratum      year        bt        yr 
"integer" "numeric" "numeric" "numeric" 

Although the problem in the question isn't solved by this answer, I figured I'd post to point out that the failure to convert "integer" to "numeric" might not be a problem in many situations, so this is a straightforward, albeit circumstantial, solution.