R: data.table cross-join not working

2020-01-24 09:36发布

问题:

I have two data.tables that I want to join (form a Cartesian product of). One of the data.tables is keyed on a Date vector, and the other on a numeric vector:

# data.table with dates (as numeric)
dtDates2 = data.table(date = 
                       as.numeric(seq(from = as.Date('2014/01/01'), 
                           to = as.Date('2014/07/01'), by = 'weeks')),
                     data1 = rnorm(26))

# data.table with dates
dtDates1 = data.table(date = 
                        seq(from = as.Date('2014/01/01'), 
                            to = as.Date('2014/07/01'), by = 'weeks'),
                      data1 = rnorm(26))


# data.table with customer IDs
dtCustomers = data.table(customerID = seq(1, 100),
                      data2 = rnorm(100))

I setkey and try to cross-join them using CJ:

# cross join the two datatables
setkey(dtCustomers, customerID)
setkey(dtDates1, date)
setkey(dtDates2, date)

CJ(dtCustomers, dtDates1)
CJ(dtCustomers, dtDates2)

but get the following error:

Error in FUN(X[[1L]], ...) : 
  Invalid column: it has dimensions. Can't format it. If it's the result of data.table(table()), use as.data.table(table()) instead.

Not sure what I am doing wrong.

回答1:

There is no cross join functionality available in data.table out of the box.
Yet there is CJ.dt function (a CJ like but designed for data.tables) to achieve cartesian product (cross join) available in optiRum package (available in CRAN).
You can create the function:

CJ.dt = function(X,Y) {
  stopifnot(is.data.table(X),is.data.table(Y))
  k = NULL
  X = X[, c(k=1, .SD)]
  setkey(X, k)
  Y = Y[, c(k=1, .SD)]
  setkey(Y, NULL)
  X[Y, allow.cartesian=TRUE][, k := NULL][]
}
CJ.dt(dtCustomers, dtDates1)
CJ.dt(dtCustomers, dtDates2)

Yet there is a FR for convenience way to perform cross join filled in data.table#1717, so you could check there if there is a nicer api for cross join.



回答2:

thank you jangorecki for the very useful function

I had to add support for empty X and/or Y:

CJ.dt = function(X, Y) {
  stopifnot(is.data.table(X), is.data.table(Y))

  if(nrow(X) > 0 & nrow(Y) > 0){
    k = NULL
    X = X[, c(k = 1, .SD)]
    setkey(X, k)
    Y = Y[, c(k = 1, .SD)]
    setkey(Y, NULL)
    return(X[Y, allow.cartesian = T][, k := NULL][])

  } else {
    duplicatedNames <- names(Y)[names(Y) %in% names(X)]
    if(length(duplicatedNames) > 0) {
       setnames(Y, duplicatedNames, paste0("i.", duplicatedNames))
    }
    setkey(Y)
    setkey(X)
    return(cbind(X[!X], Y[!Y]))
  }

}

# X <- data.table(a = c(1, 2))
# Y <- data.table(a = c(2, 3), b = c(4, 5))
#
# CJ.dt(X, Y)

# CJ.dt(X[a > 2], Y)