Merging and appending ffdf dataframes

Posted 2019-08-05 00:40

Question:

I am trying to create an ffdf dataframe by merging and appending two existing ffdf dataframes. The ffdfs have different numbers of columns and different numbers of rows. I know that merge() performs only inner and left outer joins on ffdfs, while ffdfappend() will not append if the columns are not identical. Does anyone have a workaround for this, either a function like smartbind() in the gtools package or something else?

Of course, converting back with as.data.frame() and using smartbind() is not an option because of the size of the ffdfs.

Any help would be greatly appreciated.

Edit: As suggested, here is a reproducible example:

require(ff)
require(ffbase)

df1 <- data.frame(A=1:10, B=LETTERS[1:10], C=rnorm(10), G=1 )
df2 <- data.frame(A=11:20, D=rnorm(10), E=letters[1:10], G=1 )
ffdf1 <- as.ffdf(df1) 
ffdf2 <- as.ffdf(df2)

The desired result should look something like this (produced on the data.frames, if I knew how to produce it on the ffdfs I would not be asking the question):

require(gtools)
dfcombined <- smartbind(df1, df2)
dfcombined
      A    B          C G          D    E
1:1   1    A  1.1556719 1         NA <NA>
1:2   2    B  0.3279260 1         NA <NA>
1:3   3    C  0.4067643 1         NA <NA>
1:4   4    D -0.9144717 1         NA <NA>
1:5   5    E -0.1138263 1         NA <NA>
1:6   6    F  0.8227560 1         NA <NA>
1:7   7    G  0.3394098 1         NA <NA>
1:8   8    H  1.4498439 1         NA <NA>
1:9   9    I -1.3202419 1         NA <NA>
1:10 10    J  0.2099266 1         NA <NA>
2:1  11 <NA>         NA 1 -1.5802636    a
2:2  12 <NA>         NA 1  1.2925790    b
2:3  13 <NA>         NA 1  1.3477483    c
2:4  14 <NA>         NA 1 -1.6760211    d
2:5  15 <NA>         NA 1  0.1456295    e
2:6  16 <NA>         NA 1  0.4726867    f
2:7  17 <NA>         NA 1 -1.5209117    g
2:8  18 <NA>         NA 1  0.3407136    h
2:9  19 <NA>         NA 1  1.3582868    i
2:10 20 <NA>         NA 1 -1.5083929    j

I hope this makes it clearer what I am trying to achieve.

Answer 1:

If you are looking for something like rbind.fill but for ffdf objects, maybe this is what you need. It worked for me without memory issues on the test example Jan prepared.

require(ff)
require(ffbase)
smartffdfbind <- function(..., clone=TRUE){
  x <- list(...)
  # union of all column names, in order of first appearance
  columns <- lapply(x, FUN=function(x) colnames(x))
  columns <- do.call(c, columns)
  columns <- unique(columns)
  # pad every ffdf with an all-NA ff column for each name it lacks
  for(element in seq_along(x)){
    missingcolumns <- setdiff(columns, colnames(x[[element]]))
    for(missingcolumn in missingcolumns){
      x[[element]][[missingcolumn]] <- ff(NA, vmode = "logical", length = nrow(x[[element]]))
    }
  }
  # with clone=TRUE the appends work on a copy of the first ffdf
  if(clone){
    result <- clone(x[[1]][columns])
  }else{
    result <- x[[1]][columns]
  }
  # append the remaining ffdfs one by one, with columns in the same order
  for (l in tail(x, -1)) {
    result <- ffdfappend(result[columns], l[columns], recode=TRUE)
  }
  result
}

ffdf1 <- ffdf(a = ffrandom(1E8, rnorm), b = ffrandom(1E8, rnorm))
ffdf2 <- ffdf(b = ffrandom(1E8, rnorm), c = ffrandom(1E8, rnorm))

x <- smartffdfbind(ffdf1, ffdf2)
nrow(x)
[1] 200000000
class(x)
"ffdf"
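
To make the pad-then-append logic of the function above easier to follow, here is a minimal sketch of the same idea on small in-memory data.frames, using base R only. It is an illustration, not a replacement for the ffdf version: pad_columns() is a hypothetical helper name, and rbind() stands in for ffdfappend().

```r
# Illustration only: pad each data.frame with NA columns, then append.
# pad_columns() is a hypothetical helper, not part of ff or ffbase.
pad_columns <- function(df, columns) {
  for (missingcolumn in setdiff(columns, colnames(df))) {
    df[[missingcolumn]] <- NA  # a single NA is recycled to nrow(df)
  }
  df[columns]                  # same column order for every input
}

df1 <- data.frame(A = 1:3, B = c("x", "y", "z"))
df2 <- data.frame(A = 4:5, D = c(0.5, 1.5))

# union of column names, in order of first appearance
columns <- unique(c(colnames(df1), colnames(df2)))

combined <- rbind(pad_columns(df1, columns), pad_columns(df2, columns))
combined
```

The result has five rows and the columns A, B and D, with NA wherever an input lacked a column.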


Answer 2:

The following answer doesn't seem to work on large ffdf objects (1E8 records). After initially posting part of it as a comment, I decided to post it as an answer, as the code might be a starting point for a working answer.

One trick is to first merge a small part of the two ffdfs using, for example, smartbind. Then resize this object to fit ffdf1 and ffdf2. Copy ffdf1 into the first half of this object and ffdf2 into the second half:

require(gtools)
dfcombined <- as.ffdf(smartbind(ffdf1[1,], ffdf2[1,]))

nrow(dfcombined) <- nrow(ffdf1) + nrow(ffdf2)

# insert ffdf1 into dfcombined
cols1a <- names(dfcombined)[names(dfcombined) %in% names(ffdf1)]
cols1b <- names(dfcombined)[!(names(dfcombined) %in% names(ffdf1))]

dfcombined[ri(1, nrow(ffdf1)), cols1a] <- ffdf1
dfcombined[ri(1, nrow(ffdf1)), cols1b] <- NA

# insert ffdf2 into dfcombined
cols2a <- names(dfcombined)[names(dfcombined) %in% names(ffdf2)]
cols2b <- names(dfcombined)[!(names(dfcombined) %in% names(ffdf2))]

dfcombined[ri(nrow(ffdf1)+1, nrow(dfcombined)), cols2a] <- ffdf2
dfcombined[ri(nrow(ffdf1)+1, nrow(dfcombined)), cols2b] <- NA

However, when testing this on a realistically sized ffdf, the nrow(dfcombined) <- ... line generates an error:

> ffdf1 <- ffdf(
+   a = ffrandom(1E8, rnorm),
+   b = ffrandom(1E8, rnorm)
+ )
> ffdf2 <- ffdf(
+   b = ffrandom(1E8, rnorm),
+   c = ffrandom(1E8, rnorm)
+ )
> dfcombined <- as.ffdf(smartbind(ffdf1[1,], ffdf2[1,]))
> 
> nrow(dfcombined) <- nrow(ffdf1) + nrow(ffdf2)
Error: cannot allocate vector of size 762.9 Mb
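
The size in the error message matches what a single in-memory double vector of 1E8 elements would need (8 bytes per element). That suggests the resize tries to hold at least one full column in RAM, although this is an inference from the number alone and not something I have confirmed in the ff sources. A quick check of the arithmetic:

```r
# Size of one double vector with 1E8 elements, in R's "Mb" (2^20 bytes)
n_rows <- 1e8
bytes_per_double <- 8
size_mb <- n_rows * bytes_per_double / 2^20
round(size_mb, 1)  # 762.9
```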