Reshape alternating columns in less time and using less memory

Published 2020-04-14 07:18

Question:

How can I make this reshape faster and less memory-hungry? My aim is to reshape a dataframe of 500,000 rows by 500 columns on a machine with 4 GB of RAM.
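For scale, a quick back-of-envelope estimate (assuming the 500 topic-number and 500 proportion columns are all stored as 8-byte doubles) suggests the wide data frame alone roughly fills 4 GB, before reshape makes any copies:

```r
rows <- 500000
numeric_cols <- 2 * 500           # 500 ntop_* plus 500 ptop_* columns
bytes <- rows * numeric_cols * 8  # 8 bytes per double
bytes / 2^30                      # about 3.7 GiB for the wide frame alone
```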

Here's a function that will make some reproducible data:

make_example <- function(ndoc, ntop){
  # doc numbers
  V1 <- seq_len(ndoc)
  # filenames: random 5-character strings
  V2 <- vector("list", ndoc)  # was list("vector", size = ndoc), which builds a 2-element list
  for (i in 1:ndoc){
    V2[[i]] <- paste(sample(c(rep(0:9, each = 5), LETTERS, letters), 5, replace = TRUE),
                     collapse = '')
  }
  # topic proportions
  tvals <- data.frame(matrix(runif(ndoc * ntop), ncol = ntop))
  # topic number
  tnumvals <- data.frame(matrix(sample(1:ntop, size = ndoc * ntop, replace = TRUE), ncol = ntop))
  # interleave topic numbers and topic proportions as alternating columns (rather slow!)
  alternating <- data.frame(c(matrix(c(tnumvals, tvals), 2, byrow = TRUE)))
  # column names for topic numbers and topic proportions
  ntopx <- sapply(1:ntop, function(j) paste0("ntop_", j))
  ptopx <- sapply(1:ntop, function(j) paste0("ptop_", j))
  tops <- c(rbind(ntopx, ptopx))
  # assemble the data frame
  dat <- data.frame(V1 = V1,
                    V2 = unlist(V2),
                    alternating)
  names(dat) <- c("docnum", "filename", tops)
  return(dat)
}

Make some reproducible data:

set.seed(007)
dat <- make_example(500000, 500)

Here's my current method (thanks to https://stackoverflow.com/a/8058714/1036500):

library(reshape2)
NTOPICS <- (ncol(dat) - 2) / 2
nam <- c('num', 'text', paste(c('topic', 'proportion'), rep(1:NTOPICS, each = 2), sep = ""))

system.time( dat_l2 <- reshape(setNames(dat, nam), varying = 3:length(nam), direction = 'long', sep = "") )
system.time( dat.final2 <- dcast(dat_l2, dat_l2[, 2] ~ dat_l2[, 3], value.var = "proportion") )

Some timings, just for the reshape since that's the slowest step:

make_example(5000,100) = 82 sec

make_example(50000,200) = 2855 sec (crashed on attempting the second step)

make_example(500000,500) = not yet possible...

What other methods are there that are faster and less memory-intensive for this reshape (e.g. data.table)?

Answer 1:

I doubt very much that this will succeed with that little RAM for a 500,000 x 500 dataframe; I wonder whether you could do even simple operations in that space. Buy more RAM. Also, reshape2 is slow on large data, so use stats::reshape for big jobs, and give it a hint about the separator via sep.

> set.seed(007)
> dat <- make_example(5, 3)
> dat
  docnum filename ntop_1     ptop_1 ntop_2    ptop_2 ntop_3    ptop_3
1      1    y8214      3 0.06564574      1 0.6799935      2 0.8470244
2      2    e6x39      2 0.62703876      1 0.2637199      3 0.4980761
3      3    34c19      3 0.49047504      3 0.1857143      3 0.7905856
4      4    1H0y6      2 0.97102441      3 0.1851432      2 0.8384639
5      5    P6zqy      3 0.36222085      3 0.3792967      3 0.4569039

> reshape(dat, direction="long", varying=3:8, sep="_")
    docnum filename time ntop       ptop id
1.1      1    y8214    1    3 0.06564574  1
2.1      2    e6x39    1    2 0.62703876  2
3.1      3    34c19    1    3 0.49047504  3
4.1      4    1H0y6    1    2 0.97102441  4
5.1      5    P6zqy    1    3 0.36222085  5
1.2      1    y8214    2    1 0.67999346  1
2.2      2    e6x39    2    1 0.26371993  2
3.2      3    34c19    2    3 0.18571426  3
4.2      4    1H0y6    2    3 0.18514322  4
5.2      5    P6zqy    2    3 0.37929675  5
1.3      1    y8214    3    2 0.84702439  1
2.3      2    e6x39    3    3 0.49807613  2
3.3      3    34c19    3    3 0.79058557  3
4.3      4    1H0y6    3    2 0.83846387  4
5.3      5    P6zqy    3    3 0.45690386  5

> system.time( dat <- make_example(5000,100) )
   user  system elapsed 
  2.925   0.131   3.043 
> system.time( dat2 <-  reshape(dat, direction="long", varying=3:202, sep="_"))
   user  system elapsed 
 16.766   8.608  25.272 

Roughly 1/5 of my 32 GB of memory was used during that process, on a task 250 times smaller than your goal, so I'm not surprised that your machine hung. (It should not have "crashed"; the authors of R would prefer accurate descriptions of behavior, and I suspect the R process stopped responding when it paged into virtual memory.) Even with 32 GB, I have performance issues to work around with a dataset of 7 million records x 100 columns.
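For the data.table route the question mentions, melt() accepts a patterns() measure spec that gathers both column families in a single pass, which avoids the intermediate copies stats::reshape makes. A minimal sketch on a toy frame (the tiny dat below is a hypothetical stand-in for make_example() output, not part of the original post):

```r
library(data.table)

# hypothetical 2-docs x 2-topics stand-in for make_example() output
dat <- data.frame(docnum = 1:2, filename = c("a", "b"),
                  ntop_1 = c(2, 1), ptop_1 = c(0.6, 0.3),
                  ntop_2 = c(1, 2), ptop_2 = c(0.4, 0.7))
dt <- as.data.table(dat)

# melt both column families at once: patterns() pairs ntop_i with ptop_i
long <- melt(dt,
             id.vars      = c("docnum", "filename"),
             measure.vars = patterns(ntop = "^ntop_", ptop = "^ptop_"),
             variable.name = "topic_slot")

# cast back to one column per topic number, as in the dcast step above
wide <- dcast(long, filename ~ ntop, value.var = "ptop")
```

Note that if a document repeats a topic number, dcast will need a fun.aggregate, just as with the reshape2 version.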