What my question isn't:
- Efficient way to maintain a h2o data frame
- H2O running slower than data.table R
- Loading data bigger than the memory size in h2o
Hardware/Space:
- 32 Xeon threads w/ ~256 GB RAM
- ~65 GB of data to upload (about 5.6 billion cells)
Problem:
It is taking hours to upload my data into h2o. There is no special processing involved, just a call to "as.h2o(...)".
It takes less than a minute to read the text into R with "fread"; I then make a few row/column transformations (diffs, lags) and try to import.
Total R memory usage is ~56 GB before any "as.h2o" call, so the 128 GB allocated to h2o shouldn't be too crazy, should it?
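Roughly, the slow path looks like this (a minimal sketch; the file name and the column used in the lag/diff are placeholders, not my real data):

```r
library(data.table)
library(h2o)

h2o.init(nthreads = -1)

dt <- fread("raw_data.txt")       # ~65 GB of text, reads in under a minute
dt[, x_lag  := shift(x)]          # example lag on a placeholder column x
dt[, x_diff := x - shift(x)]      # example diff

hf <- as.h2o(dt)                  # this single call is what takes hours
```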
Question:
What can I do to make this load into h2o in less than an hour? It should take somewhere from a minute to a few minutes, no longer.
What I have tried:
- bumping RAM up to 128 GB in "h2o.init" (see the sketch after this list)
- using slam, data.table, and options( ...
- converting with "as.data.frame" before "as.h2o"
- writing to a csv file (R's write.csv chokes and takes forever; it is writing a lot of GB though, so I understand)
- writing to sqlite3: too many columns for a table, which is strange
- checking the drive cache/swap to make sure there are enough GB there; perhaps Java is using the cache (still working on this)
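For reference, this is roughly what the memory bump, the explicit data.frame conversion, and the csv dump look like (a sketch; the path is a placeholder and dt is the data.table loaded with "fread" above):

```r
library(data.table)
library(h2o)

# Restart the cluster with a bigger JVM heap (128 GB)
h2o.shutdown(prompt = FALSE)
h2o.init(nthreads = -1, max_mem_size = "128g")

dt <- fread("raw_data.txt")

# Convert to a plain data.frame first, then hand off to h2o
df <- as.data.frame(dt)
hf <- as.h2o(df)

# Dump to csv with base R (this is the step that chokes on tens of GB)
write.csv(dt, "dump.csv", row.names = FALSE)
```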
Update:
So it looks like my only option is to write out a giant text file and then use "h2o.importFile(...)" on it. I'm up to 15 GB written so far.
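That route looks roughly like this (a sketch; "giant_dump.csv" is a placeholder path):

```r
library(h2o)

h2o.init(nthreads = -1, max_mem_size = "128g")

# Slow, multi-GB write from R
write.csv(dt, "giant_dump.csv", row.names = FALSE)

# H2O parses the file itself, in parallel, instead of pushing it through R
hf <- h2o.importFile(path = "giant_dump.csv")
```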
Update2:
It is a hideous csv file, at ~22 GB (~2.4M rows, ~2300 cols). For what it's worth, writing the csv took from 12:53 pm until 2:44 pm. Importing it afterwards was substantially faster.