How to get data into h2o fast

Published 2020-07-10 06:20

Question:

What my question isn't:

  • Efficient way to maintain a h2o data frame
  • H2O running slower than data.table R
  • Loading data bigger than the memory size in h2o

Hardware/Space:

  • 32 Xeon threads w/ ~256 GB Ram
  • ~65 GB of data to upload (about 5.6 billion cells)

Problem:
It is taking hours to upload my data into h2o. There is no special processing involved, only a call to "as.h2o(...)".

It takes less than a minute using "fread" to get the text into the workspace, and then I make a few row/column transformations (diffs, lags) and try to import.

Total R memory use is ~56 GB before trying any sort of "as.h2o", so the 128 GB allocated shouldn't be too crazy, should it?
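A minimal sketch of the workflow described, for concreteness (the input path and column "x" are hypothetical):

    library(data.table)
    library(h2o)

    h2o.init(max_mem_size = "128g")    # the 128 GB allocation mentioned above

    dt <- fread("big_input.txt")       # fast: under a minute
    dt[, x_diff := c(NA, diff(x))]     # a couple of row/col transformations
    dt[, x_lag  := shift(x)]           # data.table's lag

    hf <- as.h2o(dt)                   # this single call is what takes hours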

Question:
What can I do to make this take less than an hour to load into h2o? It should take from a minute to a few minutes, no longer.

What I have tried:

  • bumping RAM up to 128 GB in "h2o.init"
  • using slam, data.table, and options( ...
  • converting to "as.data.frame" before "as.h2o"
  • writing to a csv file (R's write.csv chokes and takes forever; it is writing a lot of GB, though, so I understand)
  • writing to sqlite3: too many columns for a table, which is weird
  • checking drive cache/swap to make sure there are enough GB there; perhaps Java is using the cache (still working on this)

Update:
So it looks like my only option is to make a giant text file and then use "h2o.importFile(...)" on it. I'm up to 15 GB written.
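A sketch of that route, with placeholder paths, using data.table::fwrite rather than write.csv since the latter chokes at this size:

    # write once with data.table's fast multi-threaded writer
    data.table::fwrite(dt, "/tmp/giant.csv")

    # let the cluster parse the file directly
    hf <- h2o.importFile("/tmp/giant.csv")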

Update2:
It is a hideous csv file, at ~22 GB (~2.4M rows, ~2300 cols). For what it's worth, it took from 12:53 PM until 2:44 PM to write the csv file. Importing it afterwards was substantially faster.

Answer 1:

Think of as.h2o() as a convenience function that does these steps (a rough manual equivalent is sketched after the list):

  1. converts your R data to a data.frame, if not already one.
  2. saves that data.frame to a temp file on local disk (it will use data.table::fwrite() if available (*), otherwise write.csv())
  3. calls h2o.uploadFile() on that temp file
  4. deletes the temp file
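The manual equivalent of those four steps, roughly ("my_data" is a placeholder for your R object):

    df  <- as.data.frame(my_data)      # step 1: coerce to data.frame
    tmp <- tempfile(fileext = ".csv")
    data.table::fwrite(df, tmp)        # step 2: write to a local temp file
    hf  <- h2o.uploadFile(tmp)         # step 3: stream it to the cluster
    unlink(tmp)                        # step 4: clean up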

As your updates say, writing huge data files to disk can take a while. But the other pain point here is using h2o.uploadFile() instead of the quicker h2o.importFile(). The deciding factor between them is visibility:

  • With h2o.uploadFile() your client has to be able to see the file.
  • With h2o.importFile() your cluster has to be able to see the file.

When your client is running on the same machine as one of your cluster nodes, your data file is visible to both client and cluster, so always prefer h2o.importFile(). (It does a multi-threaded import.)
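In code the two calls look nearly identical; the difference is which machine must be able to read the path (both paths below are placeholders):

    # file on the client machine: R streams it over to the cluster
    hf <- h2o.uploadFile("C:/local/data.csv")

    # file visible to the cluster nodes: parsed in parallel, much faster
    hf <- h2o.importFile("/shared/data.csv")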

Another couple of tips: only bring into the R session the data you actually need there. And remember that both R and H2O are column-oriented, so cbind can be quick. If you just need to process 100 of your 2300 columns in R, keep them in one csv file and the other 2200 columns in another csv file. Then h2o.cbind() them after loading each into H2O.
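A sketch of that split-and-cbind approach (the file name and the frame holding the R-processed columns are hypothetical):

    # the 100 columns that needed R-side processing
    processed <- as.h2o(processed_df)

    # the 2200 untouched columns, straight from disk
    rest <- h2o.importFile("untouched_columns.csv")

    # column-wise bind: cheap, because H2O stores data by column
    full <- h2o.cbind(processed, rest)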

*: Use h2o:::as.h2o.data.frame (without parentheses) to see the actual code. For data.table writing you need to first do options(h2o.use.data.table = TRUE); you can also optionally switch it on/off with the h2o.fwrite option.
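The footnote's pointers as code (the option names come from the footnote above; treat the exact values as illustrative):

    # print the source of the conversion method
    h2o:::as.h2o.data.frame

    # opt in to data.table-based writing
    options("h2o.use.data.table" = TRUE)

    # optionally toggle fwrite itself on/off
    options("h2o.fwrite" = TRUE)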