-->

H2O model will not fit in driver node's memory error

Posted 2019-06-24 19:51

Question:

I ran a GBM model through R code in H2O and got the error below. The same code was running fine a couple of weeks ago. Is this an error on the H2O side, or a configuration problem on the user's system?

water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: gbm-2017-04-18-15-29-53. Details: ERRR on field: _ntrees: The tree model will not fit in the driver node's memory (23.2 MB per tree x 1000 > 3.32 GB) - try decreasing ntrees and/or max_depth or increasing min_rows!
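
The arithmetic behind this message can be checked by hand. The sketch below uses base R and only the three figures quoted in the error text. The conversion 1 GB = 1024 MB is an assumption about how H2O rounds its figures, and the variable names are illustrative.

```r
# Figures copied from the error message.
mb_per_tree <- 23.2   # estimated size of one tree, in MB
ntrees      <- 1000   # requested number of trees
limit_gb    <- 3.32   # memory limit reported in the message, in GB

# Estimated size of the whole model in GB (assumes 1 GB = 1024 MB).
model_gb <- mb_per_tree * ntrees / 1024

# Largest ntrees that stays under the limit at the same per-tree size.
max_trees <- as.integer(floor(limit_gb * 1024 / mb_per_tree))

cat(sprintf("Estimated model size: %.2f GB (limit %.2f GB)\n", model_gb, limit_gb))
cat(sprintf("Trees that would fit at this per-tree size: %d\n", max_trees))
```

At the same per-tree size, roughly 146 trees would fit under the reported limit, which is why the message suggests lowering ntrees or max_depth, or raising min_rows.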

Answer 1:

The fix that worked for me was to set both the min and max memory sizes when initializing H2O. For example:

This fails when neither the min nor the max memory size is specified:

library(h2o)
localH2O <- h2o.init(ip='localhost', nthreads=-1)

INFO: Java heap totalMemory: 1.92 GB
INFO: Java heap maxMemory: 26.67 GB
INFO: Java version: Java 1.8.0_121 (from Oracle Corporation)
INFO: JVM launch parameters: [-ea]
INFO: OS version: Linux 3.10.0-327.el7.x86_64 (amd64)
INFO: Machine physical memory: 1.476 TB

This fails when specifying only max memory size:

localH2O <- h2o.init(ip='localhost', nthreads=-1,
                     max_mem_size='200G')

INFO: Java availableProcessors: 64
INFO: Java heap totalMemory: 1.92 GB
INFO: Java heap maxMemory: 177.78 GB
INFO: Java version: Java 1.8.0_121 (from Oracle Corporation)
INFO: JVM launch parameters: [-Xmx200G, -ea]
INFO: OS version: Linux 3.10.0-327.el7.x86_64 (amd64)
INFO: Machine physical memory: 1.476 TB

This is successful when specifying both min and max memory sizes:

localH2O <- h2o.init(ip='localhost', nthreads=-1,
                     min_mem_size='100G', max_mem_size='200G')

INFO: Java availableProcessors: 64
INFO: Java heap totalMemory: 95.83 GB
INFO: Java heap maxMemory: 177.78 GB
INFO: Java version: Java 1.8.0_121 (from Oracle Corporation)
INFO: JVM launch parameters: [-Xms100G, -Xmx200G, -ea]
INFO: OS version: Linux 3.10.0-327.el7.x86_64 (amd64)
INFO: Machine physical memory: 1.476 TB


Answer 2:

The 3.32 GB number in your post is calculated from activity in the H2O job, so it's hard to validate directly without knowing what happened in your job. 40 GB per node is quite different from 3.32 GB though, so do the following to sanity-check the job.

After your H2O Hadoop job completes, you can take a look at the YARN logs to confirm the container is really getting the amount of memory you expect.

Use the following command (which is printed for you by the h2odriver output after the run completes):

yarn logs -applicationId application_nnn_nnn

For me, the (lightly pruned) output for one of the H2O node containers looks like this:

Container: container_e20_1487032509333_2085_01_000004 on mr-0xd4.0xdata.loc_45454
===================================================================================
LogType:stderr
Log Upload Time:Sat Apr 22 07:58:13 -0700 2017
...

LogType:stdout
Log Upload Time:Sat Apr 22 07:58:13 -0700 2017
LogLength:7517
Log Contents:
POST 0: Entered run
POST 11: After setEmbeddedH2OConfig
04-22 07:57:56.979 172.16.2.184:54323    11976  main      INFO: ----- H2O started  -----
04-22 07:57:57.011 172.16.2.184:54323    11976  main      INFO: Build git branch: rel-turing
04-22 07:57:57.011 172.16.2.184:54323    11976  main      INFO: Build git hash: 34b83da423d26dfbcc0b35c72714b31e80101d49
04-22 07:57:57.011 172.16.2.184:54323    11976  main      INFO: Build git describe: jenkins-rel-turing-8
04-22 07:57:57.011 172.16.2.184:54323    11976  main      INFO: Build project version: 3.10.0.8 (latest version: 3.10.4.5)
04-22 07:57:57.011 172.16.2.184:54323    11976  main      INFO: Build age: 6 months and 11 days
04-22 07:57:57.012 172.16.2.184:54323    11976  main      INFO: Built by: 'jenkins'
04-22 07:57:57.012 172.16.2.184:54323    11976  main      INFO: Built on: '2016-10-10 13:45:37'
04-22 07:57:57.012 172.16.2.184:54323    11976  main      INFO: Java availableProcessors: 32
04-22 07:57:57.012 172.16.2.184:54323    11976  main      INFO: Java heap totalMemory: 9.86 GB
04-22 07:57:57.012 172.16.2.184:54323    11976  main      INFO: Java heap maxMemory: 9.86 GB
04-22 07:57:57.012 172.16.2.184:54323    11976  main      INFO: Java version: Java 1.7.0_67 (from Oracle Corporation)

Note that the application master container log output looks different, so just find the output for any one of the H2O node containers.

Look for the line "Java heap maxMemory". In my case, I requested '-mapperXmx 10g' on the command line, so this looks good. 9.86 GB is close to '10g' given a little JVM overhead.

If it's not as you expect, you have a Hadoop configuration problem: some Hadoop setting is overriding the amount of memory you are requesting on the command line.
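
The same check can be scripted once the log text has been saved. The following is a minimal sketch in base R, assuming the log lines are already in a character vector; `heap_max_gb`, `heap_matches_request` and the 15% tolerance are hypothetical names and values chosen for illustration, not part of H2O or YARN.

```r
# Extract the first "Java heap maxMemory" value (in GB) from H2O log lines.
# Returns NA if no matching line is found or the value is not reported in GB.
heap_max_gb <- function(log_lines) {
  pattern <- "Java heap maxMemory: ([0-9.]+) GB"
  hits <- regmatches(log_lines, regexec(pattern, log_lines))
  vals <- vapply(
    hits,
    function(m) if (length(m) == 2) as.numeric(m[2]) else NA_real_,
    numeric(1)
  )
  vals <- vals[!is.na(vals)]
  if (length(vals) == 0) NA_real_ else vals[1]
}

# Hypothetical helper: TRUE when the reported heap is within `tolerance`
# (a fraction of the requested size) of what was asked for.
heap_matches_request <- function(reported_gb, requested_gb, tolerance = 0.15) {
  !is.na(reported_gb) &&
    abs(reported_gb - requested_gb) / requested_gb <= tolerance
}

# Lines copied from the container log shown in this answer.
log_lines <- c(
  "04-22 07:57:57.012 172.16.2.184:54323    11976  main      INFO: Java availableProcessors: 32",
  "04-22 07:57:57.012 172.16.2.184:54323    11976  main      INFO: Java heap totalMemory: 9.86 GB",
  "04-22 07:57:57.012 172.16.2.184:54323    11976  main      INFO: Java heap maxMemory: 9.86 GB"
)

reported <- heap_max_gb(log_lines)                      # heap the JVM reports
ok <- heap_matches_request(reported, requested_gb = 10) # '-mapperXmx 10g'
print(ok)
```

The tolerance is deliberately loose because the JVM reports somewhat less than the requested -Xmx value. A result of FALSE would point to the Hadoop configuration problem described above.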



Tags: model h2o gbm