Estimating required memory for Scala Spark job

Published 2019-08-04 01:56

Question:

I'm attempting to estimate how much memory my Spark job will require.

When I run the job I receive this exception:

15/02/12 12:01:08 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:20661+20661
15/02/12 12:01:08 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:61983+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:0+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:61983+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:41322+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:20661+20661
15/02/12 12:01:11 ERROR executor.Executor: Exception in task 2.0 in stage 0.0 (TID 2)
java.lang.OutOfMemoryError: Java heap space

Many more "INFO rdd.HadoopRDD: Input split: ..." messages are printed; they have been truncated here for brevity.

I'm logging the computations, and after approximately 1,000,000 calculations I receive the above exception.

The number of calculations required to finish the job is 64,000,000.

Currently I'm using 2 GB of memory. Does this mean that running this job in memory without any further code changes will require 2 GB * 64 = 128 GB, or is this much too simplistic a method of anticipating the required memory?

How is each input split, such as "file:/c:/data/example.txt:20661+20661", generated? These are not added to the file system, as "file:/c:/data/example.txt:20661+20661" does not exist on my local machine.

Answer 1:

To estimate the amount of memory required, I've used this method:

Use http://code.google.com/p/memory-measurer/ as described in "Calculate size of Object in Java".

Once it is set up, you can use the code below to estimate the size of a Scala collection, which in turn gives an indication of the memory required by the Spark application:

// Requires memory-measurer (and its Guava dependency) on the classpath
import objectexplorer.{MemoryMeasurer, ObjectGraphMeasurer}

// Note: `extends Application` is deprecated; use `App` instead
object ObjectSizeDriver extends App {

  val toMeasure = List(1, 2, 3, 4, 5, 6)

  // Footprint broken down into object, reference, and primitive counts
  println(ObjectGraphMeasurer.measure(toMeasure))
  // Total retained size in bytes
  println(MemoryMeasurer.measureBytes(toMeasure))

}
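
If setting up memory-measurer is inconvenient, a dependency-free (if much cruder) alternative is to compare JVM heap usage before and after building a representative sample of the data, then scale linearly to the full job. This is a sketch, not the method from the answer above: `HeapDeltaEstimator` and the sample sizes are illustrative, and GC nondeterminism makes the result an order-of-magnitude figure only.

```scala
object HeapDeltaEstimator {

  /** Used heap in bytes, after asking the JVM to collect garbage first. */
  def usedHeap(): Long = {
    System.gc()
    val rt = Runtime.getRuntime
    rt.totalMemory() - rt.freeMemory()
  }

  /** Rough retained size of the value produced by `build`, via heap delta.
    * GC is not deterministic, so treat this as an order-of-magnitude
    * estimate rather than an exact byte count. */
  def estimateBytes[A](build: => A): Long = {
    val before = usedHeap()
    val sample = build            // hold a strong reference while measuring
    val after  = usedHeap()
    require(sample != null)       // keep `sample` live until after measuring
    after - before
  }

  def main(args: Array[String]): Unit = {
    // Measure a 1,000,000-element sample (matching where the OOM occurred)
    val sampleSize = 1000000
    val bytes = estimateBytes((1 to sampleSize).toList)
    println(s"$sampleSize-element sample ~ $bytes bytes")

    // Scale linearly to the full 64,000,000-calculation job; this assumes
    // memory grows linearly with element count, which may not hold
    val fullMb = bytes * (64000000.0 / sampleSize) / (1024 * 1024)
    println(f"Estimated full-job footprint ~ $fullMb%.0f MB")
  }
}
```

Note that this measures the driver-side size of a collection; per-executor memory in Spark also depends on partitioning, serialization, and shuffle buffers, so treat the scaled figure as a lower bound.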