I'm attempting to work out how much memory will be required by my Spark job.
When I run the job I receive the following exception:
15/02/12 12:01:08 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:20661+20661
15/02/12 12:01:08 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:61983+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:0+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:61983+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:41322+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:20661+20661
15/02/12 12:01:11 ERROR executor.Executor: Exception in task 2.0 in stage 0.0 (TID 2)
java.lang.OutOfMemoryError: Java heap space
Many more "INFO rdd.HadoopRDD: Input split: ..." messages like the ones above are printed; I've truncated them here for brevity.
I'm logging the computations, and after approximately 1'000'000 calculations I receive the above exception.
The job requires 64'000'000 calculations in total to finish.
Currently I'm using 2GB of memory, so does this mean that running this job in memory without any further code changes would require 2GB * 64 = 128GB, or is that far too simplistic a way of estimating the required memory?
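For context, here is roughly how the job is set up, together with the back-of-the-envelope arithmetic behind the 128GB figure. This is only a sketch, not my exact code: the application name, master URL and the way the 2GB is configured are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the setup; app name, master and memory setting are placeholders.
val conf = new SparkConf()
  .setAppName("MemoryEstimation")
  .setMaster("local[4]")
  .set("spark.executor.memory", "2g") // the 2GB referred to above
val sc = new SparkContext(conf)

// The input file from the log output above.
val lines = sc.textFile("file:///c:/data/example.txt")

// The naive estimate I'm asking about:
val currentMemoryGb = 2L        // memory the job runs with now
val calcsBeforeOom  = 1000000L  // roughly where the OutOfMemoryError occurs
val totalCalcs      = 64000000L // calculations needed to finish the whole job
val naiveEstimateGb = currentMemoryGb * (totalCalcs / calcsBeforeOom) // = 2 * 64 = 128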
How is each input split, such as "file:/c:/data/example.txt:20661+20661", generated? These do not seem to be added to the file system, since "file:/c:/data/example.txt:20661+20661" does not exist on my local machine.
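If it helps, this is roughly how the file is being read. I'm guessing that the ":20661+20661" suffix is a byte offset and length within example.txt rather than a separate file, and that the number of partitions controls how many such splits appear, but that guess is exactly what I'd like confirmed. The partition count of 4 below is my assumption based on the four distinct split ranges in the log.

// (Uses the same SparkContext `sc` as in the sketch above.)
// The second argument (minimum number of partitions) is my guess at what
// produces the four ~20661-byte splits shown in the log output.
val rdd = sc.textFile("file:///c:/data/example.txt", 4)
println(rdd.partitions.length) // 4 partitions, one per split
rdd.count() // running an action logs one "Input split: ..." line per task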