I have been using PySpark with IPython lately on my server with 24 CPUs and 32 GB of RAM. It is running on only one machine. In my process, I want to collect a huge amount of data, as in the code below:
train_dataRDD = (train.map(lambda x: getTagsAndText(x))
                 .filter(lambda x: x[-1] != [])
                 .flatMap(lambda (x, text, tags): [(tag, (x, text)) for tag in tags])
                 .groupByKey()
                 .mapValues(list))
When I do

training_data = train_dataRDD.collectAsMap()

it gives me an OutOfMemoryError: Java heap space. Also, I cannot perform any further operations on Spark after this error, as it loses the connection with Java and raises Py4JNetworkError: Cannot connect to the java server.
It looks like the heap space is too small. How can I set it to a bigger limit?
EDIT:
Things that I tried before running:
sc._conf.set('spark.executor.memory', '32g').set('spark.driver.memory', '32g').set('spark.driver.maxResultSize', '0')
I changed the Spark options as per the documentation here (if you do Ctrl-F and search for spark.executor.extraJavaOptions): http://spark.apache.org/docs/1.2.1/configuration.html
It says that I can avoid OOMs by setting the spark.executor.memory option. I did the same thing, but it does not seem to be working.
After trying out loads of configuration parameters, I found that only one needs to be changed to enable more heap space: spark.driver.memory. Close your existing Spark application and re-run it. You will not encounter this error again. :)
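For example, here is a minimal sketch (the 24g figure and the app name are just placeholders) of setting it when a plain Python script creates its own context. Note that spark.driver.memory only takes effect if it is in place before the driver JVM starts, so for an interactive pyspark/IPython shell you would instead pass --driver-memory on the command line or put the value in spark-defaults.conf and relaunch:

from pyspark import SparkConf, SparkContext

# spark.driver.memory must be known before the driver JVM is launched;
# setting it on an already-running SparkContext has no effect.
conf = (SparkConf()
        .setAppName('collect-tags-example')      # placeholder app name
        .set('spark.driver.memory', '24g'))      # pick a size that fits your machine
sc = SparkContext(conf=conf)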
I had the same problem with pyspark (installed with brew). In my case it was installed at the path /usr/local/Cellar/apache-spark. The only configuration file I had was in apache-spark/2.4.0/libexec/python/test_coverage/conf/spark-defaults.conf. As suggested here, I created the file spark-defaults.conf in the path /usr/local/Cellar/apache-spark/2.4.0/libexec/conf/spark-defaults.conf and appended to it the line spark.driver.memory 12g.
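Once the file is in place and a new pyspark shell is started, a quick sanity check is to read the value back from the configuration of the sc the shell creates for you; it should report the 12g taken from spark-defaults.conf:

# Confirm the value from spark-defaults.conf was picked up by the new session.
print(sc.getConf().get('spark.driver.memory'))   # expected: '12g'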