I have one master and two slaves, each with 32 GB of RAM, and I'm reading a CSV file with around 18 million records (the first row contains the column headers).
This is the command I am using to run the job:
./spark-submit --master yarn --deploy-mode client --executor-memory 10g <path/to/.py file>
I did the following:
rdd = sc.textFile("<path/to/file>")
h = rdd.first()                              # header row
header_rdd = rdd.filter(lambda l: h in l)    # RDD containing only the header row
data_rdd = rdd.subtract(header_rdd)          # remove the header from the data
data_rdd.first()
I'm getting the following error message:
15/10/12 13:52:03 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 192.168.1.114:51058
15/10/12 13:52:03 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 192.168.1.114:51058
15/10/12 13:52:03 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkYarnAM@192.168.1.114:51058] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/10/12 13:52:03 ERROR cluster.YarnScheduler: Lost executor 1 on hslave2: remote Rpc client disassociated
15/10/12 13:52:03 INFO scheduler.TaskSetManager: Re-queueing tasks for 1 from TaskSet 3.0
15/10/12 13:52:03 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@hslave2:58555] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/10/12 13:52:03 WARN scheduler.TaskSetManager: Lost task 6.6 in stage 3.0 (TID 208, hslave2): ExecutorLostFailure (executor 1 lost)
This error came up while the rdd.subtract() was running. I then modified the code, removed the rdd.subtract(), and replaced it with rdd.filter().

Modified code:
rdd = sc.textFile("<path/to/file>")
h = rdd.first()
data_rdd = rdd.filter(lambda l: h not in l)
But I got the same error.
Does anyone know why the executors are getting lost?
Is it because the machines in the cluster don't have enough memory?
I got an executor lost error because I was using sc.wholeTextFiles() and one of my input files was 149 MB. I don't think 149 MB is actually very large, but it was enough to make the executor fail.
This isn't a Spark bug per se; it's more likely related to your Java, YARN, and Spark config file settings.
See http://apache-spark-user-list.1001560.n3.nabble.com/Executor-Lost-Failure-td18486.html
You'll want to increase your Java memory, increase your Akka frame size, increase the Akka timeout settings, etc.
Try the following spark.conf:
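Something along these lines (these are standard Spark 1.x properties; the values are only illustrative starting points, so tune them for your cluster):

spark.executor.memory              10g
spark.driver.memory                4g
spark.yarn.executor.memoryOverhead 1024
spark.akka.frameSize               128
spark.akka.timeout                 200
spark.network.timeout              300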
You might also want to play around with how many partitions you are requesting inside your Spark program, and you may want to add some partitionBy(partitioner) statements to your RDDs, so your code might be something like this:
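A rough sketch: the partition count below is a made-up number to tune for your data size and executor count, and because partitionBy only works on key-value RDDs, the records are keyed by their first CSV column purely as an illustration.

numPartitions = 100  # illustrative; tune for your data and cluster

rdd = sc.textFile("<path/to/file>", minPartitions=numPartitions)
h = rdd.first()
data_rdd = rdd.filter(lambda l: l != h)  # drop the header row

# key each record so it can be partitioned by a custom key
keyed_rdd = data_rdd.map(lambda l: (l.split(",")[0], l))
keyed_rdd = keyed_rdd.partitionBy(numPartitions)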
Finally, you may need to play around with your spark-submit command and add parameters for the number of executors, executor memory, and driver memory, for example:
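The numbers here are just a guess for 32 GB worker nodes, not a recommendation; adjust them to what your cluster can actually hold.

./spark-submit --master yarn --deploy-mode client \
    --num-executors 4 \
    --executor-cores 4 \
    --executor-memory 10g \
    --driver-memory 4g \
    <path/to/.py file>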