I have a large data called "edges"
org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[(String, Int)]] = MappedRDD[27] at map at <console>:52
When I was working in standalone mode, I was able to collect, count and save this file. Now, on a cluster, I'm getting this error
edges.count
...
Serialized task 28:0 was 12519797 bytes which exceeds spark.akka.frameSize
(10485760 bytes). Consider using broadcast variables for large values.
Same with .saveAsTextFile("edges")
This is from the spark-shell. I have tried using the option
--driver-java-options "-Dspark.akka.frameSize=15"
But when I do that, it just hangs indefinitely. Any help would be appreciated.
** EDIT **
My standalone mode was on Spark 1.1.0 and my cluster is Spark 1.0.1.
Also, the hanging occurs when I go to count, collect or saveAs* the RDD, but defining it or doing filters on it work just fine.
The "Consider using broadcast variables for large values" error message usually indicates that you've captured some large variables in function closures. For example, you might have written something like
val someBigObject = ...
rdd.mapPartitions { x => doSomething(someBigObject, x) }.count()
which causes someBigObject
to be captured and serialized with your task. If you're doing something like that, you can use a broadcast variable instead, which will cause only a reference to the object to be stored in the task itself, while the actual object data will be sent separately.
In Spark 1.1.0+, it isn't strictly necessary to use broadcast variables for this, since tasks will automatically be broadcast (see SPARK-2521 for more details). There are still reasons to use broadcast variables (such as sharing a big object across multiple actions / jobs), but you won't need to use it to avoid frame size errors.
Another option is to increase the Akka frame size. In any Spark version, you should be able to set the spark.akka.frameSize
setting in SparkConf
prior to creating your SparkContext. As you may have noticed, though, this is a little harder in spark-shell
, where the context is created for you. In newer versions of Spark (1.1.0 and higher), you can pass --conf spark.akka.frameSize=16
when launching spark-shell
. In Spark 1.0.1 or 1.0.2, you should be able to pass --driver-java-options "-Dspark.akka.frameSize=16"
instead.