As far as I understand, when using the Spark Scala interface we have to be careful not to serialize a full object unnecessarily when only one or two of its attributes are needed (http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/).
How does this work when using PySpark? If I have a class as follows:
class C0(object):
    def func0(self, arg):
        ...

    def func1(self, rdd):
        result = rdd.map(lambda x: self.func0(x))
        return result
Does this result in pickling the full C0 instance? If so, what's the correct way to avoid it?
Thanks.
This does result in pickling of the full C0 instance, according to this documentation: http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark.
In order to avoid it, do something like:
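A minimal sketch of that pattern (the names below are only illustrative): either move func0 out of the class so the lambda closes over a plain module-level function, or copy just the attributes the closure needs into local variables before calling map.

def func0(arg):
    # module-level function: the lambda below closes over this plain
    # function, so only the function itself gets serialized
    return arg

class C0(object):
    def func1(self, rdd):
        return rdd.map(lambda x: func0(x))

# Or, if the closure only needs a couple of attributes, copy them into
# local variables so the closure captures the locals and not self:
class C1(object):
    def __init__(self, offset):
        self.offset = offset

    def func1(self, rdd):
        offset = self.offset                  # local copy
        return rdd.map(lambda x: x + offset)

Either way, nothing in the lambda mentions self, so Spark only has to ship the function (plus, in the second case, the captured offset value) to the executors.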
Moral of the story: avoid the self keyword anywhere in a map call. Spark can be smart about serializing a single function if it can evaluate that function in a local closure, but any reference to self forces Spark to serialize your entire object.
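One caveat on that rule: copying the method itself into a local variable is not enough, because a bound method still carries a reference to its instance; the programming guide linked above makes the same point for Python. A sketch of the trap, reusing the C0 shape from the question:

class C0(object):
    def func0(self, arg):
        return arg

    def func1(self, rdd):
        f = self.func0                    # f is a bound method: it still
        return rdd.map(lambda x: f(x))    # holds a reference to self, so the
                                          # whole C0 instance is pickled anyway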