As far as I understand, when using the Spark Scala interface we have to be careful not to serialize a full object unnecessarily when only one or two of its attributes are needed (http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/).
How does this work when using PySpark? If I have a class as follows:
class C0(object):

    def func0(arg):
        ...

    def func1(rdd):
        result = rdd.map(lambda x: self.func0(x))
Does this result in pickling the full C0 instance? If so, what is the correct way to avoid it?
Thanks.
This does result in pickling the full C0 instance, according to the documentation on passing functions to Spark: http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark.
In order to avoid it, do something like:
class C0(object):

    def func0(self, arg):  # added self
        ...

    def func1(self, rdd):  # added self
        func = self.func0
        result = rdd.map(lambda x: func(x))
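For reference, here is a minimal runnable sketch of that pattern; the doubling behaviour of func0, the local[2] master, and the sample data are hypothetical, chosen only to make the example self-contained:

from pyspark import SparkContext

class C0(object):

    def func0(self, arg):
        return arg * 2  # hypothetical: just doubles its input

    def func1(self, rdd):
        func = self.func0                  # bind the method to a local name
        return rdd.map(lambda x: func(x))  # the lambda refers to func, not self

if __name__ == "__main__":
    sc = SparkContext("local[2]", "closure-demo")
    print(C0().func1(sc.parallelize([1, 2, 3])).collect())  # [2, 4, 6]
    sc.stop()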
Moral of the story: avoid the self keyword anywhere in a map call. Spark can be smart about serializing a single function if it can compute that function in a local closure, but any reference to self forces Spark to serialize your entire object.
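The same advice applies to instance attributes. The programming guide linked above recommends copying a field into a local variable before using it in the closure, so that the closure captures only that value and not self. A small sketch of that pattern, with a hypothetical multiplier attribute:

from pyspark import SparkContext

class Scaler(object):

    def __init__(self, multiplier):
        self.multiplier = multiplier

    def scale(self, rdd):
        m = self.multiplier              # copy the field into a local variable
        return rdd.map(lambda x: x * m)  # the closure captures only m, not self

if __name__ == "__main__":
    sc = SparkContext("local[2]", "field-copy-demo")
    print(Scaler(10).scale(sc.parallelize([1, 2, 3])).collect())  # [10, 20, 30]
    sc.stop()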