As far as I understand, when using the Spark Scala interface we have to be careful not to serialize a full object unnecessarily when only one or two of its attributes are needed (http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/).
How does this work when using PySpark? If I have a class as follows:
class C0(object):

    def func0(arg):
        ...

    def func1(rdd):
        result = rdd.map(lambda x: self.func0(x))
Does this result in pickling the full C0 instance? If so, what is the correct way to avoid it?
Thanks.
This does result in pickling the full C0 instance, according to the documentation on passing functions to Spark: http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark.
In order to avoid it, do something like:
class C0(object):

    def func0(self, arg):  # added self
        ...

    def func1(self, rdd):  # added self
        func = self.func0
        result = rdd.map(lambda x: func(x))
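For reference, here is a minimal runnable sketch of that pattern; the doubling behaviour of func0, the local[2] master, and the sample data are hypothetical, chosen only to make the example self-contained:

from pyspark import SparkContext

class C0(object):

    def func0(self, arg):
        return arg * 2  # hypothetical: just doubles its input

    def func1(self, rdd):
        func = self.func0                  # bind the method to a local name
        return rdd.map(lambda x: func(x))  # the lambda refers to func, not self

if __name__ == "__main__":
    sc = SparkContext("local[2]", "closure-demo")
    print(C0().func1(sc.parallelize([1, 2, 3])).collect())  # [2, 4, 6]
    sc.stop()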
Moral of the story: avoid the self keyword anywhere in a map call. Spark can be smart about serializing a single function if it can compute that function in a local closure, but any reference to self forces Spark to serialize your entire object.
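The same advice applies to instance attributes. The programming guide linked above recommends copying a field into a local variable before using it in the closure, so that the closure captures only that value and not self. A small sketch of that pattern, with a hypothetical multiplier attribute:

from pyspark import SparkContext

class Scaler(object):

    def __init__(self, multiplier):
        self.multiplier = multiplier

    def scale(self, rdd):
        m = self.multiplier              # copy the field into a local variable
        return rdd.map(lambda x: x * m)  # the closure captures only m, not self

if __name__ == "__main__":
    sc = SparkContext("local[2]", "field-copy-demo")
    print(Scaler(10).scale(sc.parallelize([1, 2, 3])).collect())  # [10, 20, 30]
    sc.stop()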