Why is Map not serializable in Scala?


Question:

I am new to Scala. Why is the result of "map" not serializable, and how can I make it serializable? For example, my code is as follows:

val data = sc.parallelize(List(1,4,3,5,2,3,5))

def myfunc(iter: Iterator[Int]) : Iterator[Int] = {
  val lst = List(("a", 1),("b", 2),("c",3), ("a",2))
  var res = List[Int]()
  while (iter.hasNext) {
    val cur = iter.next
    val a = lst.groupBy(x => x._1).mapValues(_.size)
    //val b= a.map(x => x._2)
    res = res ::: List(cur)
  }
  res.iterator
}

data.mapPartitions(myfunc).collect

If I uncomment the line

val b= a.map(x => x._2)

the code throws the following exception:

org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$2
Serialization stack:
    - object not serializable (class: scala.collection.immutable.MapLike$$anon$2, value: Map(1 -> 3))
    - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: a, type: interface scala.collection.immutable.Map)

Thank you very much.

Answer 1:

It's a well-known Scala bug: https://issues.scala-lang.org/browse/SI-7005 (Map#mapValues is not serializable).

We hit this problem in our Spark apps; appending map(identity) after mapValues solves it:

rdd.groupBy(_.segment).mapValues(v => ...).map(identity)
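
Applied to the code in the question, the same fix would look roughly like this (a sketch: only the mapValues line changes, and map(identity) copies the lazy view into a plain, serializable Map):

def myfunc(iter: Iterator[Int]): Iterator[Int] = {
  val lst = List(("a", 1), ("b", 2), ("c", 3), ("a", 2))
  var res = List[Int]()
  while (iter.hasNext) {
    val cur = iter.next
    // map(identity) materializes the mapValues view into a concrete Map
    val a = lst.groupBy(x => x._1).mapValues(_.size).map(identity)
    val b = a.map(x => x._2)
    res = res ::: List(cur)
  }
  res.iterator
}

data.mapPartitions(myfunc).collect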


Answer 2:

The implementation of mapValues (shown below) explains the error: it does not build a new map but only a lazy view over the original one, and the class backing that view is not Serializable, so the task fails as soon as Spark has to ship it to an executor. In other situations this view behaviour is an advantage (the function is applied only when a value is actually read), but not when the result has to be serialized.

protected class MappedValues[C](f: B => C) extends AbstractMap[A, C] with DefaultMap[A, C] {
  override def foreach[D](g: ((A, C)) => D): Unit = for ((k, v) <- self) g((k, f(v)))
  def iterator = for ((k, v) <- self.iterator) yield (k, f(v))
  override def size = self.size
  override def contains(key: A) = self.contains(key)
  def get(key: A) = self.get(key).map(f)
}
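
A minimal sketch of the difference (assuming Scala 2.12 or earlier, where mapValues still returns such a view; the serializes helper is hypothetical and only used for this demonstration): serializing the view fails, while a copy materialized with map(identity) succeeds.

import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical helper: returns true if the object survives Java serialization
def serializes(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    true
  } catch {
    case _: NotSerializableException => false
  }

val view = Map("a" -> 1, "b" -> 2).mapValues(_ + 1) // lazy view, backed by a non-serializable anonymous class
val copy = view.map(identity)                       // concrete immutable Map

serializes(view) // false: this is the NotSerializableException Spark reports
serializes(copy) // true: plain immutable Maps are Serializable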


Answer 3:

Have you tried running the same code in a standalone application? I suspect this is an issue with the Spark shell: the $iwC$$iwC entries in your serialization stack suggest the shell's line-wrapper objects are being pulled into the task closure. If you want to make it work in the Spark shell, you might try wrapping the definition of myfunc and its application in curly braces, like so:

val data = sc.parallelize(List(1,4,3,5,2,3,5))

val result = { 
  def myfunc(iter: Iterator[Int]) : Iterator[Int] = {
    val lst = List(("a", 1),("b", 2),("c",3), ("a",2))
    var res = List[Int]()
    while (iter.hasNext) {
      val cur = iter.next
      val a = lst.groupBy(x => x._1).mapValues(_.size)
      val b= a.map(x => x._2)
      res = res ::: List(cur)
    }
    res.iterator
  }
  data.mapPartitions(myfunc).collect
}