Question:

Is there any Spark function that allows splitting a collection into several RDDs according to some criteria? Such a function would make it possible to avoid excessive iteration. For example:

import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val logFile = "file.txt"
  val conf = new SparkConf().setAppName("Simple Application")
  val sc = new SparkContext(conf)
  val logData = sc.textFile(logFile, 2).cache()
  // Each filter + save below is a separate pass over logData.
  val lineAs = logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
  val lineBs = logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")
}

In this example I have to iterate over `logData` twice just to write the results to two separate files:

    val lineAs = logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
    val lineBs = logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")

It would be nice instead to have something like this:

    val resultMap = logData.map(line => if (line.contains("a")) ("a", line) else if (line.contains("b")) ("b", line) else (" - ", line))
    resultMap.writeByKey("a", "linesA.txt") 
    resultMap.writeByKey("b", "linesB.txt")

Any such thing?

Answer 1:

Have a look at the related question "Write to multiple outputs by key Spark - one Spark job".

You can flatMap an RDD with a function like the following and then do a groupBy on the key.

// Emit one (word, line) pair for every filter word that the line contains.
def multiFilter(words: List[String], line: String): List[(String, String)] =
  for { word <- words if line.contains(word) } yield (word, line)

val filterWords = List("a", "b")
val filteredRDD = logData.flatMap(line => multiFilter(filterWords, line))
val groupedRDD = filteredRDD.groupBy(_._1)

But depending on the size of your input RDD, you may or may not see any performance gain, because every groupBy operation involves a shuffle.

On the other hand, if you have enough memory in your Spark cluster, you can cache the input RDD, so running multiple filter operations may not be as expensive as you think.
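To make the output of the flatMap step concrete, here is a minimal sketch on plain Scala collections rather than an RDD, so it runs without a cluster. The object name `MultiFilterDemo` and the sample lines are made up for illustration; `List.flatMap` and `List.groupBy` merely stand in for the RDD calls of the same name.

```scala
object MultiFilterDemo {
  // Same helper as in the answer: one (word, line) pair per matching filter word.
  def multiFilter(words: List[String], line: String): List[(String, String)] =
    for { word <- words if line.contains(word) } yield (word, line)

  val filterWords = List("a", "b")

  // Hypothetical sample input standing in for the contents of logData.
  val sampleLines = List("alpha", "bob", "cab", "xyz")

  // flatMap + groupBy on a local List, mirroring the RDD calls in the answer.
  val grouped: Map[String, List[String]] =
    sampleLines
      .flatMap(line => multiFilter(filterWords, line))
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2) }

  def main(args: Array[String]): Unit = {
    println(grouped("a")) // List(alpha, cab)
    println(grouped("b")) // List(bob, cab)
  }
}
```

A line that contains both words ("cab" here) ends up under both keys, which matches the two separate filters in the question. The if/else map sketched in the question would file such a line under "a" only.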

Answer 2:

Maybe something like this would work:

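import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// The ClassTag context bound is needed because flatMap builds an RDD[T].
def singlePassMultiFilter[T: ClassTag](
    rdd: RDD[T],
    f1: T => Boolean,
    f2: T => Boolean,
    level: StorageLevel = StorageLevel.MEMORY_ONLY
): (RDD[T], RDD[T], Boolean => Unit) = {
  // One pass per partition: collect the matches for both predicates at once.
  val tempRDD = rdd.mapPartitions { iter =>
    val abuf1 = ArrayBuffer.empty[T]
    val abuf2 = ArrayBuffer.empty[T]
    for (x <- iter) {
      if (f1(x)) abuf1 += x
      if (f2(x)) abuf2 += x
    }
    Iterator.single((abuf1, abuf2))
  }
  tempRDD.persist(level)
  val rdd1 = tempRDD.flatMap(_._1)
  val rdd2 = tempRDD.flatMap(_._2)
  // The third element lets the caller unpersist tempRDD when done.
  (rdd1, rdd2, (blocking: Boolean) => { tempRDD.unpersist(blocking); () })
}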

Note that an action called on rdd1 (resp. rdd2) will cause tempRDD to be computed and persisted. Since tempRDD holds both buffers, this does practically all the work needed for rdd2 (resp. rdd1) as well: the overhead of the flatMap in the definitions of rdd1 and rdd2 is, I believe, pretty negligible.

You would use singlePassMultiFilter like so:

val (rdd1, rdd2, cleanUp) = singlePassMultiFilter(rdd, f1, f2)
rdd1.persist()    // I'm going to need `rdd1` more later...
println(rdd1.count())
println(rdd2.count())
cleanUp(true)     // I'm done with `rdd2` and `rdd1` has been persisted, so free stuff up...
println(rdd1.distinct().count())

Clearly this could be extended to an arbitrary number of filters, collections of filters, etc.
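As a rough sketch of that extension, the per-partition step can be written against a plain Scala Iterator and a sequence of predicates, which keeps it runnable without Spark. The names `SinglePassSplit` and `splitByPredicates` and the sample data are hypothetical, not part of any Spark API.

```scala
import scala.collection.mutable.ArrayBuffer

object SinglePassSplit {
  // Walks the iterator once and returns one buffer per predicate, in the
  // same order as `predicates`. An element may land in several buffers.
  def splitByPredicates[T](
      iter: Iterator[T],
      predicates: Seq[T => Boolean]
  ): Vector[ArrayBuffer[T]] = {
    val buffers = Vector.fill(predicates.length)(ArrayBuffer.empty[T])
    val indexed = predicates.zipWithIndex
    iter.foreach { x =>
      indexed.foreach { case (p, i) => if (p(x)) buffers(i) += x }
    }
    buffers
  }

  // Hypothetical predicates and input, for illustration only.
  val preds: Seq[String => Boolean] =
    Seq(_.contains("a"), _.contains("b"), _.length > 3)

  val result: Vector[ArrayBuffer[String]] =
    splitByPredicates(Iterator("alpha", "bob", "cab", "xyz"), preds)

  def main(args: Array[String]): Unit = {
    println(result(0).toList) // List(alpha, cab)
    println(result(1).toList) // List(bob, cab)
    println(result(2).toList) // List(alpha)
  }
}
```

Following the pattern of singlePassMultiFilter, one would presumably wrap this as `rdd.mapPartitions(iter => Iterator.single(splitByPredicates(iter, preds)))`, persist the result, and derive the i-th RDD with `flatMap(_(i))`. That wiring is an untested assumption here. As in the two-filter version, every partition's matches are buffered in memory before being emitted.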