In my Spark Streaming app, I have many I/O operations, such as Codis, HBase, etc. I want to make sure there is exactly one connection pool in each executor. How can I do this elegantly? Right now I implement several static classes scattered around, which is hard to manage. What about centralizing them into one class like xxContext, somewhat like SparkContext, and do I need to broadcast it? I know it's good to broadcast a large read-only dataset, but what about these connection pools? Java or Scala are both acceptable.
Question:
Answer 1:
foreachPartition is the best fit here. Sample code snippet:
val dstream = ...
dstream.foreachRDD { rdd =>
  // Loop through each partition in the RDD
  rdd.foreachPartition { partitionOfRecords =>
    // 1. Create (or look up) the connection object/pool for Codis, HBase, etc.
    //    here, so it is instantiated on the executor, not the driver.
    partitionOfRecords.foreach { record =>
      // 2. Write each record to the external client
    }
    // 3. Alternatively, batch-insert the whole partition if the
    //    connector supports writing from an RDD to the external source
  }
  // Use step 2 or step 3 depending on your requirements
}
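To get exactly one pool per executor without broadcasting anything, you can rely on the fact that each executor runs in its own JVM: a lazily initialized static singleton is created at most once per JVM, so each executor ends up with its own single pool. Below is a minimal Java sketch of that idea; the class and method names (ExecutorPools, Connection, borrow, giveBack) are hypothetical stand-ins, not a real Codis or HBase API, and a production version would wrap a real client library instead.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: a lazily initialized, JVM-wide singleton pool.
// Each Spark executor is a separate JVM, so this yields exactly one
// pool per executor; reference it from inside foreachPartition.
public class ExecutorPools {

    // Stand-in for a real client connection (e.g. Codis or HBase).
    public static class Connection {
        public void write(String record) {
            // real I/O would happen here
        }
    }

    // Minimal pool: hand out idle connections, create them on demand.
    public static class ConnectionPool {
        private final ConcurrentLinkedQueue<Connection> idle =
            new ConcurrentLinkedQueue<>();

        public Connection borrow() {
            Connection c = idle.poll();
            return (c != null) ? c : new Connection();
        }

        public void giveBack(Connection c) {
            idle.offer(c);
        }
    }

    // Lazy holder idiom: INSTANCE is created once, on first access,
    // and class loading guarantees thread-safe initialization.
    private static class Holder {
        static final ConnectionPool INSTANCE = new ConnectionPool();
    }

    public static ConnectionPool get() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        // The same pool instance is returned on every call in this JVM.
        System.out.println(ExecutorPools.get() == ExecutorPools.get());
    }
}
```

Inside rdd.foreachPartition you would then call ExecutorPools.get().borrow(), write the records, and give the connection back, instead of constructing a pool per partition. Broadcasting is the wrong tool here: broadcast variables are for serializable read-only data, and live connections are not serializable.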
There is another SO answer for a similar use case; check the Spark Streaming programming guide section "Design Patterns for using foreachRDD".