In my Spark Streaming app I have many I/O operations, such as codis, HBase, etc. I want to make sure there is exactly one connection pool per executor. How can I do this elegantly? Right now I implement several static classes scattered around, which is hard to manage. Would it be better to centralize them into one class like XxContext, somewhat like SparkContext, and do I need to broadcast it? I know broadcasting is good for large read-only datasets, but what about these connection pools? Java or Scala are both acceptable.
foreachPartition is the best fit here; a sample code snippet follows. See also another SO answer for a similar use case, and check the "Design Patterns for using foreachRDD" section of the Spark Streaming programming guide.
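Here is a minimal sketch of the pattern. It assumes codis is reachable through the standard Jedis client (codis speaks the Redis protocol); the proxy endpoint `codis-proxy-host:19000` and the `saveToCodis` helper are hypothetical names for illustration. The key idea: a `lazy val` inside a singleton object is initialized at most once per JVM, so each executor gets exactly one pool on first use, and nothing needs to be broadcast (pools hold open sockets and are not serializable anyway).

```scala
import org.apache.spark.streaming.dstream.DStream
import redis.clients.jedis.{JedisPool, JedisPoolConfig}

// One pool per executor JVM: the lazy val is initialized on first access
// inside a task, never on the driver, and never shipped over the wire.
object RedisPoolHolder {
  lazy val pool: JedisPool = {
    // Hypothetical codis proxy endpoint; replace with your own.
    val p = new JedisPool(new JedisPoolConfig(), "codis-proxy-host", 19000)
    sys.addShutdownHook(p.close()) // release sockets when the executor JVM exits
    p
  }
}

// Hypothetical write path: persist each record as a key/value pair.
def saveToCodis(stream: DStream[(String, String)]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val jedis = RedisPoolHolder.pool.getResource // borrow one connection per partition
      try {
        records.foreach { case (k, v) => jedis.set(k, v) }
      } finally {
        jedis.close() // returns the borrowed connection to the pool
      }
    }
  }
}
```

The same pattern answers the "XxContext" part of the question: rather than broadcasting anything, add one `lazy val` per backend (e.g. an HBase `Connection` next to the Jedis pool) inside a single holder object, and each executor will lazily build exactly one instance of each.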