In my Spark Streaming app, I have many I/O operations, such as Codis, HBase, etc. I want to make sure there is exactly one connection pool in each executor. How can I do this elegantly? Right now I implement several static classes scattered around, which is hard to manage. What about centralizing them into one class like xxContext, somewhat like SparkContext, and do I need to broadcast it? I know it's good to broadcast a large read-only dataset, but what about these connection pools? Java or Scala are both acceptable.
Question:
Answer 1:
foreachPartition is the best fit here. Sample code snippet:
val dstream = ...
dstream.foreachRDD { rdd =>
  // Loop through each partition in the RDD
  rdd.foreachPartition { partitionOfRecords =>
    // 1. Create (or look up) the connection object/pool for Codis, HBase, etc.
    //    here, so it is instantiated on the executor, not the driver.
    partitionOfRecords.foreach { record =>
      // 2. Write each record to the external client
    }
    // 3. Alternatively, batch-insert the whole partition if the
    //    connector supports writing from an RDD to the external source
  }
  // Use step 2 or step 3 depending on your requirements
}
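To get exactly one pool per executor without broadcasting anything, you can rely on the fact that each executor runs in its own JVM: a lazily initialized static singleton is created at most once per JVM, so each executor ends up with its own single pool. Below is a minimal Java sketch of that idea; the class and method names (ExecutorPools, Connection, borrow, giveBack) are hypothetical stand-ins, not a real Codis or HBase API, and a production version would wrap a real client library instead.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: a lazily initialized, JVM-wide singleton pool.
// Each Spark executor is a separate JVM, so this yields exactly one
// pool per executor; reference it from inside foreachPartition.
public class ExecutorPools {

    // Stand-in for a real client connection (e.g. Codis or HBase).
    public static class Connection {
        public void write(String record) {
            // real I/O would happen here
        }
    }

    // Minimal pool: hand out idle connections, create them on demand.
    public static class ConnectionPool {
        private final ConcurrentLinkedQueue<Connection> idle =
            new ConcurrentLinkedQueue<>();

        public Connection borrow() {
            Connection c = idle.poll();
            return (c != null) ? c : new Connection();
        }

        public void giveBack(Connection c) {
            idle.offer(c);
        }
    }

    // Lazy holder idiom: INSTANCE is created once, on first access,
    // and class loading guarantees thread-safe initialization.
    private static class Holder {
        static final ConnectionPool INSTANCE = new ConnectionPool();
    }

    public static ConnectionPool get() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        // The same pool instance is returned on every call in this JVM.
        System.out.println(ExecutorPools.get() == ExecutorPools.get());
    }
}
```

Inside rdd.foreachPartition you would then call ExecutorPools.get().borrow(), write the records, and give the connection back, instead of constructing a pool per partition. Broadcasting is the wrong tool here: broadcast variables are for serializable read-only data, and live connections are not serializable.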
There is another SO answer for a similar use case; check the Spark Streaming programming guide section "Design Patterns for using foreachRDD".