spark streaming write data to Hbase with python bl

Posted 2019-06-09 16:20

I'm using Spark Streaming with Python to read from Kafka and write to HBase. The job very easily gets blocked at the saveAsNewAPIHadoopDataset stage. In the screenshot (not reproduced here), this stage shows a duration of 8 hours. Does Spark write the data through the HBase API, or does it write directly through the HDFS API?

Tags: apache-spark hbase spark-streaming

1 Answer

等我变得足够好

#2 · 2019-06-09 17:12

A bit late, but here is a similar example that saves an RDD to HBase.

Consider an RDD containing a single line:

{"id":3,"name":"Moony","color":"grey","description":"Monochrome kitty"}

Transform the RDD
We need to transform the RDD into (key, value) pairs with the following contents:

(rowkey, [rowkey, column family, column name, value])

datamap = rdd.map(lambda x: (str(json.loads(x)["id"]),[str(json.loads(x)["id"]),"cfamily","cats_json",x]))
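The mapping above can also be written as a named function that parses each JSON line only once. This is a minimal sketch in plain Python so it can be checked without a cluster; the function name `to_hbase_pair` and the two constants are illustrative names, not part of the original answer, and every input line is assumed to be valid JSON with an "id" field.

```python
import json

COLUMN_FAMILY = "cfamily"    # column family used in the answer
COLUMN_NAME = "cats_json"    # column name (qualifier) used in the answer


def to_hbase_pair(line):
    """Turn one JSON line into (rowkey, [rowkey, family, column, value])."""
    rowkey = str(json.loads(line)["id"])  # parse once and reuse the row key
    return (rowkey, [rowkey, COLUMN_FAMILY, COLUMN_NAME, line])


line = '{"id":3,"name":"Moony","color":"grey","description":"Monochrome kitty"}'
pair = to_hbase_pair(line)
print(pair)
```

With a real RDD this would be used as `datamap = rdd.map(to_hbase_pair)`.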

Save to HBase
We can use the RDD.saveAsNewAPIHadoopDataset function, as in the "PySpark HBase example", to save the RDD to HBase:

datamap.saveAsNewAPIHadoopDataset(conf=conf,keyConverter=keyConv,valueConverter=valueConv)
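The call above relies on `conf`, `keyConv` and `valueConv`, which the answer does not define. The sketch below shows one plausible set of values, modelled on the HBase output example that ships with Spark. It assumes the Spark examples jar (which provides the two `pythonconverters` classes) and the HBase client jars are on the job's classpath; the helper names `build_hbase_write_conf` and `save_to_hbase` are hypothetical, and the ZooKeeper host and table name are placeholders to be supplied by the caller. As far as I understand, with `TableOutputFormat` each record is sent as a Put through the HBase client, not written to HDFS files directly.

```python
# Converter classes from the Spark examples jar (assumed to be on the classpath).
KEY_CONV = ("org.apache.spark.examples.pythonconverters."
            "StringToImmutableBytesWritableConverter")
VALUE_CONV = ("org.apache.spark.examples.pythonconverters."
              "StringListToPutConverter")


def build_hbase_write_conf(zk_quorum, table):
    """Build the Hadoop job configuration for writing to an HBase table."""
    return {
        "hbase.zookeeper.quorum": zk_quorum,   # placeholder host(s)
        "hbase.mapred.outputtable": table,     # target table must already exist
        "mapreduce.outputformat.class":
            "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class":
            "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class":
            "org.apache.hadoop.io.Writable",
    }


def save_to_hbase(rdd, zk_quorum, table):
    """Save an RDD of (rowkey, [rowkey, family, column, value]) pairs."""
    rdd.saveAsNewAPIHadoopDataset(
        conf=build_hbase_write_conf(zk_quorum, table),
        keyConverter=KEY_CONV,
        valueConverter=VALUE_CONV,
    )
```

In a streaming job, a function like this would typically be called per batch, for example from `foreachRDD`, with the transformed RDD as its first argument.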

For the complete code of the working example, see my blog post: pyspark-sparkstreaming hbase.
