I'm using Spark Streaming with Python to read from Kafka and write to HBase, and I found that the job very easily gets blocked at the saveAsNewAPIHadoopDataset stage. As the picture below shows, the duration of this stage is 8 hours. Does Spark write the data through the HBase API, or directly via the HDFS API?
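For context, here is a minimal sketch of the kind of pipeline I am running (the topic, broker, batch interval, and the write_to_hbase helper are placeholders, not my actual job):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-to-hbase")
ssc = StreamingContext(sc, 10)  # 10-second batches (placeholder interval)

# Direct Kafka stream; topic name and broker list are placeholders.
stream = KafkaUtils.createDirectStream(
    ssc, ["my_topic"], {"metadata.broker.list": "broker:9092"})

# Each batch is written out with saveAsNewAPIHadoopDataset inside this
# helper; that write is the stage that shows the 8-hour duration in the UI.
stream.foreachRDD(lambda rdd: write_to_hbase(rdd))  # write_to_hbase: hypothetical helper

ssc.start()
ssc.awaitTermination()
```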
A bit late, but here is a similar example showing how to save an RDD to HBase.
Consider an RDD containing a single line:
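For illustration, assume each record is a small JSON string (the field names below are made up for the example):

```
{"id":3,"name":"Moony","color":"grey","description":"Monochrome kitty"}
```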
Transform the RDD
We need to transform the RDD into a (key, value) pair with the following contents:
(rowkey, [row key, column family, column name, value])
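A minimal sketch of that transformation, assuming the JSON record above, with a hypothetical column family "cfamily" and column name "cats_json":

```python
import json

# Map each JSON string to (rowkey, [rowkey, column family, column name, value]),
# the shape expected by the HBase converters used in the save step below.
datamap = rdd.map(lambda x: (str(json.loads(x)["id"]),
                             [str(json.loads(x)["id"]), "cfamily", "cats_json", x]))
```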
Save to HBase
We can make use of the RDD.saveAsNewAPIHadoopDataset function, as used in this example: PySpark Hbase example to save the RDD to HBase. You can refer to my blog, pyspark-sparkstreaming hbase, for the complete code of the working example.
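A sketch of the save step, assuming an HBase table reachable through a ZooKeeper quorum (the host and table name below are placeholders):

```python
host = "zookeeper-host"  # placeholder ZooKeeper quorum
table = "cats"           # placeholder HBase table name

conf = {"hbase.zookeeper.quorum": host,
        "hbase.mapred.outputtable": table,
        "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}

# Converters shipped with the Spark examples jar; that jar must be on the
# driver and executor classpaths (e.g. via --jars) for these names to resolve.
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

datamap.saveAsNewAPIHadoopDataset(conf=conf,
                                  keyConverter=keyConv,
                                  valueConverter=valueConv)
```

Note that this write path goes through the HBase TableOutputFormat, which issues Puts through the HBase client rather than writing files to HDFS directly.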