There is a fair amount of info online about bulk loading to HBase with Spark streaming using Scala (these two were particularly useful) and some info for Java, but there seems to be a lack of info for doing it with PySpark. So my questions are:
- How can data be bulk loaded into HBase using PySpark?
- Most examples in any language only show a single column per row being upserted. How can I upsert multiple columns per row?
The code I currently have is as follows:
```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    context = SparkContext(appName="PythonHBaseBulkLoader")
    streamingContext = StreamingContext(context, 5)

    stream = streamingContext.textFileStream("file:///test/input")

    stream.foreachRDD(bulk_load)

    streamingContext.start()
    streamingContext.awaitTermination()
```
What I need help with is the bulk load function:

```python
def bulk_load(rdd):
    # ???
```
I've made some progress previously, with many and various errors (as documented here and here).
So after much trial and error, I present here the best I have come up with. It works well, and successfully bulk loads data (using `Put`s or HFiles). I am perfectly willing to believe that it is not the best method, so any comments/other answers are welcome. This assumes you're using a CSV for your data.

**Bulk loading with Puts**
By far the easiest way to bulk load, this simply creates a `Put` request for each cell in the CSV and queues them up to HBase.
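A sketch of `bulk_load` in this style (the quorum, parent znode, and table name `Test` are placeholders for your own cluster; the two converters ship with the Spark examples jar):

```python
def bulk_load(rdd):
    # Placeholder connection settings -- substitute your own quorum,
    # parent znode, and table name.
    conf = {"hbase.zookeeper.quorum": "localhost:2181",
            "zookeeper.znode.parent": "/hbase-unsecure",
            "hbase.mapred.outputtable": "Test",
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}

    # Converters from the Spark examples jar: they turn our Python
    # (key, [row, family, qualifier, value]) tuples into HBase types.
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)  # one (key, value) pair per cell
    load_rdd.saveAsNewAPIHadoopDataset(conf=conf,
                                       keyConverter=keyConv,
                                       valueConverter=valueConv)
```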
The function `csv_to_key_value` is where the magic happens; the value converter we defined above will convert the tuples it emits into HBase `Put`s.
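A minimal sketch, assuming a four-column CSV of the form `rowkey,v1,v2,v3` (the families `f1`/`f2`/`f3` and qualifiers `c1`/`c2`/`c3` are stand-ins for your own schema):

```python
def csv_to_key_value(row):
    cols = row.split(",")  # naive split; assumes no commas inside fields
    # Each cell becomes (rowkey, [rowkey, column family, qualifier, value]).
    # Emitting one tuple per cell, rather than one per row, is what makes
    # multi-column upserts work.
    result = ((cols[0], [cols[0], "f1", "c1", cols[1]]),
              (cols[0], [cols[0], "f2", "c2", cols[2]]),
              (cols[0], [cols[0], "f3", "c3", cols[3]]))
    return result
```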
**Bulk loading with HFiles**
Bulk loading with HFiles is more efficient: rather than a `Put` request for each cell, an HFile is written directly, and the RegionServer is simply told to point to the new HFile. This will use Py4J, so before the Python code we have to write a small Java program:
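Something like this minimal Py4J gateway will do (the class name `GatewayApplication` is my placeholder); its only job is to expose a JVM with HBase on the classpath to the Python side:

```java
import py4j.GatewayServer;

public class GatewayApplication {

    public static void main(String[] args) {
        // Start a Py4J gateway so Python can drive HBase's bulk-load classes.
        GatewayApplication app = new GatewayApplication();
        GatewayServer server = new GatewayServer(app);
        server.start();
    }
}
```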
Compile this, and run it. Leave it running for as long as your streaming is happening. Now update `bulk_load` as follows:
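A sketch of the HFile version; note the new output format and value class. The value converter here is a placeholder: as far as I know, Spark's examples jar does not ship a `KeyValue` converter, so you may need to write your own, modeled on `StringListToPutConverter`. The per-batch output path is also an assumption:

```python
from datetime import datetime

from py4j.java_gateway import JavaGateway

def bulk_load(rdd):
    # Same placeholder connection settings as before; only the output
    # format and the value class change (HFiles hold KeyValues, not Puts).
    conf = {"hbase.zookeeper.quorum": "localhost:2181",
            "zookeeper.znode.parent": "/hbase-unsecure",
            "hbase.mapred.outputtable": "Test",
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.hbase.KeyValue"}

    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    # Hypothetical converter producing KeyValues -- write your own,
    # analogous to StringListToPutConverter.
    valueConv = "com.example.pythonconverters.StringListToKeyValueConverter"

    # HFiles must be written in sorted key order (see the note at the end).
    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)\
                  .sortByKey(True)

    if load_rdd.isEmpty():
        return  # don't write an empty HFile

    # Write the HFile to a fresh per-batch directory...
    path = "/tmp/hfiles" + datetime.now().strftime("%Y%m%d%H%M%S")
    load_rdd.saveAsNewAPIHadoopFile(path,
                                    "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
                                    conf=conf,
                                    keyConverter=keyConv,
                                    valueConverter=valueConv)

    # ...then tell HBase about it through the Py4J gateway we started above.
    gateway = JavaGateway()
    config = dict_to_conf(conf)
    htable = gateway.jvm.org.apache.hadoop.hbase.client.HTable(config, "Test")
    hfile_path = gateway.jvm.org.apache.hadoop.fs.Path(path)
    loader = gateway.jvm.org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles(config)
    loader.doBulkLoad(hfile_path, htable)
```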
Finally, the fairly straightforward `dict_to_conf`:
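A sketch, again going through the gateway to build a real Hadoop `Configuration` from the Python dict:

```python
from py4j.java_gateway import JavaGateway

def dict_to_conf(d):
    gateway = JavaGateway()
    # An empty Hadoop Configuration, populated from the Python dict.
    config = gateway.jvm.org.apache.hadoop.conf.Configuration()
    for key, value in d.items():
        config.set(key, value)
    return config
```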
As you can see, bulk loading with HFiles is more complex than using `Put`s, but depending on your data load it is probably worth it, since once you get it working it's not that difficult.

One last note on something that caught me off guard: HFiles expect the data they receive to be written in lexical order. This is not always guaranteed to be true, especially since "10" < "9". If you have designed your key to be unique, then this can be fixed easily:
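The fix is the `sortByKey` already shown in the HFile version of `bulk_load`; with unique keys, a plain lexical sort is enough:

```python
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
              .flatMap(csv_to_key_value)\
              .sortByKey(True)  # ascending lexical order, as HFiles require
```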