I am trying to learn Spark and Scala. I want to read from HBase, but without MapReduce. I created a simple HBase table, "test", and did 3 puts in it. I want to read it via Spark (without HBaseTest, which uses MapReduce). I tried to run the following commands in the shell:
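import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Get, HTable}
import org.apache.hadoop.hbase.util.Bytes
val numbers = Array(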
new Get(Bytes.toBytes("row1")),
new Get(Bytes.toBytes("row2")),
new Get(Bytes.toBytes("row3")))
val conf = new HBaseConfiguration()
val table = new HTable(conf, "test")
sc.parallelize(numbers, numbers.length).map(table.get).count()
I keep getting this error: org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.hbase.HBaseConfiguration
Can someone help me? How can I create an HTable that uses a serializable configuration?
Thanks
Your problem is that `table` is not serializable (or rather, its member `conf` is not), and you're trying to serialize it by using it inside a `map`. The way you're trying to read HBase isn't quite correct: it looks like you're building some specific Gets and then trying to run them in parallel. Even if you did get this working, it wouldn't scale, because you'd be performing random reads. What you want to do is perform a table scan using Spark; here is a code snippet that should help you do it:

This will give you an RDD containing the NavigableMaps that constitute the rows. Below is how you can change the NavigableMap to a normal Scala map of Strings:
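As a rough sketch of both steps (the scan and the conversion), assuming the Spark shell's `sc`, the "test" table from the question, and HBase's `TableInputFormat` on the classpath; the helper name `toStringMap` is made up for illustration, and `getNoVersionMap` is used here so that only the most recent version of each cell is kept:

```scala
import java.util.NavigableMap

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

import scala.collection.JavaConverters._

// Hypothetical helper: family -> (qualifier -> value), all decoded as Strings.
def toStringMap(
    row: NavigableMap[Array[Byte], NavigableMap[Array[Byte], Array[Byte]]]
): Map[String, Map[String, String]] =
  row.asScala.map { case (family, columns) =>
    Bytes.toString(family) -> columns.asScala.map { case (qualifier, value) =>
      Bytes.toString(qualifier) -> Bytes.toString(value)
    }.toMap
  }.toMap

// The configuration is only used on the driver to describe the input,
// so it is never captured by a closure.
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "test")

// Each record is (row key, Result). Result is not serializable, so it is
// converted to plain Scala values before anything is shuffled or collected.
val rows = sc
  .newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
  .map { case (_, result) =>
    Bytes.toString(result.getRow) -> toStringMap(result.getNoVersionMap)
  }

rows.count()
```

Although `TableInputFormat` lives in the `mapreduce` package, Spark only uses it to compute splits and read records; no MapReduce job is launched.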
Final point: if you really do want to try to perform random reads in parallel, I believe you might be able to put the HBase table initialization inside the `map`.
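A minimal sketch of that idea, assuming the same `HTable` client API as the question (newer HBase releases obtain tables through `ConnectionFactory` instead) and the Spark shell's `sc`. It uses `mapPartitions` rather than `map`, so the table is opened once per partition instead of once per row, and it ships plain String row keys, so nothing HBase-specific is captured by the closure:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Get, HTable}
import org.apache.hadoop.hbase.util.Bytes

// Plain Strings serialize without trouble; the Get objects are built on the executors.
val rowKeys = Seq("row1", "row2", "row3")

val rowsFound = sc
  .parallelize(rowKeys, rowKeys.length)
  .mapPartitions { keys =>
    // Created on the executor, so neither object is ever serialized.
    val conf = HBaseConfiguration.create()
    val table = new HTable(conf, "test")
    try {
      // Materialize the results before closing the table; the iterator is lazy.
      keys.map { key =>
        val result = table.get(new Get(Bytes.toBytes(key)))
        (key, !result.isEmpty)
      }.toList.iterator
    } finally {
      table.close()
    }
  }
  .filter { case (_, exists) => exists }
  .count()
```

This still performs one random read per key, so the scaling concern raised above applies.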
What happens when you do the following?
@transient val conf = new HBaseConfiguration
UPDATE: Apparently there are other parts of the submitted HBase task that are also not serializable. Each of these will need to be addressed.
Consider whether the entity will have the same meaning/semantics on both sides of the wire. Any connections will certainly not. The HBaseConfiguration should not be serialized. But primitives and simple objects built atop primitives, and not containing context-sensitive data, are fine to include in the serialization.
For context-sensitive entities, including the HBaseConfiguration and any connection-oriented data structures, you should mark them @transient and then instantiate them in the readObject() method with values relevant to the client environment.
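To illustrate the pattern with plain Java serialization, here is a small sketch. `LocalContext` and `TableHandle` are hypothetical stand-ins (for something like an HBaseConfiguration or an open connection, and for the object that wraps it), not HBase classes:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a context-sensitive, non-serializable object.
class LocalContext(val tableName: String)

class TableHandle(val tableName: String) extends Serializable {
  // Skipped by Java serialization; only meaningful on the JVM that created it.
  @transient private var context: LocalContext = new LocalContext(tableName)

  def localContext: LocalContext = context

  // Invoked by Java serialization on the receiving side.
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()                // restores tableName
    context = new LocalContext(tableName) // rebuilt for the local environment
  }
}

object TransientDemo {
  // Serializes and deserializes a handle in memory, as would happen across the wire.
  def roundTrip(handle: TableHandle): TableHandle = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(handle)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[TableHandle] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val original = new TableHandle("test")
    val copy = roundTrip(original)
    // The copy carries a fresh LocalContext, not the original's instance.
    println(copy.localContext.tableName)
    println(copy.localContext ne original.localContext)
  }
}
```

In Scala, a `@transient lazy val` is a common alternative that achieves the same effect without a hand-written `readObject`, since the value is recomputed on first access after deserialization.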