I would like to be able to write new entries into HBase from a distributed (not local) Storm topology. There exist a few GitHub projects that provide either HBase Mappers or pre-made Storm bolts to write Tuples into HBase. These projects provide instructions for executing their samples on the LocalCluster.
The problem that I am running into with both of these projects, and with accessing the HBase API directly from the bolt, is that they all require the hbase-site.xml file to be on the classpath. With the direct API approach (and perhaps with the GitHub projects as well), when you execute HBaseConfiguration.create(), it will try to find the information it needs from an entry on the classpath.
How can I modify the classpath for the Storm bolts to include the HBase configuration file?
Update: Using danehammer's answer, this is how I got it working.
Copy the following files into your ~/.storm directory:
- hbase-common-0.98.0.2.1.2.0-402-hadoop2.jar
- hbase-site.xml
- storm.yaml (NOTE: if you do not copy storm.yaml into that directory, then the storm jar command will NOT put that directory on the classpath; see the storm.py Python script to verify that logic for yourself. It would be nice if this were documented.)
Next, in your topology class's main method, get the HBase Configuration and serialize it:
// Requires org.apache.hadoop.conf.Configuration,
// org.apache.hadoop.hbase.HBaseConfiguration, and org.apache.hadoop.io.DataOutputBuffer
final Configuration hbaseConfig = HBaseConfiguration.create();
final DataOutputBuffer databufHbaseConfig = new DataOutputBuffer();
hbaseConfig.write(databufHbaseConfig);
final byte[] baHbaseConfigSerialized = databufHbaseConfig.getData();
Pass the byte array to your spout class through the constructor. The spout class saves this byte array to a field. (Do not deserialize in the constructor: I found that if the spout has a Configuration field, you get a serialization exception when submitting the topology, because Hadoop's Configuration is not Java-serializable.)
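A minimal sketch of that spout wiring (the class name HBaseScanSpout is illustrative, not from the original post), assuming the Storm 0.9.x backtype.storm API:

```java
import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;

public class HBaseScanSpout extends BaseRichSpout {
    // Keep only the raw bytes as a field: byte[] is serializable,
    // while Hadoop's Configuration is not.
    private final byte[] baHbaseConfigSerialized;

    public HBaseScanSpout(byte[] baHbaseConfigSerialized) {
        // Store the bytes; do NOT deserialize here.
        this.baHbaseConfigSerialized = baHbaseConfigSerialized;
    }

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // Deserialize the Configuration and open the scanner here,
        // on the worker, once the spout has been shipped to the cluster.
    }

    @Override
    public void nextTuple() {
        // Pull the next Result from the scanner and emit it.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("row"));
    }
}
```

The constructor runs on the machine submitting the topology, while open runs on the worker, which is why the deserialization belongs in open.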
In the spout's open method, deserialize the config and access the HBase table:
// Rehydrate the Configuration from the serialized bytes
Configuration hBaseConfiguration = new Configuration();
ByteArrayInputStream bas = new ByteArrayInputStream(baHbaseConfigSerialized);
hBaseConfiguration.readFields(new DataInputStream(bas));
HTable tbl = new HTable(hBaseConfiguration, HBASE_TABLE_NAME);
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("YOUR_COLUMN"));
scnrTbl = tbl.getScanner(scan); // scnrTbl is a ResultScanner field on the spout
Now, in your nextTuple method you can use the scanner to get the next row:
Result rslt = scnrTbl.next();
Extract what you want from the result, and pass those values in some serializable object to the bolts.
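One way to carry those extracted values is a small serializable value object; a hypothetical example (the class and field names are illustrative, not from the original post):

```java
import java.io.Serializable;

// A plain, Java-serializable carrier for values pulled out of an HBase Result,
// safe to pass from the spout to the bolts inside a tuple.
public class WaveformRecord implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String rowKey;
    private final byte[] value;

    public WaveformRecord(String rowKey, byte[] value) {
        this.rowKey = rowKey;
        this.value = value;
    }

    public String getRowKey() { return rowKey; }
    public byte[] getValue() { return value; }
}
```

In nextTuple you would populate one of these from the Result (for example, from rslt.getRow() and rslt.getValue(...)) and emit it in a Values tuple.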
When you deploy a topology with the "storm jar" command, the ~/.storm folder will be on the classpath (see this link under the jar command). If you placed the hbase-site.xml file (or related *-site.xml files) in that folder, HBaseConfiguration.create() during "storm jar" would find that file and correctly return an org.apache.hadoop.conf.Configuration. That Configuration would then need to be serialized and stored within your topology in order to distribute the config around the cluster.