Thanks to the Cloudera distribution, I have an HBase master/datanode + Thrift server running on a local machine, and I can code and test HBase client programs against it with no problem.
However, I now need to use Thrift in production, and I'm not able to find documentation on how to get Thrift running with a production HBase cluster.
From what I understand, I will need to run the hbase-thrift program on the client node since the Thrift program is just another intermediate client to HBase.
So I'm guessing that I have to be able to somehow specify the master node hostname/IP to HBase-Thrift? How would I do this?
Also, any suggestions on how to scale this up in production? Do I only need a setup like this:
Client <-> Thrift client <-> HBase Master <-> Multiple HBase workers
Get it running
You don't have to run a Thrift server on your local machine; it can run anywhere, but the RegionServers are usually a good place*. In your code you then connect to that server.
A Python example:
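A minimal sketch of such a connection, assuming the `thrift` Python package and the `hbase` bindings generated from HBase's Thrift IDL are importable, and that the Thrift server listens on its default port 9090. The helper name `open_client` and the two constants are only for illustration:

```python
THRIFT_HOST = "random-regionserver"  # placeholder: a host running the Thrift server
THRIFT_PORT = 9090                   # default port of the HBase Thrift server


def open_client(host=THRIFT_HOST, port=THRIFT_PORT):
    """Open a buffered Thrift transport to host:port; return (client, transport)."""
    # Imported inside the function so this module still loads on machines
    # where the Thrift library or the generated bindings are not installed.
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase  # assumed: generated with `thrift --gen py Hbase.thrift`

    transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = Hbase.Client(protocol)
    transport.open()
    return client, transport
```

Calling `open_client()` and then `client.getTableNames()` should list the tables; call `transport.close()` when you're done.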
Where you'd obviously replace random-regionserver with one of the servers you're running the Thrift server on.
That server gets its configuration from the usual places. If you're using CDH, you'll find the configuration in /etc/hbase/conf/hbase-site.xml, and you'll need to add the property hbase.zookeeper.quorum so the Thrift server can find the cluster through ZooKeeper.
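The property entry in hbase-site.xml would look roughly like this. The ZooKeeper host names are placeholders; use the hosts of your own ZooKeeper ensemble:

```xml
<property>
  <name>hbase.zookeeper.quorum</name>
  <!-- placeholder host names, comma-separated -->
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
```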
When you start the Thrift server from the downloaded Apache distribution, this is similar, except that hbase-site.xml will probably sit in a different directory.
Scaling it up
One easy way to scale up right now is to keep a list of all the RegionServers in your Thrift client and pick one at random on connect. Alternatively, create multiple connections and use a random one each time. Some language bindings (e.g. PHP) have a TSocketPool where you can pass in all your servers. Otherwise there's some manual work you need to do.
Using this technique, all reads and writes should be more or less distributed across the Thrift servers in your cluster. Each read or write operation arriving at a Thrift server will still be translated into a Java API call by the Thrift server, which then opens a network connection to the proper RegionServer(s) to perform the requested action.
That means you won't get performance as good as you would with the Java API. It might help to cache region locations yourself and hit the appropriate Thrift server, but even then an additional Java API call is made, even if it ends up on the local server. HBASE-4460 would help with this scenario, but it is not included in CDH3u4 or CDH4.
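The pick-one-at-random approach for bindings without a socket pool could be sketched in Python roughly as follows. This is only an illustration, not part of any Thrift binding: `connect_to_any` and its parameters are hypothetical names, and `connect` stands for whatever function opens a connection to a single Thrift server.

```python
import random


def connect_to_any(servers, connect, retry_on=(OSError,), rng=random):
    """Try the servers in random order; return (server, connection) for the first that works.

    servers  -- iterable of host names running a Thrift server
    connect  -- callable taking one host name and returning an open connection
    retry_on -- exception types that mean "this server is down, try the next one"
    """
    candidates = list(servers)
    rng.shuffle(candidates)  # random order spreads clients across the servers
    last_error = None
    for server in candidates:
        try:
            return server, connect(server)
        except retry_on as exc:
            last_error = exc  # remember the failure and move on to the next server
    raise ConnectionError("no Thrift server reachable: %r" % (candidates,)) from last_error
```

With the Thrift Python library you would pass `retry_on=(TTransportException,)` (from `thrift.transport.TTransport`), since that is what a failed `transport.open()` raises.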
* There is an issue, HBASE-4460, which actually embeds a Thrift server in a RegionServer.