I am researching Titan (on HBase) as a candidate for a large, distributed graph database. We require both OLTP access (fast, multi-hop queries over the graph) and OLAP access (loading all - or at least a large portion - of the graph into Spark for analytics).
From what I understand, I can use Gremlin Server to handle the OLTP-style queries, where the result set will be small. Since the queries will be generated by a UI, I can use a driver API to interface with Gremlin Server. So far, so good.
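To make the OLTP side concrete, this is roughly how I expect the UI-facing service to talk to Gremlin Server via the TinkerPop driver (the hostname and the query are placeholders of my own, not anything from Titan's docs):

```scala
import org.apache.tinkerpop.gremlin.driver.Cluster
import scala.collection.JavaConverters._

// Connect to a Gremlin Server fronting Titan; host/port are placeholders.
val cluster = Cluster.build("gremlin-server.example.com").port(8182).create()
val client  = cluster.connect()

// A small multi-hop query of the kind the UI would generate; the result
// set is expected to be small enough to return through the server.
val results = client.submit(
  "g.V().has('user', 'name', 'alice').out('follows').out('follows').values('name')")

results.iterator().asScala.foreach(r => println(r.getString))

client.close()
cluster.close()
```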
The problem concerns the OLAP use case. Since the data in HBase will be co-located with the Spark executors, it would be efficient to read the data into Spark using a Hadoop InputFormat. It would be inefficient (impossible, in fact, given the projected graph size) to execute a Gremlin query on the driver and then distribute the results back out to the executors.
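For illustration, reading the underlying table region-by-region with the stock HBase TableInputFormat would look something like this (the table name "titan" is a placeholder for whatever the Titan instance was configured with; note this yields raw bytes, not a property graph):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("titan-raw-scan"))

// Point the input format at the HBase table that backs Titan.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "titan")

// Each HBase region is read by the executor co-located with it, so no
// graph data funnels through the driver.
val rows = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
```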
The best guidance I have found is an inconclusive discussion in the Titan GitHub repo (https://github.com/thinkaurelius/titan/issues/1045), which suggests that (at least for a Cassandra back-end) the standard TitanCassandraInputFormat should work for reading Titan tables. Nothing is claimed about HBase back-ends.
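For what it's worth, titan-hadoop appears to ship an HBase counterpart (com.thinkaurelius.titan.hadoop.formats.hbase.HBaseInputFormat), so I would guess the HBase analogue of that discussion is a TinkerPop HadoopGraph properties file along these lines; the hostnames and table name are placeholders, and I have not verified that this actually works:

```
# Hypothetical HadoopGraph config for reading Titan-on-HBase via SparkGraphComputer.
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
titanmr.ioformat.conf.storage.backend=hbase
titanmr.ioformat.conf.storage.hostname=zookeeper.example.com
titanmr.ioformat.conf.storage.hbase.table=titan
spark.master=yarn-client
spark.serializer=org.apache.spark.serializer.KryoSerializer
```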
However, reading about the underlying Titan data model (http://s3.thinkaurelius.com/docs/titan/current/data-model.html), it appears that parts of the "raw" graph data are serialized, with no explanation of how to reconstruct a property graph from the stored contents.
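To show what "serialized" means in practice, here is what extending the raw scan above gives me. My understanding from the data-model page (which may be wrong) is that each row key is a vertex id and each cell is one serialized relation (a property or an edge), and that decoding the cell bytes would require Titan's internal serializers:

```scala
import org.apache.hadoop.hbase.util.Bytes

// Continuing from the `rows` RDD above: the row key should be a Titan
// vertex id and each cell one serialized relation. The bytes themselves
// are opaque without Titan's internal serialization classes, which is
// exactly the undocumented part.
val rawVertices = rows.map { case (key, result) =>
  val vertexIdBytes = key.copyBytes()
  val relationCount = result.rawCells().length
  (Bytes.toStringBinary(vertexIdBytes), relationCount)
}
```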
And so, I have two questions:
1) Is everything that I have stated above correct, or have I missed / misunderstood anything?
2) Has anyone managed to read a "raw" Titan graph from HBase and reconstruct it in Spark (either in GraphX or as DataFrames, RDDs, etc.)? If so, can you give me any pointers?
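For concreteness on (2), the end state I am after looks like the sketch below (the record types are hypothetical stand-ins for decoded Titan elements); assembling the GraphX graph is trivial once decoded vertex and edge RDDs exist, so the decoding step is the real question:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical record types for decoded Titan vertices and edges.
case class VProps(label: String, props: Map[String, String])
case class EProps(label: String, props: Map[String, String])

// The easy part, once decoding works.
def assemble(vertices: RDD[(VertexId, VProps)],
             edges: RDD[Edge[EProps]]): Graph[VProps, EProps] =
  Graph(vertices, edges)
```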