I am researching Titan (on HBase) as a candidate for a large, distributed graph database. We require both OLTP access (fast, multi-hop queries over the graph) and OLAP access (loading all - or at least a large portion - of the graph into Spark for analytics).
From what I understand, I can use Gremlin Server to handle the OLTP-style queries, where the result set will be small. Since the queries will be generated by a UI, I can use a driver API to interface with Gremlin Server. So far, so good.
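To make the OLTP side concrete, this is roughly how I expect the UI-facing service to talk to Gremlin Server via the TinkerPop driver (the hostname and the query are placeholders of my own, not anything from Titan's docs):

```scala
import org.apache.tinkerpop.gremlin.driver.Cluster
import scala.collection.JavaConverters._

// Connect to a Gremlin Server fronting Titan; host/port are placeholders.
val cluster = Cluster.build("gremlin-server.example.com").port(8182).create()
val client  = cluster.connect()

// A small multi-hop query of the kind the UI would generate; the result
// set is expected to be small enough to return through the server.
val results = client.submit(
  "g.V().has('user', 'name', 'alice').out('follows').out('follows').values('name')")

results.iterator().asScala.foreach(r => println(r.getString))

client.close()
cluster.close()
```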
The problem concerns the OLAP use case. Since the data in HBase will be co-located with the Spark executors, it would be efficient to read the data into Spark using a Hadoop InputFormat. It would be inefficient (impossible, in fact, given the projected graph size) to execute a Gremlin query on the driver and then distribute the results back out to the executors.
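For illustration, reading the underlying table region-by-region with the stock HBase TableInputFormat would look something like this (the table name "titan" is a placeholder for whatever the Titan instance was configured with; note this yields raw bytes, not a property graph):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("titan-raw-scan"))

// Point the input format at the HBase table that backs Titan.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "titan")

// Each HBase region is read by the executor co-located with it, so no
// graph data funnels through the driver.
val rows = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
```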
The best guidance I have found is an inconclusive discussion in the Titan GitHub repo (https://github.com/thinkaurelius/titan/issues/1045), which suggests that (at least for a Cassandra back-end) the standard TitanCassandraInputFormat should work for reading Titan tables. Nothing is claimed about HBase back-ends.
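For what it's worth, titan-hadoop appears to ship an HBase counterpart (com.thinkaurelius.titan.hadoop.formats.hbase.HBaseInputFormat), so I would guess the HBase analogue of that discussion is a TinkerPop HadoopGraph properties file along these lines; the hostnames and table name are placeholders, and I have not verified that this actually works:

```
# Hypothetical HadoopGraph config for reading Titan-on-HBase via SparkGraphComputer.
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
titanmr.ioformat.conf.storage.backend=hbase
titanmr.ioformat.conf.storage.hostname=zookeeper.example.com
titanmr.ioformat.conf.storage.hbase.table=titan
spark.master=yarn-client
spark.serializer=org.apache.spark.serializer.KryoSerializer
```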
However, reading about the underlying Titan data model (http://s3.thinkaurelius.com/docs/titan/current/data-model.html), it appears that parts of the "raw" graph data are serialized, with no explanation of how to reconstruct a property graph from the stored contents.
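To show what "serialized" means in practice, here is what extending the raw scan above gives me. My understanding from the data-model page (which may be wrong) is that each row key is a vertex id and each cell is one serialized relation (a property or an edge), and that decoding the cell bytes would require Titan's internal serializers:

```scala
import org.apache.hadoop.hbase.util.Bytes

// Continuing from the `rows` RDD above: the row key should be a Titan
// vertex id and each cell one serialized relation. The bytes themselves
// are opaque without Titan's internal serialization classes, which is
// exactly the undocumented part.
val rawVertices = rows.map { case (key, result) =>
  val vertexIdBytes = key.copyBytes()
  val relationCount = result.rawCells().length
  (Bytes.toStringBinary(vertexIdBytes), relationCount)
}
```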
And so, I have two questions:
1) Is everything that I have stated above correct, or have I missed / misunderstood anything?
2) Has anyone managed to read a "raw" Titan graph from HBase and reconstruct it in Spark (either in GraphX or as DataFrames, RDDs, etc.)? If so, can you give me any pointers?
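For concreteness on (2), the end state I am after looks like the sketch below (the record types are hypothetical stand-ins for decoded Titan elements); assembling the GraphX graph is trivial once decoded vertex and edge RDDs exist, so the decoding step is the real question:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical record types for decoded Titan vertices and edges.
case class VProps(label: String, props: Map[String, String])
case class EProps(label: String, props: Map[String, String])

// The easy part, once decoding works.
def assemble(vertices: RDD[(VertexId, VProps)],
             edges: RDD[Edge[EProps]]): Graph[VProps, EProps] =
  Graph(vertices, edges)
```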