I am trying to load millions of nodes from CSV files into Titan 1.0.0 with a Cassandra backend, using Java. How should I load them?
I saw that BulkLoaderVertexProgram can do bulk loading, but it reads the data in GraphSON format.
How do I start writing Java code to bulk load the data from CSV? Can you point me to a reference I can start from?
Do I need to have Spark/Hadoop running on my system to use SparkGraphComputer, which is used with BulkLoaderVertexProgram?
I have not been able to start writing code because I do not understand how to read data from CSV with BulkLoaderVertexProgram. Can you provide some links to get started with the Java code?
Thanks.
This was cross-posted on the Titan mailing list...
If you're looking to use Java code, check out Alex's and Matthew's Marvel graph example:
https://github.com/awslabs/dynamodb-titan-storage-backend/blob/1.0.0/src/main/java/com/amazon/titan/example/MarvelGraphFactory.java
It creates a Titan schema, parses a CSV, and then uses basic Gremlin addVertex() and addEdge() to build the graph. You'll notice that the TitanGraph isn't instantiated in the factory itself, so even though it is inside a Titan-DynamoDB example, you can use this with any Titan backend (Cassandra, HBase, Berkeley).
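As a rough illustration of the "parse a CSV, then add vertices" pattern described above, here is a minimal sketch that uses only the JDK. The names CsvBatchLoader, VertexSink, and the batch size are hypothetical and are not part of Titan or of the linked example. With Titan, the sink's addVertex would wrap graph.addVertex(...) and its commit would wrap graph.tx().commit(). The comma split is deliberately naive and assumes no quoted fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: streams CSV rows into a graph in fixed-size batches. */
class CsvBatchLoader {

    /**
     * Hypothetical abstraction over the graph. With Titan, addVertex would wrap
     * graph.addVertex(...) and commit would wrap graph.tx().commit().
     */
    interface VertexSink {
        void addVertex(Map<String, String> properties);
        void commit();
    }

    /** Reads a header line plus data rows; returns the number of vertices added. */
    static long load(Reader csv, VertexSink sink, int batchSize) {
        try (BufferedReader in = new BufferedReader(csv)) {
            String headerLine = in.readLine();
            if (headerLine == null) {
                return 0; // empty input, nothing to load
            }
            // Naive split: assumes fields contain no quoted commas.
            String[] header = headerLine.split(",", -1);
            long count = 0;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] cells = line.split(",", -1);
                Map<String, String> props = new LinkedHashMap<>();
                for (int i = 0; i < header.length && i < cells.length; i++) {
                    props.put(header[i].trim(), cells[i].trim());
                }
                sink.addVertex(props);
                // Commit every batchSize rows so a single transaction stays small.
                if (++count % batchSize == 0) {
                    sink.commit();
                }
            }
            if (count % batchSize != 0) {
                sink.commit(); // flush the final partial batch
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory demo; a real run would pass a FileReader for the CSV file.
        String csv = "name,age\nalice,30\nbob,41\ncarol,27\n";
        List<Map<String, String>> added = new ArrayList<>();
        int[] commits = {0};
        VertexSink sink = new VertexSink() {
            public void addVertex(Map<String, String> p) { added.add(p); }
            public void commit() { commits[0]++; }
        };
        long n = load(new StringReader(csv), sink, 2);
        System.out.println(n + " vertices, " + commits[0] + " commits");
    }
}
```

Committing in batches rather than once per row or once at the very end is a common compromise for large loads. The right batch size depends on the backend and would need to be tuned.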
If your graph data is in the low millions, you could use a Titan-BerkeleyJE graph on your own machine, which might be an easier backend to use at first rather than a Cassandra cluster. I'd recommend that you do not get too caught up on loading a lot of data initially -- get comfortable with how to use Titan and TinkerPop with OLTP first and then move into OLAP approaches.
You will probably need a custom Java program to read your CSV files and load the graph from them.
If you want to use an OGM (object-graph mapper), meaning you define POJO classes as the data model for your data, you could use Peapod to create that data model easily.
Here is an example data model:
@Vertex
public abstract class Person {
    public abstract String getName();
    public abstract void setName(String name);

    public abstract List<Knows> getKnows();
    public abstract Knows getKnows(Person person);
    public abstract Knows addKnows(Person person);
    public abstract Knows removeKnows(Person person);
}

@Edge
public abstract class Knows {
    public abstract void setYears(int years);
    public abstract int getYears();
}
To load data, you could do something like this:
FramedGraph g = new FramedGraph(TitanFactory.open("path_to_prop_file"));
Person person1 = g.addVertex(Person.class);
person1.setName("M-T-A");
Person person2 = g.addVertex(Person.class);
person2.setName("Amnesiac");
// person1 knows person2
Knows pKnowsP2 = person1.addKnows(person2);
pKnowsP2.setYears(1);
Easier than you thought? Hope so.
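How about converting the CSV into GraphML and then loading it all at once from the Gremlin console? Note that loadGraphML() as shown below is the Titan 0.x (TinkerPop 2) call; on Titan 1.0.0 (TinkerPop 3) the equivalent is graph.io(IoCore.graphml()).readGraph('data/graph-of-the-gods.xml') followed by graph.tx().commit().
gremlin> g = TitanFactory.open('bin/cassandra.local')
gremlin> g.loadGraphML('data/graph-of-the-gods.xml')
gremlin> g.commit()
Wouldn't that be more performant than making a Gremlin call for each addVertex/addEdge?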
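The CSV-to-GraphML conversion suggested above could be sketched roughly as follows, using only the JDK. The class name CsvToGraphMl and the column layout are assumptions: column 0 is treated as a unique node id and every other column becomes a string property. Edges are left out, and whether a given GraphML reader needs extra keys for labels is not covered here.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: turns parsed CSV rows into a GraphML document of nodes. */
class CsvToGraphMl {

    /** Escapes the characters that are unsafe in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /**
     * rows.get(0) is the header. Assumption: column 0 holds a unique node id
     * and every other column becomes a string property.
     */
    static String convert(List<String[]> rows) {
        String[] header = rows.get(0);
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<graphml xmlns=\"http://graphml.graphdrawing.org/xmlns\">\n");
        // Declare one <key> per property column.
        for (int c = 1; c < header.length; c++) {
            String key = escape(header[c]);
            xml.append("  <key id=\"").append(key)
               .append("\" for=\"node\" attr.name=\"").append(key)
               .append("\" attr.type=\"string\"/>\n");
        }
        xml.append("  <graph id=\"G\" edgedefault=\"directed\">\n");
        // Emit one <node> per data row, with one <data> element per property.
        for (String[] row : rows.subList(1, rows.size())) {
            xml.append("    <node id=\"").append(escape(row[0])).append("\">\n");
            for (int c = 1; c < header.length && c < row.length; c++) {
                xml.append("      <data key=\"").append(escape(header[c])).append("\">")
                   .append(escape(row[c])).append("</data>\n");
            }
            xml.append("    </node>\n");
        }
        xml.append("  </graph>\n</graphml>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"id", "name"},
            new String[] {"1", "Hercules"},
            new String[] {"2", "A & B"});
        System.out.print(convert(rows));
    }
}
```

For millions of rows, building the whole document in a StringBuilder would likely use too much memory; writing each node to a file as it is produced would be the obvious adjustment.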