How to load millions of vertices from CSV into Tit

Posted 2020-04-20 05:35

I am trying to load millions of vertices from CSV files into Titan 1.0.0 with a Cassandra backend, using Java. How should I load them?

I see that BulkLoaderVertexProgram can do bulk loading, but it reads the data from GraphSON format.

How do I start writing Java code to bulk load the data from CSV? Can you point me to a reference I can use as a starting point?

Do I need Spark/Hadoop running on my system to use SparkGraphComputer, which BulkLoaderVertexProgram uses?

I have not been able to start writing code because I do not understand how to read data from CSV with BulkLoaderVertexProgram. Can you provide some links to get started with the Java code?

Thanks.

3 answers
#2 · 2020-04-20 06:11

You probably need custom Java code to read your CSV files and load the graph from them.

If you want to use an OGM (object-graph mapper), meaning you create POJO classes as the data model for your data, you could use Peapod to define that model easily.

Here is an example of the model classes:

@Vertex
public abstract class Person {
  public abstract String getName();
  public abstract void setName(String name);

  public abstract List<Knows> getKnows();
  public abstract Knows getKnows(Person person);
  public abstract Knows addKnows(Person person);
  public abstract Knows removeKnows(Person person);
}

@Edge
public abstract class Knows {
  public abstract void setYears(int years);
  public abstract int getYears();
}

To load data, you would do something like this:

FramedGraph g = new FramedGraph(TitanFactory.open("path_to_prop_file"));
Person person1 = g.addVertex(Person.class);
person1.setName("M-T-A");

Person person2 = g.addVertex(Person.class);
person2.setName("Amnesiac");

// person1 knows person2
Knows p1KnowsP2 = person1.addKnows(person2);
p1KnowsP2.setYears(1);

Easier than you thought? Hope so.

叛逆
#3 · 2020-04-20 06:12

How about converting the CSV to GraphML and then loading it all at once from the Gremlin console?

gremlin> graph = TitanFactory.open('bin/cassandra.local')
gremlin> graph.io(IoCore.graphml()).readGraph('data/graph-of-the-gods.xml')
gremlin> graph.tx().commit()

Wouldn't that be more performant than making a Gremlin call for each addVertex/addEdge?
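For the conversion step, a minimal Java sketch might look like the following. It is only an illustration under several assumptions that are not from this thread: the first CSV column is the vertex id, every other column becomes a string property, fields contain no quoted commas, and only vertices (no edges) are written. The class name CsvToGraphMl and its methods are hypothetical.

```java
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical helper: turns CSV lines into a GraphML document with one <node> per row.
class CsvToGraphMl {

    // Escapes the five XML special characters so values are safe in text and attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                .replace("\"", "&quot;").replace("'", "&apos;");
    }

    // Assumption: line 1 is the header, column 0 is the vertex id, fields have no quoted commas.
    static void convert(Iterator<String> lines, PrintWriter out) {
        if (!lines.hasNext()) {
            throw new IllegalArgumentException("CSV has no header line");
        }
        String[] header = lines.next().split(",", -1);
        out.print("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        out.print("<graphml xmlns=\"http://graphml.graphdrawing.org/xmlns\">\n");
        // Declare every non-id column as a string-typed node property.
        for (int i = 1; i < header.length; i++) {
            String key = escape(header[i]);
            out.print("  <key id=\"" + key + "\" for=\"node\" attr.name=\"" + key
                    + "\" attr.type=\"string\"/>\n");
        }
        out.print("  <graph id=\"G\" edgedefault=\"directed\">\n");
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] values = line.split(",", -1);
            if (values.length != header.length) {
                throw new IllegalArgumentException("Unexpected field count in: " + line);
            }
            out.print("    <node id=\"" + escape(values[0]) + "\">\n");
            for (int i = 1; i < values.length; i++) {
                out.print("      <data key=\"" + escape(header[i]) + "\">"
                        + escape(values[i]) + "</data>\n");
            }
            out.print("    </node>\n");
        }
        out.print("  </graph>\n</graphml>\n");
        out.flush();
    }

    public static void main(String[] args) {
        // Tiny in-memory CSV; a real run could pass Files.lines(path).iterator() instead.
        Iterator<String> csv = Arrays.asList("id,name", "1,alice", "2,bob").iterator();
        convert(csv, new PrintWriter(System.out));
    }
}
```

The rows are streamed one at a time, so the whole CSV never has to fit in memory. Whether the resulting file then loads faster than per-row addVertex calls would need to be measured on your own data.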

爷的心禁止访问
#4 · 2020-04-20 06:29

This was cross-posted on the Titan mailing list...

If you're looking to use Java code, check out Alex's and Matthew's Marvel graph example:

https://github.com/awslabs/dynamodb-titan-storage-backend/blob/1.0.0/src/main/java/com/amazon/titan/example/MarvelGraphFactory.java

It creates a Titan schema, parses a CSV, and then uses basic Gremlin addVertex() and addEdge() to build the graph. You'll notice that the TitanGraph isn't instantiated in the factory itself, so even though it is inside a Titan-DynamoDB example, you can use this with any Titan backend (Cassandra, HBase, Berkeley).
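The parse-then-add pattern described above can be sketched roughly as follows. This is not the code from MarvelGraphFactory. It is a minimal illustration that runs without Titan on the classpath by hiding the graph behind a hypothetical VertexSink interface. With Titan, an implementation would presumably forward to graph.addVertex(...) and graph.tx().commit(). The class name, the batch size, and the naive comma split (no quoted fields) are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: stream CSV rows into a graph, committing in batches.
class CsvBatchLoader {

    // Hypothetical abstraction over the graph. A Titan-backed version might call
    // graph.addVertex(T.label, label, key, value, ...) and graph.tx().commit().
    interface VertexSink {
        void addVertex(String label, String[] header, String[] values);
        void commit();
    }

    // Reads a header line plus data rows and commits every batchSize rows.
    // Returns the number of data rows loaded.
    static long load(Iterator<String> lines, String label, int batchSize, VertexSink sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        if (!lines.hasNext()) {
            return 0;
        }
        String[] header = lines.next().split(",", -1);
        long count = 0;
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] values = line.split(",", -1);
            if (values.length != header.length) {
                throw new IllegalArgumentException("Unexpected field count in: " + line);
            }
            sink.addVertex(label, header, values);
            count++;
            if (count % batchSize == 0) {
                sink.commit(); // keep each transaction small
            }
        }
        if (count % batchSize != 0) {
            sink.commit(); // flush the final partial batch
        }
        return count;
    }

    public static void main(String[] args) {
        final List<String> log = new ArrayList<>();
        // In-memory sink that only records calls, standing in for a real graph.
        VertexSink sink = new VertexSink() {
            public void addVertex(String label, String[] header, String[] values) {
                log.add("add " + values[0]);
            }
            public void commit() {
                log.add("commit");
            }
        };
        Iterator<String> csv = Arrays.asList("name,age", "alice,30", "bob,25", "carol,41").iterator();
        long n = load(csv, "person", 2, sink);
        System.out.println(n + " rows: " + log);
        // prints: 3 rows: [add alice, add bob, commit, add carol, commit]
    }
}
```

Committing every few thousand rows rather than once per row or once at the end is a common compromise between transaction overhead and memory use. The right batch size depends on the backend and row width, so treat any number as a starting point to tune.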

If your graph data is in the low millions, you could use a Titan-BerkeleyJE graph on your own machine, which might be an easier backend to use at first rather than a Cassandra cluster. I'd recommend that you do not get too caught up on loading a lot of data initially -- get comfortable with how to use Titan and TinkerPop with OLTP first and then move into OLAP approaches.
