I'm trying to insert a large number of nodes (~500,000) into a (non-embedded) neo4j database by executing cypher commands using the py2neo python module (py2neo.cypher.execute). Eventually I need to remove the dependence on py2neo, but I'm using it at the moment until I learn more about cypher and neo4j.
I have two node types A and B, and the vast majority of nodes are of type A. There are two possible relationships r1 and r2, such that A-[r1]-A and A-[r2]-B. Each node of type A will have 0 - 100 r1 relationships, and each node of type B will have 1 - 5000 r2 relationships.
At the moment I am inserting nodes by building up large CREATE statements. For example I might have a statement
CREATE (:A {uid:1, attr:5})-[:r1]-(:A {uid:2, attr:5})-[:r1]-...
where ... might be another 5000 or so nodes and relationships forming a linear chain in the graph. This works okay, but it's pretty slow. I'm also indexing these nodes using
CREATE INDEX ON :A(uid)
After I've added all the type A nodes, I add the type B nodes using CREATE statements again. Finally, I am trying to add the r2 relationships using a statement like
MATCH c:B, m:A where c.uid=1 AND (m.uid=2 OR m.uid=5 OR ...)
CREATE (m)-[:r2]->(c)
where ... could represent a few thousand OR conditions. This seems really slow, adding only a few relationships per second.
So, is there a better way to do this? Am I completely off track here? I looked at this question, but it doesn't explain how to use Cypher to efficiently load the nodes. Everything else I look at seems to use Java, without showing the actual Cypher queries that could be used.
Don't create the index until the end (in 2.0). It will slow down node creation.
Are you using parameters in your Cypher?
I imagine you're losing a lot of cypher parsing time unless your cypher is exactly the same each time with parameters. If you can model it to be that, you'll see a marked performance increase.
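As a rough illustration of what "exactly the same each time with parameters" could look like, here is a minimal sketch in plain Python. The names CREATE_A and node_statements are made up for this example, and it only builds (query, params) pairs; how you hand them to py2neo or the REST API is left out. It assumes the {name} parameter syntax of Neo4j 2.x.

```python
# Sketch only: the query text never changes, only the parameter map does.
# Neo4j 2.x writes parameters as {name}; later versions use $name instead.
CREATE_A = "CREATE (n:A {uid: {uid}, attr: {attr}})"


def node_statements(rows):
    """Yield (query, params) pairs that all share one query string.

    `rows` is a hypothetical iterable of (uid, attr) tuples.
    """
    for uid, attr in rows:
        yield CREATE_A, {"uid": uid, "attr": attr}


statements = list(node_statements([(1, 5), (2, 5), (3, 7)]))
# Every pair reuses CREATE_A, so the server sees one query text three times
# rather than three different literal-filled queries.
```

Compare this with building a new string such as "CREATE (:A {uid:1, attr:5})" for every node: there each request carries a different query text, so nothing parsed earlier can be reused.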
You're already sending fairly hefty chunks in your cypher request, but the batch request API will let you send more than one in one REST request, which might be faster (try it!).
Finally, if this is a one time import, you might consider using the batch-import tool--it can burn through 500K nodes in a few minutes even on bad hardware... then you can upgrade the database files (I don't think it can create 2.0 files yet, but that may be coming shortly if not), and create your labels/index via Cypher.
Update: I just noticed your MATCH statement at the end. You shouldn't do it this way--do one relationship at a time instead of using the OR for the ids. This will probably help a lot--and make sure you use parameters for the uids. Cypher 2.0 doesn't seem to be able to do index lookups with OR, even when you use an index hint. Maybe this will come later.
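To make the "one relationship at a time" idea concrete, here is a small sketch that replaces the long OR list with one parameterized statement per relationship. R2_QUERY and r2_statements are names invented for this example, the {name} parameter syntax is the Neo4j 2.x form, and the code only prepares the statements without sending them.

```python
# Sketch only: one fixed query, executed once per (B, A) pair.
# Both lookups are plain equality checks on uid, which is the case an
# index on :A(uid) (and one on :B(uid), if you add it) is meant to serve.
R2_QUERY = (
    "MATCH (c:B), (m:A) "
    "WHERE c.uid = {c_uid} AND m.uid = {m_uid} "
    "CREATE (m)-[:r2]->(c)"
)


def r2_statements(b_uid, a_uids):
    """Return one (query, params) pair per r2 relationship to create."""
    return [(R2_QUERY, {"c_uid": b_uid, "m_uid": a_uid}) for a_uid in a_uids]


# Equivalent of: c.uid=1 AND (m.uid=2 OR m.uid=5 OR m.uid=9)
pairs = r2_statements(1, [2, 5, 9])
```

This produces many more statements than the single OR query, so it works best combined with some way of sending several statements per request.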
Update Dec 2013: 2.0 has the Cypher transactional endpoint, which I've seen great throughput improvements on. I've been able to send 20-30k Cypher statements/second, using "exec" sizes of 100-200 statements, and transaction sizes of 1000-10000 statements total. Very effective for speeding up loading over Cypher.
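Here is a rough sketch of how the "exec" batching could be prepared on the client side. It only builds the JSON request bodies in the {"statements": [...]} shape that the transactional endpoint accepts; the helper names, the example query, and the batch size of 200 are my own choices for illustration, and the actual HTTP calls (opening a transaction, posting each body to it, then committing) are left out.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def transaction_payloads(statements, exec_size=200):
    """Turn (query, params) pairs into JSON bodies, one per request.

    Each body holds up to `exec_size` statements; all bodies are meant
    to be posted to the same open transaction before committing.
    """
    for chunk in chunked(statements, exec_size):
        yield json.dumps({
            "statements": [
                {"statement": query, "parameters": params}
                for query, params in chunk
            ]
        })


# Hypothetical workload: 450 parameterized node creations.
QUERY = "CREATE (n:A {uid: {uid}})"
stmts = [(QUERY, {"uid": i}) for i in range(450)]
bodies = list(transaction_payloads(stmts, exec_size=200))
```

With these numbers you end up with three request bodies (200, 200 and 50 statements). Committing after every few thousand statements rather than after every request keeps the transaction sizes in the range mentioned above.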