I have a json file with data of around 1.4 million nodes and I wanted to construct a Neo4j graph database for that. I tried to use py2neo's batch submit function. My code is as follows:
# the variable words is a list containing node names
from py2neo import neo4j

graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
batch = neo4j.WriteBatch(graph_db)
nodedict = {}
# I decided to use a dictionary because I would be creating relationships
# by referring to the dictionary entries later
for i in words:
    nodedict[i] = batch.create({"name": i})
results = batch.submit()
The error shown is as follows:
Traceback (most recent call last):
  File "test.py", line 36, in <module>
    results = batch.submit()
  File "/usr/lib/python2.6/site-packages/py2neo/neo4j.py", line 2116, in submit
    for response in self._submit()
  File "/usr/lib/python2.6/site-packages/py2neo/neo4j.py", line 2085, in _submit
    for id_, request in enumerate(self.requests)
  File "/usr/lib/python2.6/site-packages/py2neo/rest.py", line 427, in _send
    return self._client().send(request)
  File "/usr/lib/python2.6/site-packages/py2neo/rest.py", line 364, in send
    return Response(request.graph_db, rs.status, request.uri, rs.getheader("Loc$
  File "/usr/lib/python2.6/site-packages/py2neo/rest.py", line 278, in __init__
    raise SystemError(body)
SystemError: None
Can anybody please tell me what exactly is happening here? Does it have anything to do with the fact that the batch query is pretty large? If so, what can be done? Thanks in advance! :)
So here's what I figured out (Thanks to this question: py2neo - Neo4j - System Error - Create Batch Nodes/Relationships):
The py2neo batch submit function has its own limits on how many jobs a single batch can carry. While I wasn't able to find an exact figure for the upper limit, I capped my batches at 5000 queries each. So I ran the following piece of code:
# the variable words is a list containing node names
from py2neo import neo4j

graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
batch = neo4j.WriteBatch(graph_db)
nodedict = {}
# I decided to use a dictionary because I would be creating relationships
# by referring to the dictionary entries later
for index, i in enumerate(words):
    nodedict[i] = batch.create({"name": i})
    if (index + 1) % 5000 == 0:  # flush every 5000 jobs
        batch.submit()
        batch = neo4j.WriteBatch(graph_db)  # as stated by Nigel below, I'm creating a new batch
batch.submit()  # for the final (partial) batch
This way, I sent batch requests of 5k queries each and was able to get my entire graph created!
There's no real way to describe a limit on the number of jobs that a batch can contain - it can vary wildly based on a number of factors. The best bet in general is to experiment to find an optimum size for your use case and go with that. It looks like this is what you are already doing :-)
In terms of your solution, I'd recommend one tweak. Batch objects weren't designed to be reused, so instead of clearing the batch after every submission, simply create a new one. The ability to submit a batch multiple times will be removed in the next version of py2neo anyway.
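For illustration, here's a minimal sketch of that pattern against the py2neo 1.x WriteBatch API used in the question; the helper name submit_in_chunks and the default chunk size are my own choices, not part of py2neo:

# Sketch of the "fresh batch per chunk" pattern (py2neo 1.x API assumed;
# the helper name and chunk size are illustrative only)
from py2neo import neo4j

def submit_in_chunks(graph_db, words, chunk_size=5000):
    results = []
    batch = neo4j.WriteBatch(graph_db)
    pending = 0
    for name in words:
        batch.create({"name": name})
        pending += 1
        if pending == chunk_size:
            results.extend(batch.submit())
            batch = neo4j.WriteBatch(graph_db)  # start a new batch rather than reusing the old one
            pending = 0
    if pending:
        results.extend(batch.submit())  # submit whatever is left over
    return results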
I had the same issue after I started using batch create via graph.create(*alist). The answers above pointed me in the right direction, and I ended up using this snippet, inspired by https://gist.github.com/anonymous/6293739 from the question py2neo - Neo4j - System Error - Create Batch Nodes/Relationships:
chunk_size = 500
chunks = (alist[pos:pos + chunk_size] for pos in xrange(0, len(alist), chunk_size))
for c in chunks:
    graph.create(*c)
PS
py2neo==2.0.7
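For completeness, an end-to-end usage of that chunked graph.create call under py2neo 2.0 might look like the sketch below; the "Word" label and the words list are assumptions for illustration, and Graph() connects to the default http://localhost:7474/db/data/ URI:

# Illustrative only: build Node objects and create them in chunks (py2neo 2.0)
from py2neo import Graph, Node

graph = Graph()  # default local Neo4j server
alist = [Node("Word", name=w) for w in words]  # "Word" label is an assumption

chunk_size = 500
for pos in xrange(0, len(alist), chunk_size):
    graph.create(*alist[pos:pos + chunk_size])  # one HTTP batch per chunk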