Batching in py2neo

Published 2019-07-20 13:28

Question:

I have started working with Neo4j and have been exploring batch processing, but I am having problems creating relations between nodes.

My problem is the following. I have a list of websites and users that I read from a file. I may have repeated websites and users in that file, so I do not want to insert new nodes for those repeated entries. But as the file is big, I want to batch the processing of the nodes and relations.

Basically, I have these two functions to create nodes and relations and add them to the batch.

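from py2neo import neo4j, rel

graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
batch = neo4j.WriteBatch(graph_db)

def create_node(pvalue, svalue, node_type):
    # Queue a node-creation request on the shared batch and return
    # whatever batch.create() hands back for later use.
    return batch.create({
        "pkey": pvalue,
        "skey": svalue,
        "type": node_type,
    })


def create_rel(from_node, type_label, to_node, fields):
    # Queue a relationship request between the two given endpoints.
    properties = {"ACCT_KEY": fields.ACCT_KEY}
    relation = rel(from_node, type_label, to_node, **properties)
    batch.create(relation)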

Then, after using a dictionary to make sure I have not created the nodes before, I do:

node1 = create_node("ATTRIBUTE_1", "ATTRIBUTE_2", "WEBSITE")
node2 = create_node("ATTRIBUTE_3", "ATTRIBUTE_4", "USER")

create_rel(node1, "VISITED_BY", node2, fields)

I save the references to "node1" and "node2" in a dictionary, so that when a later relation involves a website or user that has already been registered, I reuse the stored reference instead of creating the node again. I do this inside a loop and it works fine until I add the following after a certain number of iterations:

batch.submit()
batch.clear()

When I decide to use those references from previous batches, I get the following error:

Traceback (most recent call last):
    File "main.py", line 102, in <module>
        create_rel(cardholder, fraud_label, merchant,fields)
    File "main.py", line 33, in create_rel
        batch.create(relation)
    File "/usr/local/lib/python2.7/dist-packages/py2neo/neo4j.py", line 2775, in create
        "to": self._uri_for(entity.end_node)
    File "/usr/local/lib/python2.7/dist-packages/py2neo/neo4j.py", line 2613, in _uri_for
        uri = "{{{0}}}".format(self.find(resource))
    File "/usr/local/lib/python2.7/dist-packages/py2neo/neo4j.py", line 2604, in find
        raise ValueError("Request not found")
ValueError: Request not found

I believe that this happens because it somehow loses the references from the previous batches and they are no longer valid. I have tried to collect the IDs from the nodes and use those instead, but I cannot find how to do it. Any help would be appreciated, thanks.

My Neo4j version is "2.0.3 community edition for Unix" and my py2neo version is 1.6.4.

Answer 1:

Apologies if this is not clear from the documentation, but references cannot extend across separate batches or batch submissions. The correct way to refer to items created earlier is to parse the results from the first submission and pass the required entities into the second.

I would generally recommend using one batch per submission and avoiding reuse of the same batch object. Future versions of py2neo will likely prevent this anyway.
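
As a rough illustration of the advice above, here is a minimal sketch of a cache that swaps batch references for the submitted results on every flush and starts a fresh batch each time. It assumes that create() returns a reference valid only inside the current batch and that submit() returns one result per request, in the order the requests were added; check both against your py2neo version. NodeCache and FakeBatch are hypothetical names introduced for this example, and FakeBatch is only a stand-in so the sketch runs without a server.

```python
class NodeCache(object):
    """Tracks nodes by key across batch submissions (hypothetical helper)."""

    def __init__(self, batch_factory):
        self._batch_factory = batch_factory  # callable returning a new batch
        self._batch = batch_factory()
        self._known = {}    # key -> entity returned by an earlier submit()
        self._pending = {}  # key -> (reference, position) in the current batch
        self._count = 0     # requests added to the current batch so far

    def node(self, key, properties):
        """Return a relationship endpoint for key, creating it if needed."""
        if key in self._known:
            return self._known[key]
        if key not in self._pending:
            ref = self._batch.create(properties)
            self._pending[key] = (ref, self._count)
            self._count += 1
        return self._pending[key][0]

    def relate(self, relationship):
        """Queue an already-built relationship on the current batch."""
        self._batch.create(relationship)
        self._count += 1

    def flush(self):
        """Submit, remember the returned entities, and start a new batch."""
        results = self._batch.submit()
        for key, (_, position) in self._pending.items():
            self._known[key] = results[position]
        self._pending = {}
        self._count = 0
        self._batch = self._batch_factory()


class FakeBatch(object):
    """Stand-in for a write batch (assumption: not part of py2neo)."""

    def __init__(self):
        self.requests = []

    def create(self, abstract):
        self.requests.append(abstract)
        return len(self.requests) - 1  # reference = position in this batch

    def submit(self):
        # Pretend the server returns one concrete entity per request.
        return [("entity", request) for request in self.requests]


cache = NodeCache(FakeBatch)
site = cache.node(("WEBSITE", "a.com"), {"pkey": "a.com", "type": "WEBSITE"})
user = cache.node(("USER", "bob"), {"pkey": "bob", "type": "USER"})
cache.relate((site, "VISITED_BY", user))
cache.flush()

# After the flush, the same key yields the stored entity, not a stale reference.
site_again = cache.node(("WEBSITE", "a.com"), {"pkey": "a.com", "type": "WEBSITE"})
```

With the code from the question, the factory would presumably be something like lambda: neo4j.WriteBatch(graph_db), and the argument to relate() would be built with rel(site, "VISITED_BY", user, ACCT_KEY=fields.ACCT_KEY). Keeping the position of each request alongside its reference is what lets flush() match results back to keys without relying on anything beyond result order.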



Tags: neo4j py2neo