I have a graph with approximately nine million nodes, and twelve million relationships. For each of the nodes within the graph there is a subset of properties for each respective node which forms a unique identity for the node, by Label. The graph is being updated by various data sources which augment existing nodes within the graph, or create new nodes if the nodes don't exist. I don't want to create duplicates according to the unique set of properties within the graph during the update.
For example, I have People in the graph, and their uniqueness is determined by their first name and last name. The following code is to create two distinct people:
CREATE (p:Person{first:"barry",last:"smith",height:187});
CREATE (p:Person{first:"fred",last:"jones",language:"welsh"});
Later, from one of the data sources I receive the following data records (one per line):
first: "fred", last: "lake", height: 201
first: "barry", last: "smith", language: "english"
first: "fred", last: "jones", language: "welsh", height: 188
first: "fred", last: "jones", eyes: "brown"
first: "barry", last: "smith"
After updating the graph I want to have the following nodes:
(:Person{first:"fred",last:"jones",language:"welsh",height:"188,eyes:"brown"})
(:Person{first:"barry",last:"smith",language"english",height:187})
(:Person{first:"fred",last"lake",height:201})
I'm trying to formulate a MERGE
query which can do this kind of update. I have come up with the following approach:
- Start with a
MERGE
that uses the uniqueness properties (first
andlast
from the example) to find or create the initial node; - Then do a
SET
containing each property defined in the incoming record.
So, for the three examples records given above:
MERGE (p:Person{first:"fred",last:"lake"}) SET p.height = 201;
MERGE (p:Person{first:"barry",last:"smith"}) SET p.language = "english";
MERGE (p:Person{first:"fred",last:"jones"}) SET p.language = "welsh", p.height = 188;
MERGE (p:Person{first:"fred",last:"jones"}) SET p.eyes = "brown";
MERGE (p:Person{first:"barry",last:"smith"});
I've tried this out and it works, but I'm curious to know whether this is the best way (most efficient...) to ensure uniqueness in the nodes based on a set of properties, and allow additional information to be added (or not) as updates come in over time?