Creating a metabolic pathway in Neo4j

I am attempting to create the glycolytic pathway shown in the image at the bottom of this question, in Neo4j, using these data:

glycolysis_bioentities.csv

name
α-D-glucose
glucose 6-phosphate
fructose 6-phosphate
"fructose 1,6-bisphosphate"
dihydroxyacetone phosphate
D-glyceraldehyde 3-phosphate
"1,3-bisphosphoglycerate"
3-phosphoglycerate
2-phosphoglycerate
phosphoenolpyruvate
pyruvate
hexokinase
glucose-6-phosphatase
phosphoglucose isomerase
phosphofructokinase
"fructose-bisphosphate aldolase, class I"
triosephosphate isomerase (TIM)
glyceraldehyde-3-phosphate dehydrogenase
phosphoglycerate kinase
phosphoglycerate mutase
enolase
pyruvate kinase

glycolysis_relations.csv

source,relation,target
α-D-glucose,substrate_of,hexokinase
hexokinase,yields,glucose 6-phosphate
glucose 6-phosphate,substrate_of,glucose-6-phosphatase
glucose-6-phosphatase,yields,α-D-glucose
glucose 6-phosphate,substrate_of,phosphoglucose isomerase
phosphoglucose isomerase,yields,fructose 6-phosphate
fructose 6-phosphate,substrate_of,phosphofructokinase
phosphofructokinase,yields,"fructose 1,6-bisphosphate"
"fructose 1,6-bisphosphate",substrate_of,"fructose-bisphosphate aldolase, class I"
"fructose-bisphosphate aldolase, class I",yields,D-glyceraldehyde 3-phosphate
D-glyceraldehyde 3-phosphate,substrate_of,glyceraldehyde-3-phosphate dehydrogenase
D-glyceraldehyde 3-phosphate,substrate_of,triosephosphate isomerase (TIM)
triosephosphate isomerase (TIM),yields,dihydroxyacetone phosphate
glyceraldehyde-3-phosphate dehydrogenase,yields,"1,3-bisphosphoglycerate"
"1,3-bisphosphoglycerate",substrate_of,phosphoglycerate kinase
phosphoglycerate kinase,yields,3-phosphoglycerate
3-phosphoglycerate,substrate_of,phosphoglycerate mutase
phosphoglycerate mutase,yields,2-phosphoglycerate
2-phosphoglycerate,substrate_of,enolase
enolase,yields,phosphoenolpyruvate
phosphoenolpyruvate,substrate_of,pyruvate kinase
pyruvate kinase,yields,pyruvate

This is what I have, thus far,

... using this cypher code (passed to Cycli or cypher-shell):

LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
MERGE (s:Glycolysis {source: row.source})
MERGE (r:Glycolysis {relation: row.relation})
MERGE (t:Glycolysis {target: row.target})
FOREACH (x in case row.relation when "substrate_of" then [1] else [] end |
  MERGE (s)-[r:substrate_of]->(t)
)
FOREACH (x in case row.relation when "yields" then [1] else [] end |
  MERGE (s)-[r:yields]->(t)
  );

I'd like to create the fully-connected pathway, with captions on all the nodes. Suggestions?

标签： neo4j cypher

3条回答

小情绪 Triste *

2楼-- · 2019-02-13 21:29

[UPDATED]

There are multiple issues and possible improvements:

The second MERGE should be deleted, since it creates orphaned nodes. A relationship type should not be tuned into a Glycolysis node, and such nodes would never be connected to any other nodes.
The 1st and 3rd MERGE clauses must use the same property name (say, name) for source and target nodes, or else the same chemical can end up with 2 nodes (with different property keys). This is why you ended up with nodes that did not have all the expected connections.
The APOC procedure apoc.cypher.doIt can be used to simplify somewhat the MERGE of relationships with dynamic names.
The glycolysis_bioentities.csv is not needed for this use case.

With the above changes, you end up with something like this, which will generate a connected graph that matches your input data:

LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
MERGE (s:Glycolysis {name: row.source})
MERGE (t:Glycolysis {name: row.target})
WITH s, t, row
CALL apoc.cypher.doIt(
  'MERGE (s)-[r:' + row.relation + ']->(t)',
  {s:s, t:t}) YIELD value
RETURN 1;

0人赞添加讨论(0) 举报

【Aperson】

3楼-- · 2019-02-13 21:31

@cybersam's answer is excellent, providing the most elegant solution (once again: thank you!) -- please upvote that accepted answer.

Since this question/answer/topic is likely to be of interest to others, I wanted to mention that my code (based on this SO thread, How to specify relationship type in CSV?, and modified per the hints provided by @cybersam) now works, and show the result:

Solution 1 (my original post, updated):

LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
MERGE (s:Glycolysis {name:row.source})
MERGE (t:Glycolysis {name:row.target})
FOREACH (x in case row.relation when "substrate_of" then [1] else [] end |
  MERGE (s)-[r:substrate_of]->(t)
)
FOREACH (x in case row.relation when "yields" then [1] else [] end |
  MERGE (s)-[r:yields]->(t)
  );

Solution 2 (cybersam's, updated):

LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
MERGE (s:Metabolism:Glycolysis {name: row.source})
MERGE (t:Metabolism:Glycolysis {name: row.target})
WITH s, t, row
  // "Bug" -- additional duplicate relations with each iteration of this statement/script:
  // CALL apoc.create.relationship(s, row.relation, {}, t) YIELD rel
  // Solution: 
  // https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/271
  // https://stackoverflow.com/questions/47808421/neo4j-load-csv-to-create-dynamic-relationship-types
  CALL apoc.merge.relationship(s, row.relation, {}, {}, t) YIELD rel
RETURN COUNT(*);

Both solutions generate the identical graph, below. :-D

0人赞添加讨论(0) 举报

劳资没心，怎么记你

4楼-- · 2019-02-13 21:37

If permitted, I'd like to post one more follow-on answer -- my reason being that currently there is very little out there on recreating metabolic pathways in Neo4j, and the following will provide a complete summary under this StackOverflow title/subject, "Creating a metabolic pathway in Neo4j".

Like my Glycolysis pathway, above, I recreated in Neo4j the TCA (citric acid cycle | Kreb's cycle) pathway:

[TCA cycle image source: https://metabolicpathways.stanford.edu/]

An issue that arose during the creation of my TCA pathway graph was that the one of the nodes (the enzyme, "aconitase") was used twice, so during the graph creation MERGE merged the common node aconitase as a single entity, resulting in this layout,

... not this one, as desired,

My solution to that issue was to create the "TCA graph" using node properties, to temporarily differentially-tag the affected source and target nodes (later removing those tags, after the graph was properly created).

I also added a :Metabolism label, so that I could select the individual pathways (:Glycolysis | :TCA) or the complete metabolic network (:Metabolism), as desired.

Lastly, I needed to connect the two pathways (:Glycolysis | :TCA) through their common node, pyruvate, which I was able to do through an APOC procedure (here, appended to the end of my glycolysis.cql (Cypher) script.

Here are my CSV data files, *.cql Cypher scripts, script execution, and the resultant graph.

glycolysis.csv:

source,relation,target
α-D-glucose,substrate_of,hexokinase
hexokinase,yields,glucose 6-phosphate
glucose 6-phosphate,substrate_of,glucose-6-phosphatase
glucose-6-phosphatase,yields,α-D-glucose
glucose 6-phosphate,substrate_of,phosphoglucose isomerase
phosphoglucose isomerase,yields,fructose 6-phosphate
fructose 6-phosphate,substrate_of,phosphofructokinase
phosphofructokinase,yields,"fructose 1,6-bisphosphate"
"fructose 1,6-bisphosphate",substrate_of,"fructose-bisphosphate aldolase, class I"
"fructose-bisphosphate aldolase, class I",yields,D-glyceraldehyde 3-phosphate
D-glyceraldehyde 3-phosphate,substrate_of,glyceraldehyde-3-phosphate dehydrogenase
D-glyceraldehyde 3-phosphate,substrate_of,triosephosphate isomerase (TIM)
triosephosphate isomerase (TIM),yields,dihydroxyacetone phosphate
glyceraldehyde-3-phosphate dehydrogenase,yields,"1,3-bisphosphoglycerate"
"1,3-bisphosphoglycerate",substrate_of,phosphoglycerate kinase
phosphoglycerate kinase,yields,3-phosphoglycerate
3-phosphoglycerate,substrate_of,phosphoglycerate mutase
phosphoglycerate mutase,yields,2-phosphoglycerate
2-phosphoglycerate,substrate_of,enolase
enolase,yields,phosphoenolpyruvate
phosphoenolpyruvate,substrate_of,pyruvate kinase
pyruvate kinase,yields,pyruvate

tca.csv:

source,relation,target,tag1,tag2
pyruvate,substrate_of,pyruvate dehydrogenase,,
pyruvate dehydrogenase,yields,acetyl CoA,,
acetyl CoA,substrate_of,citrate synthase,,
oxaloacetate,substrate_of,citrate synthase,,
citrate synthase,yields,citrate,,
citrate,substrate_of,aconitase,,1
aconitase,yields,cis-aconitate,1,
cis-aconitate,substrate_of,aconitase,,2
aconitase,yields,isocitrate,2,
isocitrate,substrate_of,isocitrate dehydrogenase,,
isocitrate dehydrogenase,yields,α-ketoglutarate,,
α-ketoglutarate,substrate_of,α-ketoglutarate dehydrogenase,,
α-ketoglutarate dehydrogenase,yields,succinyl-CoA,,
succinyl-CoA,substrate_of,succinyl-CoA synthetase,,
succinyl-CoA synthetase,yields,succinate,,
succinate,substrate_of,succinate dehydrogenase,,
succinate dehydrogenase,yields,fumarate,,
fumarate,substrate_of,fumarase,,
fumarase,yields,S-malate,,
S-malate,substrate_of,malate dehydrogenase,,
malate dehydrogenase,yields,oxaloacetate,,

"tag1" and "tag"2 in "tsv.csv" are used to uniquely those source and target nodes, when they are created via the "tca.cql" script:

tca.cql:

// CREATE INDICES:
CREATE INDEX ON :Metabolism(name);
CREATE INDEX ON :TCA(name);

// CREATE GRAPH:
// USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/tca.csv" AS row
MERGE (s:Metabolism:TCA {name: row.source, tag:COALESCE(row.tag1, '')})
MERGE (t:Metabolism:TCA {name: row.target, tag:COALESCE(row.tag2, '')})
WITH s, t, row
  CALL apoc.merge.relationship(s, row.relation, {}, {}, t) YIELD rel
  REMOVE s.tag, t.tag
RETURN COUNT(*);

glycolysis.cql:

// CREATE INDICES:
CREATE INDEX ON :Metabolism(name);
CREATE INDEX ON :Glycolysis(name);

// CREATE GRAPH:
//USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/glycolysis.csv" AS row
MERGE (s:Metabolism:Glycolysis {name: row.source})
MERGE (t:Metabolism:Glycolysis {name: row.target})
WITH s, t, row
  CALL apoc.merge.relationship(s, row.relation, {}, {}, t) YIELD rel
RETURN COUNT(*);

// MERGE COMMON NODE (GLYCOLYSIS: PYRUVATE; TCA: PYRUVATE):
// As presented, run "tca.cql" first, then "glycolysis.cql"

MATCH (g:Glycolysis), (t:TCA) WHERE g.name = t.name
CALL apoc.refactor.mergeNodes([g,t]) YIELD node
  RETURN node;

Script execution:

$ cat tca.cql |  cypher-shell -u *** -p ***
  COUNT(*)
  21

$ cat glycolysis.cql |  cypher-shell -u *** -p ***
  COUNT(*)
  22
  node
  (:Metabolism:TCA:Glycolysis {name: "pyruvate"})

$

Neo4j graph (:Metabolism view):

0人赞添加讨论(0) 举报

Creating a metabolic pathway in Neo4j

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间