Neo4j: Enforcing schema with XSD

Posted 2019-07-18 05:31

Question:

I was wondering if there exists a tool for Neo4j that can read an XSD file and use it to enforce a schema on Neo4j.

I'm a newbie to graph databases, but I'm starting to appreciate the schema-less approach. There are a lot of projects out there that have been pumping in a lot of non-sequential data and making sense of it all, which is really cool.

I've come across some requirements that call for control over which properties a node or edge can have given a certain label, and which labels an edge can have given the labels of its source and destination nodes. The schema is also subject to change, though not frequently.
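To make those two kinds of rule concrete, here is a minimal sketch of what an application-level check might look like. Everything in it is an assumption for illustration: the labels, property names, relationship types, and the `check_node` / `check_relationship` helpers are made up, and nothing here talks to Neo4j.

```python
# Hypothetical rule tables; every label, property, and relationship
# type below is invented for illustration.
ALLOWED_PROPERTIES = {
    "Person": {"name", "email"},
    "Company": {"name"},
}

# (source label, target label) -> relationship types allowed between them
ALLOWED_RELATIONSHIPS = {
    ("Person", "Company"): {"WORKS_AT"},
    ("Person", "Person"): {"KNOWS"},
}


def check_node(label, properties):
    """Return a list of violations for one node (empty list means valid)."""
    allowed = ALLOWED_PROPERTIES.get(label)
    if allowed is None:
        return [f"unknown label: {label}"]
    extra = sorted(set(properties) - allowed)
    return [f"property not allowed on {label}: {name}" for name in extra]


def check_relationship(source_label, rel_type, target_label):
    """Return a list of violations for one relationship."""
    allowed = ALLOWED_RELATIONSHIPS.get((source_label, target_label), set())
    if rel_type in allowed:
        return []
    return [f"{rel_type} not allowed from {source_label} to {target_label}"]
```

A check like this would have to run before every write, which is exactly the "schema lives in the application" situation described next.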

As I understand it, the standard practice is to control the schema from the application itself, which to me doesn't seem like it should be a BEST practice. For example, the picky developers from Oracle land create views for applications to interact with, and then apply triggers to those views that execute the appropriate transactions when the application attempts to insert or update on the view.

I would be looking for a similar device in Neo4j and since I already have the XSD files, it would be a lot less work overall to simply dump them into a folder and have it use those for reference on what to enforce.

This is something I'm willing to write myself unless there's already a library out there for this. I have a day job after all. :)

Thanks!

Answer 1:

Not only does this tool not exist, but it couldn't even exist without more work on standardizing how XML is stored in neo4j. There are key differences between the XML model and the neo4j model.

There is a Python application that can import XML into neo4j (documents, not schemas). But given the way it does that, there are many things to keep in mind:

  1. There's no obvious mapping from XML elements/attributes onto neo4j nodes/properties. You'd think that elements should be nodes and attributes should be properties, but a better graph model would usually look different. For example, XML namespaces would make great nodes because they connect to so many other things (e.g. all elements defined in a namespace), yet typically they're attributes. Maybe namespaces should be labels? That's also a reasonable choice, but there's no standard answer.
  2. XML trees have sequence, and sequence matters; graphs have no inherent ordering. Say you have an XML element with 2 children, A and B. In neo4j you might have a node connected to two other nodes, but you need a way of expressing (probably via a relationship property) that A comes before B. That's of course doable in neo4j, but there's no agreement as far as I know about how to do that. So maybe you pick a sequence attribute, and give it an integer value. Seems reasonable...but now your schema validation software has a dependency on that design choice. XML in neo4j stored any other way won't validate.
  3. There's a host of XML processing options that matter in schema validation that wouldn't in a graph, for example whether or not you care about ignoring whitespace nodes, strict vs. lax schema validation, and so on.
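
To illustrate the design choice in point 2, here is a minimal sketch that flattens an XML document into node and relationship records, storing sibling order in an integer property. The `seq` property name, the `HAS_CHILD` relationship type, and the `flatten` helper are assumptions made up for this example, not an established convention, and the records are plain dicts rather than anything written to neo4j. Text content and whitespace are ignored entirely, which is itself one of the processing decisions from point 3.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Turn an XML string into (nodes, relationships) lists of plain dicts."""
    nodes, rels = [], []

    def visit(element, parent_id, seq):
        node_id = len(nodes)
        # Elements become nodes, attributes become properties (one possible mapping).
        nodes.append({"id": node_id, "tag": element.tag, "props": dict(element.attrib)})
        if parent_id is not None:
            # "seq" records the child's position among its siblings.
            rels.append({"parent": parent_id, "child": node_id,
                         "type": "HAS_CHILD", "seq": seq})
        for child_seq, child in enumerate(element):
            visit(child, node_id, child_seq)

    visit(ET.fromstring(xml_text), None, 0)
    return nodes, rels
```

Any validator written against this layout only works for data stored with the same `seq` convention, which is the dependency the answer warns about.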

Look, neo4j is great but if you really need to validate a pile of XML documents, it's probably not your best choice because of some mismatches between the graph model and XML's document model. Possible options might be to validate the documents before they go into neo4j, or just to come up with a way of synthesizing XML documents from what is in neo4j, and then validating that result once it's outside of the graph database, as an XML file.
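
The second option (synthesizing XML from the graph and validating it outside the database) could be sketched roughly as follows. This assumes the graph data has already been fetched as plain records with a hypothetical integer `seq` property on each parent-child relationship; the record layout and the `synthesize` helper are invented for illustration. Python's standard library has no XSD validator, so the resulting string would still need to be handed to an external schema-validation tool.

```python
import xml.etree.ElementTree as ET


def synthesize(nodes, rels, root_id):
    """Rebuild an XML string from node and relationship records."""
    by_id = {n["id"]: n for n in nodes}

    def build(node_id):
        node = by_id[node_id]
        element = ET.Element(node["tag"], node["props"])
        # Restore document order from the assumed "seq" property.
        children = sorted(
            (r for r in rels if r["parent"] == node_id),
            key=lambda r: r["seq"],
        )
        for r in children:
            element.append(build(r["child"]))
        return element

    return ET.tostring(build(root_id), encoding="unicode")


# Sample records, deliberately out of order, as a query might return them.
nodes = [
    {"id": 0, "tag": "root", "props": {}},
    {"id": 1, "tag": "B", "props": {"x": "1"}},
    {"id": 2, "tag": "A", "props": {}},
]
rels = [
    {"parent": 0, "child": 1, "seq": 1},
    {"parent": 0, "child": 2, "seq": 0},
]
xml_text = synthesize(nodes, rels, root_id=0)
```

Here `xml_text` holds a document in which A precedes B again, so an XSD that constrains element order has something meaningful to check.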