Does anyone know of a tool that would allow me to take an XML string in Java, check it against a schema, and fix it if it is malformed?
For example, given the following schema and XML document (in that order):
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">
  <xs:element name="tag">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="subtag" type="xs:token" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
<tag>
<subtag>content
</tag>
I am looking for a tool that can read the schema, parse the XML, notice the missing end tag, and add it. For this particular program, I don't need any correction other than missing tags. (By the way, a tool that can locate and add missing tags without using the schema would also be fine.)
Any suggestions?
The trouble is, of course, that for any instance that doesn't conform to the schema, there are an infinite number of "similar" instances that do conform to the schema, and your challenge is to choose the one that is "most similar" on some measure.
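Detecting the problem is the easy half, and the JDK can do it on its own. The sketch below uses the standard javax.xml.validation API to report the first well-formedness or validity error. The class name SchemaCheck, the method firstProblem, and the EXAMPLE_XSD constant are made up for illustration; the constant is a variant of the question's schema with the subtag declaration wrapped in xs:complexType/xs:sequence, which is an assumption about what was intended.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class SchemaCheck {
    // Assumed intent of the question's schema: <tag> contains exactly one <subtag>.
    static final String EXAMPLE_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema' elementFormDefault='qualified'>"
      + "<xs:element name='tag'><xs:complexType><xs:sequence>"
      + "<xs:element name='subtag' type='xs:token'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns null if xml is well-formed and valid, otherwise the first error message. */
    static String firstProblem(String xsd, String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema could not be compiled", e);
        }
        try {
            // With no ErrorHandler set, the validator throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The question's document: <subtag> is never closed.
        System.out.println(firstProblem(EXAMPLE_XSD, "<tag><subtag>content</tag>"));
    }
}
```

Note that the validator only says what is wrong and where. It gives no hint about which of the many possible repairs to pick, which is the hard part described above.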
HTML5 tries to do this, with an elaborate set of rules. These rules contain a lot of knowledge of the specific schema; for example, if a tr is found as a child of a table, then the tr is wrapped in a tbody. You could try to do the same for your schema/vocabulary, but be prepared for a lot of work.
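To give a feel for what one such vocabulary-specific rule looks like in Java, here is a rough sketch of the tr/tbody rule applied to an already-parsed DOM. This is only an illustration: a real HTML5 parser applies the rule while tokenizing tag soup, and it is more careful about where the tbody goes. The class TbodyRule and its methods are hypothetical names, and the code uses only the JDK's DOM API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TbodyRule {
    /** Parses a well-formed XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("input is not well-formed", e);
        }
    }

    /** Moves every tr that is a direct child of a table into a single new tbody. */
    static void wrapRowsInTbody(Document doc) {
        NodeList tables = doc.getElementsByTagName("table");
        for (int i = 0; i < tables.getLength(); i++) {
            Element table = (Element) tables.item(i);
            // Snapshot the direct tr children first, since we are about to move them.
            List<Element> rows = new ArrayList<>();
            for (Node n = table.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof Element && "tr".equals(n.getNodeName())) {
                    rows.add((Element) n);
                }
            }
            if (rows.isEmpty()) {
                continue; // nothing to repair in this table
            }
            // Simplification: one tbody, placed where the first stray row was.
            Element tbody = doc.createElement("tbody");
            table.insertBefore(tbody, rows.get(0));
            for (Element row : rows) {
                tbody.appendChild(row); // appendChild moves the node out of the table
            }
        }
    }
}
```

Every rule of this kind has to be written by someone who knows the vocabulary, which is why doing it for your own schema adds up to a lot of work.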
Doing the same thing for an arbitrary schema sounds like an interesting PhD project. Doing it successfully would probably require some research into the causes of deviations from the schema (just as spelling correction should take into account whether the input was typed by the user, obtained by voice recognition, or obtained by OCR scanning, since each introduces different kinds of errors).
Try JTidy; it can fix up malformed XML as well as HTML.
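If adding a dependency is not an option, the narrow case in the question (missing end tags only, no schema) can be roughed out with a stack of open element names. The sketch below is not JTidy and not a real parser: TagCloser is a made-up name, and the regex assumes there are no comments, CDATA sections, or ">" characters inside attribute values. It closes an unclosed element just before the end tag of its nearest open ancestor, which is only one of the possible repairs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagCloser {
    // group 1: "/" for an end tag; group 2: element name; group 3: "/" if self-closing.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z_][\\w.:-]*)(?:\\s[^<>]*?)?(/?)>");

    /** Returns xml with end tags added for elements that were left open. */
    static String closeMissingTags(String xml) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // names of currently open elements
        Matcher m = TAG.matcher(xml);
        int last = 0;
        while (m.find()) {
            out.append(xml, last, m.start()); // copy the text between tags unchanged
            last = m.end();
            String name = m.group(2);
            boolean isEndTag = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (!isEndTag) {
                if (!selfClosing) {
                    open.push(name);
                }
                out.append(m.group());
            } else if (open.contains(name)) {
                // Close everything that is still open inside the element being ended.
                while (!open.peek().equals(name)) {
                    out.append("</").append(open.pop()).append('>');
                }
                open.pop();
                out.append(m.group());
            }
            // An end tag with no matching start tag is silently dropped.
        }
        out.append(xml.substring(last));
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(closeMissingTags("<tag><subtag>content</tag>"));
        // prints <tag><subtag>content</subtag></tag>
    }
}
```

The output is well-formed but not necessarily valid, so it is worth running the result through a schema validator afterwards.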