Parsing XML with no closing tags in Java

Posted 2019-12-16 18:31

Question:

I am having trouble parsing an XML file that has no closing tags. Please see the snippet of the XML below.

I have tried both the SAX and StAX parsers, but they both need properly formatted XML with closing tags, e.g. <Name>XXYY</Name>. As you can see below, the format here is a little different. Is there any API that can parse this, or can SAX/StAX be made to do what I want?

<Employees>
 <Employee>
  <Detail>
    <Date>2018014
    <Name>XXYY
    <Age>0
    <LANGUAGE>ENG
    <Manager>
    <MName>YYXX
    <MID>5959
    </Manager>
    <EmployeeID>1234
  </Detail>
 </Employee>
</Employees>

Answer 1:

You could "fix" the XML by adding all the missing end-tags.

Any start-tag that contains text after the tag, on the same line, could be fixed by adding an end-tag at the end of the line.

The "contains text" rule ensures that a tag such as <Manager> is left alone, since it is already closed 3 lines further down.

Example working code:

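import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Load file into memory
String xml = new String(Files.readAllBytes(Paths.get("test.xml")), StandardCharsets.UTF_8);

// Add the missing end-tag to every line of the form "<tag>text"
xml = xml.replaceAll("(?m)^(\\s*)<(\\w+)>([^<]+)$", "$1<$2>$3</$2>");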

// Parse then print the XML, to ensure there are no errors
Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                                          .parse(new InputSource(new StringReader(xml)));
TransformerFactory.newInstance().newTransformer()
                  .transform(new DOMSource(document), new StreamResult(System.out));
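
To try the approach against the sample from the question without a file on disk, here is a minimal self-contained sketch. The class name EndTagFixer and the helper names addMissingEndTags and parse are made up for illustration; the regex is the one shown above. It assumes every unclosed element sits on its own line as <tag>text, with a plain tag name and no attributes, because \w+ does not match names containing hyphens, dots or colons.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EndTagFixer {

    // Appends </tag> to every line that consists of "<tag>" followed by text.
    // Lines holding only a start-tag (e.g. <Manager>) or an end-tag are not matched.
    public static String addMissingEndTags(String xml) {
        return xml.replaceAll("(?m)^(\\s*)<(\\w+)>([^<]+)$", "$1<$2>$3</$2>");
    }

    // Parses the repaired text with the standard DOM parser; throws if it is still malformed.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // The sample from the question, inlined so the sketch runs on its own
        String broken = String.join("\n",
                "<Employees>",
                " <Employee>",
                "  <Detail>",
                "    <Date>2018014",
                "    <Name>XXYY",
                "    <Age>0",
                "    <LANGUAGE>ENG",
                "    <Manager>",
                "    <MName>YYXX",
                "    <MID>5959",
                "    </Manager>",
                "    <EmployeeID>1234",
                "  </Detail>",
                " </Employee>",
                "</Employees>");

        Document doc = parse(addMissingEndTags(broken));

        // Read two values back to confirm the repaired document is usable
        System.out.println(doc.getElementsByTagName("Name").item(0).getTextContent()); // prints XXYY
        System.out.println(doc.getElementsByTagName("MID").item(0).getTextContent());  // prints 5959
    }
}
```

Keeping the repair in its own method makes it easy to check the rewritten text before handing it to a parser, which helps when the real input deviates from the one-element-per-line layout assumed here.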


Answer 2:

That appears to be SGML, not XML. I've answered a newer question (for JavaScript/Node.js, but relevant to Java as well) detailing how to use the OpenSP SGML software to create XML from SGML.