How to use JAXB with HTML?

Posted 2019-02-18 21:31

Question:

I would like to unmarshal some messy HTML to a Java object using JAXB. (I'm on Java 7.)

TagSoup is a SAX-compliant parser that can handle messy HTML.

How can I set up JAXB to use TagSoup for unmarshalling HTML?

I tried calling System.setProperty("org.xml.sax.driver", "org.ccil.cowan.tagsoup.Parser");

If I create an XMLReader myself, it uses TagSoup, but JAXB does not.

  1. Does com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl use DOM or SAX for parsing XML?

  2. How can I tell JAXB to use SAX?

  3. How can I tell JAXB to use TagSoup as its SAX implementation?

Following Blaise's suggestion, I tried the code below, but I get a SAXParseException on the last line. The parse succeeds when it is done with the XMLReader alone:

    JAXBContext jaxbContext = JAXBContext.newInstance(Thing.class);
    Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();

    XMLReader xmlReader = new org.ccil.cowan.tagsoup.Parser();

    xmlReader.parse("file:///c:/test.xml");
    System.out.println("parse ok");

    xmlReader.setContentHandler(unmarshaller.getUnmarshallerHandler());

    //SAXParseException; systemId: file:/c:/test.xml; lineNumber: 5; columnNumber: 3; The element type "br" must be terminated by the matching end-tag "</br>".
    Thing thing = (Thing) unmarshaller.unmarshal(new File("c:/test.xml"));

Answer 1:

You can get an UnmarshallerHandler from an Unmarshaller and set it as the ContentHandler on your SAX parser. After the SAX parse completes, obtain the unmarshalled object from the UnmarshallerHandler.

    UnmarshallerHandler unmarshallerHandler = unmarshaller.getUnmarshallerHandler();
    xmlReader.setContentHandler(unmarshallerHandler);
    xmlReader.parse(...);
    Thing thing = (Thing) unmarshallerHandler.getResult();

There is an example of this on my blog:

  • http://blog.bdoughan.com/2011/05/jaxb-and-dtd.html
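On why setting org.xml.sax.driver had no effect: that property is read by org.xml.sax.helpers.XMLReaderFactory. As far as I know, when JAXB is handed a File it builds its own parser through javax.xml.parsers.SAXParserFactory instead, which does not consult that property. This would also explain the SAXParseException in the question, since unmarshaller.unmarshal(new File(...)) never goes through the TagSoup xmlReader configured above it. The sketch below uses only JDK classes to show that the reader obtained from SAXParserFactory is the same before and after the property is set. The class name SaxDriverPropertyDemo and the helper jaxpReaderClassName are made up for illustration, and TagSoup does not need to be on the classpath to run it.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

public class SaxDriverPropertyDemo {

    // Hypothetical helper: builds an XMLReader through SAXParserFactory (JAXP)
    // and returns the name of the implementation class that was chosen.
    public static String jaxpReaderClassName() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            return reader.getClass().getName();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create a SAX parser", e);
        }
    }

    public static void main(String[] args) {
        String before = jaxpReaderClassName();

        // The property from the question. It is only a string here, so the
        // TagSoup class is never loaded by this demo.
        System.setProperty("org.xml.sax.driver", "org.ccil.cowan.tagsoup.Parser");

        String after = jaxpReaderClassName();

        System.out.println("before:    " + before);
        System.out.println("after:     " + after);
        System.out.println("unchanged: " + before.equals(after));
    }
}
```

If the two names match, the property is not what selects the parser on this path, so the reliable route is the one in the answer: let the TagSoup reader drive the parse with the UnmarshallerHandler as its ContentHandler, then call getResult() rather than unmarshal(File).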