I would like to unmarshal some nasty HTML into a Java object using JAXB. (I'm on Java 7.)
TagSoup is a SAX-compliant XML parser that can handle nasty HTML.
How can I set up JAXB to use TagSoup for unmarshalling HTML?
I tried calling System.setProperty("org.xml.sax.driver", "org.ccil.cowan.tagsoup.Parser");
With that set, an XMLReader I create myself uses TagSoup, but JAXB does not.
Does com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl use DOM or SAX for parsing XML?
How can I tell JAXB to use SAX?
How can I tell JAXB to use TagSoup as its SAX implementation?
Following Blaise's suggestion, I tried the code below, but I get a SAXParseException on the last line. Parsing works fine when done with the XMLReader alone:
JAXBContext jaxbContext = JAXBContext.newInstance(Thing.class);
Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();
XMLReader xmlReader = new org.ccil.cowan.tagsoup.Parser();
xmlReader.parse("file:///c:/test.xml");
System.out.println("parse ok");
xmlReader.setContentHandler(unmarshaller.getUnmarshallerHandler());
//SAXParseException; systemId: file:/c:/test.xml; lineNumber: 5; columnNumber: 3; The element type "br" must be terminated by the matching end-tag "</br>".
Thing thing = (Thing) unmarshaller.unmarshal(new File("c:/test.xml"));
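For comparison, here is a minimal sketch of the call order SAX expects: the content handler is registered on the reader before parse() is called, and the result is read from the handler afterwards. In the snippet above, the handler is attached only after parse() has finished, and unmarshal(new File(...)) does not go through xmlReader at all, which would explain why the stock parser reports the unclosed "br". So that it runs without extra jars, the sketch makes two assumptions: the JDK's own parser stands in for org.ccil.cowan.tagsoup.Parser, and a hypothetical RecordingHandler stands in for unmarshaller.getUnmarshallerHandler(). With the real pieces, the object would presumably come from the handler's getResult() once the parse completes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class HandlerBeforeParse {

    // Hypothetical stand-in for JAXB's UnmarshallerHandler:
    // it just records the name of every element it is sent.
    static class RecordingHandler extends DefaultHandler {
        final List<String> elements = new ArrayList<String>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements.add(qName);
        }
    }

    static List<String> parseWithHandler(String xml) throws Exception {
        // Assumption: the JDK parser stands in for
        //   XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        RecordingHandler handler = new RecordingHandler();

        // Register the handler first. SAX events produced by parse()
        // only reach a handler that was set before the call.
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));

        // With JAXB, this is where the handler's getResult() would be read.
        return handler.elements;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseWithHandler("<thing><name>x</name><br/></thing>"));
    }
}
```

The input here is well-formed because the stand-in parser is strict; the point is only the ordering of setContentHandler() and parse().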