Currently, I am trying to clean up an HTML file using JTidy, convert it to XHTML and provide the results to a DOM parser. The following code is the result of these efforts:
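import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Collection;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;
import org.xml.sax.SAXException;

public class HeaderBasedNewsProvider implements INewsProvider {

    /* ... */

    public Collection<INewsEntry> getNewsEntries() throws NewsUnavailableException {
        Document document;
        try {
            document = getCleanedDocument();
        } catch (Exception e) {
            throw new NewsUnavailableException(e);
        }
        // Debug output only; the entries are not extracted yet.
        System.err.println(document.getDocumentElement().getTextContent());
        return null;
    }

    private final Document getCleanedDocument() throws IOException, SAXException, ParserConfigurationException {
        InputStream input = inputStreamProvider.getInputStream();

        // Step 1: let JTidy clean the HTML and write it out as XHTML.
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        ByteArrayOutputStream tidyOutputStream = new ByteArrayOutputStream();
        tidy.parse(input, tidyOutputStream);

        // Step 2: feed the XHTML to a non-validating DOM parser.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        InputStream domInputStream = new ByteArrayInputStream(tidyOutputStream.toByteArray());
        System.err.println(factory.getClass());
        return factory.newDocumentBuilder().parse(domInputStream);
    }
}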
However, the DOM parser implementation (com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl) on my system seems to be incredibly slow. Even for one-line documents such as the following, parsing takes 2-3 minutes:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title></title></head><body><div class="text"><h2>Nachricht vom 16. Juni 2011</h2><h1>Titel</h1><p>Mitteilung <a href="dokumente/medienmitteilungen/MM_NR_jglp.pdf" target="_blank">weiter</a> mehr Mitteilung</p></div></body></html>
Note that - in contrast to the DOM parser - JTidy finishes its work within a second. Therefore, I suspect that I'm somehow misusing the DOM API.
Thanks in advance for any suggestions on this one!
The HTML DTDs are huge and pull in further files through includes, so fetching them takes forever. Use an XML catalog: it lets you store the DTDs locally and map them by their system ID.
If you use a build tool such as Maven, you will find plenty of pointers on how to set one up.
The advantage over intercepting the entity requests, as the accepted answer suggests, is that named entities still resolve to the correct characters.
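As a rough illustration of the catalog approach, here is a minimal sketch that assumes Java 9 or later, where the JDK ships the javax.xml.catalog package. The class name CatalogBackedParser, the catalog file and its location are hypothetical; the catalog is assumed to sit in the same directory as a local copy of the DTD and the entity files it includes.

```java
import java.io.InputStream;
import java.net.URI;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogFeatures.Feature;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/*
 * Hypothetical catalog file (OASIS XML Catalogs format), stored next to the DTD:
 *
 * <?xml version="1.0"?>
 * <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
 *   <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
 *           uri="xhtml1-transitional.dtd"/>
 * </catalog>
 */
final class CatalogBackedParser {

    private CatalogBackedParser() {
    }

    /** Parses the stream, mapping DTD system IDs to local files via the given catalog. */
    static Document parse(InputStream xhtml, URI catalogUri) throws Exception {
        // "continue" falls back to the parser's default lookup when the catalog
        // has no matching entry; the default ("strict") would throw instead.
        CatalogFeatures features = CatalogFeatures.builder()
                .with(Feature.RESOLVE, "continue")
                .build();
        CatalogResolver resolver = CatalogManager.catalogResolver(features, catalogUri);

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // CatalogResolver implements org.xml.sax.EntityResolver.
        builder.setEntityResolver(resolver);
        return builder.parse(xhtml);
    }
}
```

In the code from the question, this would replace the last line of getCleanedDocument() with something like CatalogBackedParser.parse(domInputStream, catalogUri). Because the real DTD is read from disk, entities such as &nbsp; keep their proper values.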
Even when not validating, an XML parser needs to fetch the DTD, for example to support named character entities, and that fetch is most likely where the time goes. You should look into implementing an EntityResolver that resolves the request for the DTD to a local copy.
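A minimal sketch of such a resolver could look like the following. The class name LocalDtdResolver and the idea of keeping the DTD files in one directory, looked up by the last path segment of the system ID, are assumptions made for illustration; a classpath lookup would work just as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Serves DTDs and entity files from a local directory instead of the network. */
final class LocalDtdResolver implements EntityResolver {

    // Hypothetical directory holding local copies, e.g. xhtml1-transitional.dtd.
    private final Path dtdDirectory;

    LocalDtdResolver(Path dtdDirectory) {
        this.dtdDirectory = dtdDirectory;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        if (systemId == null) {
            return null;
        }
        // Use only the file name, so ".../DTD/xhtml1-transitional.dtd" maps to
        // "<dtdDirectory>/xhtml1-transitional.dtd".
        String fileName = systemId.substring(systemId.lastIndexOf('/') + 1);
        if (fileName.isEmpty()) {
            return null;
        }
        Path local = dtdDirectory.resolve(fileName);
        if (!Files.isRegularFile(local)) {
            // Returning null tells the parser to fall back to its default lookup.
            return null;
        }
        InputSource source = new InputSource(Files.newInputStream(local));
        source.setPublicId(publicId);
        // Point the system ID at the local file so that relative includes inside
        // the DTD are looked up next to it.
        source.setSystemId(local.toUri().toString());
        return source;
    }

    /** Parses the stream with a non-validating parser that uses this resolver. */
    static Document parse(InputStream xhtml, Path dtdDirectory)
            throws IOException, SAXException, ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(new LocalDtdResolver(dtdDirectory));
        return builder.parse(xhtml);
    }
}
```

In the code from the question, the final line of getCleanedDocument() would then become a call such as LocalDtdResolver.parse(domInputStream, dtdDirectory), where dtdDirectory is wherever the downloaded DTD and its entity files are kept.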