I'm using JTidy v. r938. I'm using this code to attempt to clean up a page …
final Tidy tidy = new Tidy();
tidy.setQuiet(false);
tidy.setShowWarnings(true);
tidy.setShowErrors(0);
tidy.setMakeClean(true);
Document document = tidy.parseDOM(conn.getInputStream(), null);
But when I parse this URL -- http://www.chicagoreader.com/chicago/EventSearch?narrowByDate=This+Week&eventCategory=93922&keywords=&page=1, things aren't getting cleaned up. For example, the META tags on the page, like
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
remain as
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
instead of having a "</META>" tag or appearing as "<META http-equiv="Content-Type" content="text/html; charset=UTF-8"/>". I confirm this by outputting the resulting JTidy org.w3c.dom.Document as a String.
What can I do to make JTidy truly clean up the page -- i.e. make it well-formed? I realize there are other tools out there, but this question specifically relates to using JTidy.
You need to set several flags on Tidy if you want XML output:
private String cleanData(String data) throws UnsupportedEncodingException {
Tidy tidy = new Tidy();
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setWraplen(Integer.MAX_VALUE);
tidy.setPrintBodyOnly(true);
tidy.setXmlOut(true);
tidy.setSmartIndent(true);
ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
tidy.parseDOM(inputStream, outputStream);
return outputStream.toString("UTF-8");
}
Or, if you simply want XHTML output:
Tidy tidy = new Tidy();
tidy.setXHTML(true);
Use tidy.setXmlTags(true) to parse the input as XML instead of HTML.
Use tidy.setForceOutput(true) (at your own risk) to generate output even if errors are found.
I parse the HTML twice to get well-formed XML:
BufferedReader br = new BufferedReader(new StringReader(str));
StringWriter sw = new StringWriter();
Tidy t = new Tidy();
t.setDropEmptyParas(true);
t.setShowWarnings(false); // hide warnings
t.setQuiet(true); // suppress the summary and other non-essential output
t.setUpperCaseAttrs(false);
t.setUpperCaseTags(false);
t.parse(br,sw);
StringBuffer sb = sw.getBuffer();
String strClean = sb.toString();
br.close();
sw.close();
// second pass: re-parse the cleaned output, this time as XML
br = new BufferedReader(new StringReader(strClean));
sw = new StringWriter();
t = new Tidy();
t.setXmlTags(true);
t.parse(br,sw);
sb = sw.getBuffer();
String strClean2 = sb.toString();
br.close();
sw.close();
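To confirm whether the result really is well-formed, one option is to hand the string to the JDK's own XML parser and see whether it is accepted. The following is only a minimal sketch using standard javax.xml classes; the class name WellFormedCheck and the method isWellFormed are made up for illustration and are not part of JTidy.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of JTidy): checks well-formedness with the JDK parser.
final class WellFormedCheck {

    /** Returns true if the JDK's XML parser accepts the markup as well-formed XML. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Resolve any external DTD to an empty stream so nothing is fetched
            // over the network. Side effect: entities declared only in that DTD
            // (for example &nbsp; in XHTML) will be reported as undefined.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input, so it is not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling WellFormedCheck.isWellFormed(strClean2) after the second pass would then tell you whether the output can be consumed by an XML parser. An unclosed <META ...> inside <head> makes it return false, while the self-closed <META .../> form passes.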