(This is a follow-up to a problem I had a few days ago, where JTidy was reporting 3 errors inside a 300k HTML document, but not reporting where. After some grinding on the problem, I found what appears to be causing the errors, and I have a strong suspicion of why, but I haven't decided what to do about it yet.)
Here is a small standalone HTML document that causes JTidy to report an error:
<html>
<body>
Some text.
<script type="text/javascript">
var foo = "Press <u>ESC</u> to continue";
</script>
</body>
</html>
The JavaScript string constant contains HTML tags, and these consistently throw JTidy off: remove the underline element and JTidy finishes parsing perfectly. More accurately, JTidy's parser reports an error on the closing tag; the opening tag is fine (the output might be somewhat wrong, but it was sufficient for my later purposes). The error is still reported even if the tags appear only inside a JavaScript comment:
// Any closing tags here at all will <b>throw JTidy off</b>.
I think it's safe to say that the above is valid HTML, but I can't find any documentation on what to do about it. Searching around, I find that this has been fixed in tidy-html5; it only appears to be broken in JTidy, the Java port.
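One stopgap I am considering is to pre-process the document before handing it to JTidy, escaping every "</" inside a script body as "<\/". JavaScript treats the two identically inside string literals, and the change is harmless inside comments. Below is a minimal sketch using only the JDK. ScriptTagEscaper and escapeClosingTags are hypothetical names of my own, not part of JTidy. The regex assumes script tags have no ">" inside attribute values, and the whole idea rests on the assumption (which still needs checking against JTidy itself) that the parser only objects to the literal "</" sequence.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of JTidy: escapes "</" inside script bodies. */
final class ScriptTagEscaper {

    // Group 1: opening <script ...> tag, group 2: script body, group 3: closing </script> tag.
    // Assumption: no '>' appears inside the opening tag's attribute values.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "(<script\\b[^>]*>)(.*?)(</script\\s*>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the HTML with every "</" in a script body rewritten as "<\/". */
    static String escapeClosingTags(String html) {
        Matcher m = SCRIPT_BLOCK.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Only the body is rewritten; the real </script> end tag is left alone.
            String body = m.group(2).replace("</", "<\\/");
            String block = m.group(1) + body + m.group(3);
            m.appendReplacement(out, Matcher.quoteReplacement(block));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<html>\n<body>\nSome text.\n"
                + "<script type=\"text/javascript\">\n"
                + "var foo = \"Press <u>ESC</u> to continue\";\n"
                + "</script>\n</body>\n</html>";
        // Print the pre-processed document that would be passed on to JTidy.
        System.out.println(escapeClosingTags(html));
    }
}
```

The escaped text would then be fed to JTidy in place of the original. This only sidesteps the problem. It does not fix the parser, and it could change the meaning of a script that uses "</" outside a string, regex literal, or comment.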
Searching a bit more, I find that I am using the latest JTidy, according to its SourceForge page; version r938 is the one in my Maven repo. (Actually, the source is unpacked in a sandbox, so that I could debug this problem.) The bug report I linked above is dated 2015; JTidy r938 came out in 2009.
Am I correct in believing JTidy is handling this incorrectly? If so, should I try to fix it, or has it been addressed in some private branch? I wouldn't call myself a parser / lexer expert, but I could muddle through if I had to.