I have an HTML page that is generated by an existing tool - I cannot change the output of this tool.
However, I want to use xmllint with the --xpath option to pick out a few specific pieces of information from the downloaded webpage. The problem is that the page starts with:
<html lang=en><head>...
And xmllint throws errors almost immediately:
html.out:2: parser error : AttValue: " or ' expected
<html lang=en><head>
^
The issue certainly seems to be the missing quotation marks around the value of the lang attribute. The same problem occurs throughout the page, though only sporadically.
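To double-check that diagnosis, here is a minimal sketch that reproduces the failure with a different strict XML parser (Python's standard-library ElementTree, not xmllint itself). The two sample strings and the helper name xml_parse_error are made up for illustration; they differ only in whether the lang value is quoted.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the real page; only the quoting differs.
UNQUOTED = "<html lang=en><head></head></html>"
QUOTED = '<html lang="en"><head></head></html>'


def xml_parse_error(markup):
    """Return the strict XML parser's error message, or None if it parses."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # The unquoted variant should be rejected, the quoted one accepted.
    print("unquoted:", xml_parse_error(UNQUOTED))
    print("quoted:  ", xml_parse_error(QUOTED))
```

If the quoted variant parses and the unquoted one does not, the unquoted attribute value alone is enough to trip a strict XML parser, which is consistent with the errors shown here.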
Nearly every browser can parse this just fine - how can I convince xmllint to do so as well? I would like to avoid having to inject an intermediate step to "fix" the file. Instead, I would like to either:
1) Find a flag, validation option, etc. that helps the parser along, or:
2) Use some other tool. (But what? xmllint is always my go-to for command line XPath commands.)
Further, using the plain xpath command instead results in:
> xpath html.out '//myquery...'
not well-formed (invalid token) at line 2, column 11, ...