I don't want to know how to solve the problem, because I have solved it on my own. I'm just asking if it is really a bug and whether and how I should report it. You can find the code and the output below:

    from html.parser import HTMLParser


    class MyParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            # Print the value of every href attribute on the tag.
            for at in attrs:
                if at[0] == 'href':
                    print(at[1])
            return super().handle_starttag(tag, attrs)

        def handle_data(self, data):
            return super().handle_data(data)

        def handle_endtag(self, tag):
            return super().handle_endtag(tag)


    s = '<a href="/home?ID=123&gt3=7">nomeLink</a>'
    p = MyParser()
    p.feed(s)

The following is the output:
"/home?ID=123>3=7"
No, it is not a bug. You are feeding the parser invalid HTML; the correct way to include `&` in a URL in an HTML attribute is to escape it as `&amp;`.
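As a minimal sketch of that escaping step (not part of the original answer): the standard-library `html.escape` function can produce the escaped attribute value, and the parser then hands back the original URL. The `HrefCollector` class and the variable names are hypothetical, introduced only for this illustration.

```python
import html
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects href values instead of printing them."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'href':
                self.hrefs.append(value)


raw_url = '/home?ID=123&gt3=7'
# html.escape turns & into &amp; (and, with quote=True, also escapes quotes).
escaped_url = html.escape(raw_url, quote=True)
markup = '<a href="{}">nomeLink</a>'.format(escaped_url)

collector = HrefCollector()
collector.feed(markup)
collector.close()

print(escaped_url)      # /home?ID=123&amp;gt3=7
print(collector.hrefs)  # ['/home?ID=123&gt3=7']
```

With the ampersand escaped, the parser no longer has to guess whether `&gt` was meant as an entity, so the value survives the round trip unchanged.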
The parser did its best (as required by the HTML standard) and gave you 'repaired' data to the best of its ability. In this case, it tried to repair another common broken-HTML error: spelling `&gt;` as `&gt` (forgetting the `;` semicolon).

Rather than build on top of the (rather low-level) `html.parser` library yourself, I recommend you use BeautifulSoup instead. BeautifulSoup supports multiple parsers, and some of those can handle broken HTML better than others.

For example, the `html5lib` parser can handle unescaped ampersands in attributes better than `html.parser` can.

For completeness' sake, the third supported parser, `lxml`, also handles unescaped ampersands as if they are escaped.

You could use `lxml` and `html5lib` directly, but then you'd forgo the nice high-level API that BeautifulSoup offers.

Python 3.3.2 (v3.3.2, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32
Feed `s = '<p a="&#39;">'` to MyHTMLParser.
This is a valid HTML tag, where `&#39;` stands for `'`. In this case MyHTMLParser reports the attribute value already decoded to `'`.
The reason for this result is the use of the `unescape` function,
where `self.unescape` is an internal helper that removes special character quoting and is applied to attribute values only. See lines 504-532 in parser.py.
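The attribute decoding described above can be reproduced with a short sketch. It assumes the attribute value is the numeric character reference `&#39;`. It also uses the public `html.unescape` function (available since Python 3.4) to show the same decoding outside the parser, rather than the internal `self.unescape` helper from 3.3. `AttrCollector` is a hypothetical name used only here.

```python
import html
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


parser = AttrCollector()
parser.feed('<p a="&#39;">')
parser.close()

# The attribute value arrives already decoded.
print(parser.seen)             # [('p', [('a', "'")])]
# The same decoding, applied directly to the character reference.
print(html.unescape('&#39;'))  # '
```

The handler never sees the raw `&#39;` text, so code that needs the original spelling of an attribute has to recover it another way, for example from `get_starttag_text()`.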