Why do the following two code snippets give different outputs? The only difference between them is that the h1 tag in the first case is replaced with an h tag in the second case. Is this because the h1 tag has a special "meaning" in HTML? I tried h1 through h6 and all of them give [] as output, while with h7 it starts to give [u'xxx'] as output.
from scrapy import Selector # scrapy version: 1.2.2
text = '<h1><p>xxx</p></h1>'
print Selector(text=text).xpath('//h1/p/text()').extract()
Output[1]: []
text = '<h><p>xxx</p></h>'
print Selector(text=text).xpath('//h/p/text()').extract()
Output[2]: [u'xxx']
Including p tags inside h# tags is invalid according to the W3C. You can see more about this here. Anyway, to bypass this and work with any XML structure, you can change the selector type to xml, that is, pass type='xml' to Selector. This will respect any XML structure.
The short answer is that h1 through h6 should not contain <p> in well-formed HTML documents; at least lxml (which powers Scrapy Selectors) does not like that when parsing HTML. lxml does handle bad formatting, but this case is a bit different. You can check this by looking at how lxml parses the HTML snippet and serializes it back.
When lxml encounters the p tag within the h1, it puts the p after the h1. The p element is not lost, but it is not where you would expect it from reading the HTML source.
In the other snippet, h elements do not mean anything special to lxml, so "p within h" is OK.