Behavior of the scrapy xpath selector on h1-h6 tag

Why do the following two code snippets give different outputs? The only difference between them is that the h1 tag in the first case is replaced with an h tag in the second case. Is this because the h1 tag has a special "meaning" in HTML? I tried h1 through h6 and all of them give [] as output, while from h7 onward the output is [u'xxx'].

from scrapy import Selector # scrapy version: 1.2.2

text = '<h1><p>xxx</p></h1>'
print Selector(text=text).xpath('//h1/p/text()').extract()
Output[1]: []

text = '<h><p>xxx</p></h>'
print Selector(text=text).xpath('//h/p/text()').extract()
Output[2]: [u'xxx']

Tags: python html xpath scrapy selector

2 Answers

家丑人穷心不美

#2 · 2020-03-31 06:23

Placing p tags inside h1-h6 is invalid HTML according to the W3C. You can see more about this here

To work around this and handle any XML structure, you can change the selector type like this:

sel = Selector(text="anyxml", type="xml")

With type="xml" the input is parsed as XML, so the nesting is kept exactly as written instead of being adjusted to HTML rules.
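As a rough illustration of why an XML parser sidesteps the problem, the sketch below uses Python's standard-library xml.etree.ElementTree as a stand-in for Selector(..., type="xml"). That substitution is an assumption made so the example runs without Scrapy installed; it is not what Scrapy uses internally. An XML parser has no notion of HTML content rules, so the p stays nested inside the h1.

```python
import xml.etree.ElementTree as ET

# Same snippet as in the question; it happens to be well-formed XML.
text = '<h1><p>xxx</p></h1>'

# The root element is the h1 itself, so look for p among its direct children.
root = ET.fromstring(text)
xml_texts = [p.text for p in root.findall('./p')]

print(root.tag, xml_texts)  # prints: h1 ['xxx']
```

With Scrapy itself, the corresponding query would be Selector(text=text, type="xml").xpath('//h1/p/text()').extract().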


Anthone

#3 · 2020-03-31 06:25

The short answer is that h1..h6 should not contain <p> in valid HTML documents; at least, lxml (which powers Scrapy Selectors) does not like that when parsing HTML. lxml does handle badly formatted markup, but this case is a bit different.

You can check how lxml parses and serializes back the HTML snippet:

>>> from scrapy import Selector
>>> text = '<h1><p>xxx</p></h1>'
>>> s = Selector(text=text)
>>> print(s.extract())
<html><body><h1></h1><p>xxx</p></body></html>

So when lxml encounters the p tag within the h1, it closes the h1 and places the p after it. The p element is not lost, but it's not where you'd expect it when reading the HTML source.

Compare with the other snippet:

>>> text = '<h><p>xxx</p></h>'
>>> s = Selector(text=text)
>>> print(s.extract())
<html><body><h><p>xxx</p></h></body></html>
>>>

h elements do not mean anything special for lxml, so "p within h" is ok.
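To make the consequence for the XPath concrete, here is a minimal sketch that queries the two serialized trees shown in the sessions above. It uses the standard-library xml.etree.ElementTree only to walk those already-repaired trees (an assumption chosen so the example runs without Scrapy or lxml); the helper name texts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Serialized trees copied from the two interpreter sessions above.
repaired_h1 = '<html><body><h1></h1><p>xxx</p></body></html>'
untouched_h = '<html><body><h><p>xxx</p></h></body></html>'


def texts(markup, path):
    """Return the text of every element matching an ElementTree path."""
    return [el.text for el in ET.fromstring(markup).findall(path)]


h1_child = texts(repaired_h1, './/h1/p')    # p is no longer a child of h1
h1_sibling = texts(repaired_h1, './body/p')  # p now sits next to h1 under body
h_child = texts(untouched_h, './/h/p')      # p is still nested inside h

print(h1_child, h1_sibling, h_child)  # prints: [] ['xxx'] ['xxx']
```

In Scrapy, against the original HTML input, an XPath such as //h1/following-sibling::p/text() should reach the relocated paragraph, since the p ends up as a sibling of the h1 in the parsed tree.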
