Scrapy - how to convert string into an object whic

Posted 2019-07-19 12:26

Let's say I have some plain text in HTML-like format like this:

<div id="foo"><p id="bar">Some random text</p></div>

And I need to be able to run XPath on it to retrieve some inner element. How can I convert plain text to some kind of object which I could use XPath on?

Tags: xpath scrapy
3 answers
乱世女痞
#2 · 2019-07-19 12:39

Andersson already posted a solution to my question. Here is a second one that I just discovered. It uses Scrapy's own classes, so all the methods already familiar to a Scrapy user (e.g., extract(), extract_first()) are available.

from scrapy.http import HtmlResponse

text = """<div id="foo"><p id="bar">Some random text</p></div>"""
# First, encode the text to bytes
text_encoded = text.encode('utf-8')
# Now wrap it in an HtmlResponse object (the URL is only a placeholder)
text_in_html = HtmlResponse(url='some url', body=text_encoded, encoding='utf-8')
# Now XPath works as it would on an ordinary HTML response
text_in_html.xpath('//p/text()').extract_first()
Lonely孤独者°
#3 · 2019-07-19 12:45

You can also build a plain Selector from the text and run the same XPath or CSS queries on it directly:

from scrapy import Selector

...

sel = Selector(text='<div id="foo"><p id="bar">Some random text</p></div>')
selected_xpath = sel.xpath('//div[@id="foo"]')
疯言疯语
#4 · 2019-07-19 12:46

You can pass the HTML sample as a string to lxml.html and query the parsed result with XPath:

from lxml import html

code = """<div id="foo"><p id="bar">Some random text</p></div>"""
source = html.fromstring(code)
source.xpath('//div/p/text()')
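
Note that lxml's xpath() returns a plain Python list, so the result usually has to be indexed or checked for emptiness. A minimal sketch of that, assuming lxml is installed; the names texts, first_text, and paragraph are illustrative only:

```python
from lxml import html

code = '<div id="foo"><p id="bar">Some random text</p></div>'
source = html.fromstring(code)

# text() matches come back as a list of strings
texts = source.xpath('//div/p/text()')

# Guard against an empty result before indexing
first_text = texts[0] if texts else None

# Element matches come back as a list of elements
paragraph = source.xpath('//p[@id="bar"]')[0]

print(first_text)
print(paragraph.get('id'))
```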