I am new to Scrapy and want to crawl some data from the web. The HTML documents I get look like one of the two structures below.
dom style1:
<div class="user-info">
<p class="user-name">
something in p tag
</p>
text data I want
</div>
dom style2:
<div class="user-info">
<div>
<p class="user-img">
something in p tag
</p>
something in div tag
</div>
<div>
<p class="user-name">
something in p tag
</p>
text data I want
</div>
</div>
I want to extract the text marked "text data I want". Right now I can get it with a CSS or XPath selector by checking which of the two structures exists, but I would like to know a better way.
For example, I can select p.user-name with CSS first, then get its parent, and then take that parent's div/text(). The data I want is always the text() of the immediate parent div of p.user-name. So the question is: how can I get the immediate parent of p.user-name?
With XPath you can traverse the tree in any direction (parent, sibling, child, etc.), whereas CSS has no combinator for stepping up to a parent.
In your case you can get the node's parent with XPath's parent notation, written as two dots (..):
//p[@class='user-name']/../text()
Explanation:
//p[@class='user-name'] - find <p> nodes whose class attribute is user-name.
/.. - select that node's parent.
/text() - select the text nodes that are direct children of the current node (here, the parent).
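This XPath should work in both of the cases you describe. Note that the parent's text() also includes whitespace-only nodes around the <p>, so strip or filter the results.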
What about using the following-sibling axis?
>>> s = scrapy.Selector(text='''<div class="user-info">
... <p class="user-name">
... something in p tag
... </p>
... text data I want
... </div>''')
>>> username = s.css('p.user-name')[0]
>>> username.xpath('following-sibling::text()[1]').get()
'\n text data I want\n'
>>>
>>> s2 = scrapy.Selector(text='''<div class="user-info">
... <div>
... <p class="user-img">
... something in p tag
... </p>
... something in div tag
... </div>
... <div>
... <p class="user-name">
... something in p tag
... </p>
... text data I want
... </div>
... </div>''')
>>> username = s2.css('p.user-name')[0]
>>> username.xpath('following-sibling::text()[1]').get()
'\n text data I want\n '
>>>