How to get immediate parent node with scrapy in py

Posted 2019-08-03 20:45

Question:

I am new to scrapy and want to crawl some data from the web. The HTML documents I get look like the ones below.

dom style1:
<div class="user-info">
    <p class="user-name">
        something in p tag
    </p>
    text data I want
</div>

dom style2:
<div class="user-info">
    <div>
        <p class="user-img">
            something in p tag
        </p>
        something in div tag
    </div>
    <div>
        <p class="user-name">
            something in p tag
        </p>
        text data I want
    </div>
</div>

I want to extract the text marked "text data I want". Right now I can get it with a CSS or XPath selector by checking which structure is present, but I would like a better way. For example, I could select p.user-name with CSS first, then get its parent, then take that parent's text(). The data I want is always the text() of the immediate parent div of p.user-name. The question is: how can I get the immediate parent of p.user-name?

Answer 1:

With XPath you can traverse the document tree in every direction (parent, sibling, child, etc.), whereas CSS selectors cannot select a parent.
In your case you can get a node's parent with the XPath .. (parent) notation:

//p[@class='user-name']/../text()

Explanation:
//p[@class='user-name'] - find <p> nodes whose class attribute is exactly user-name.
/.. - select the node's parent (shorthand for parent::node()).
/text() - select the text nodes that are direct children of that parent.

This xpath should work in both of your described cases.



Answer 2:

What about using the following-sibling axis?

>>> s = scrapy.Selector(text='''<div class="user-info">
...     <p class="user-name">
...         something in p tag
...     </p>
...     text data I want
... </div>''')
>>> username = s.css('p.user-name')[0]
>>> username.xpath('following-sibling::text()[1]').get()
'\n    text data I want\n'
>>> 

>>> s2 = scrapy.Selector(text='''<div class="user-info">
...     <div>
...         <p class="user-img">
...             something in p tag
...         </p>
...         something in div tag
...     </div>
...     <div>
...         <p class="user-name">
...             something in p tag
...         </p>
...         text data I want
...     </div>
... </div>''')
>>> username = s2.css('p.user-name')[0]
>>> username.xpath('following-sibling::text()[1]').get()
'\n        text data I want\n    '
>>>