Question:

I am new to Scrapy and want to crawl some data from the web. The HTML documents I get look like the ones below.

DOM style 1:
<div class="user-info">
    <p class="user-name">
        something in p tag
    </p>
    text data I want
</div>

DOM style 2:
<div class="user-info">
    <div>
        <p class="user-img">
            something in p tag
        </p>
        something in div tag
    </div>
    <div>
        <p class="user-name">
            something in p tag
        </p>
        text data I want
    </div>
</div>

I want to extract the text "text data I want". Right now I can get it with a CSS or XPath selector by checking which of the two structures is present, but I would like a better way. For example, I could select p.user-name with CSS first, then get its parent, then take that parent's text(). The data I want is always the text() of the immediate parent div of p.user-name. The question is: how can I get the immediate parent of p.user-name?

Answer 1:

With XPath you can traverse the XML tree in every direction (parent, sibling, child, etc.), whereas CSS selectors don't support this.
In your case you can get a node's parent with XPath's .. parent notation:

//p[@class='user-name']/../text()

Explanation:
//p[@class='user-name'] - find <p> nodes whose class attribute is exactly user-name.
/.. - select the node's parent.
/text() - select the text nodes that are direct children of the current node (here, the parent).

This XPath should work in both of the cases you described.

Answer 2:

What about using the following-sibling axis?

>>> import scrapy
>>> s = scrapy.Selector(text='''<div class="user-info">
...     <p class="user-name">
...         something in p tag
...     </p>
...     text data I want
... </div>''')
>>> username = s.css('p.user-name')[0]
>>> username.xpath('following-sibling::text()[1]').get()
'\n    text data I want\n'
>>> 

>>> s2 = scrapy.Selector(text='''<div class="user-info">
...     <div>
...         <p class="user-img">
...             something in p tag
...         </p>
...         something in div tag
...     </div>
...     <div>
...         <p class="user-name">
...             something in p tag
...         </p>
...         text data I want
...     </div>
... </div>''')
>>> username = s2.css('p.user-name')[0]
>>> username.xpath('following-sibling::text()[1]').get()
'\n        text data I want\n    '
>>>