XPath syntax: How to get the child div information

Posted 2019-07-23 08:27

The result from my Scrapy project looks like this:

<div class="news_li">...</div>
<div class="news_li">...</div>
<div class="news_li">...</div>
...
<div class="news_li">...</div>

Each "news_li" div looks like this:

<div class="news_li">
    <div class="a">
        <a href="aaa">
            <div class="a1"></div>
        </a>
    </div>
    <a href="xxx">
        <div class="b">
            <div class="b1"></div>
            <div class="b2"></div>
            <div class="b3"></div>
        </div>
    </a>
</div>

I am trying to extract the information one item at a time in the Scrapy shell with the following commands:

response.xpath("//div[@class='news_li']")[0].xpath("//div[@class='a1']").extract()
response.xpath("//div[@class='news_li']/descendant::div[@class='a1']").extract()

But these commands return every "a1" div, including those from all the other "news_li" divs.

I have 2 questions:

  1. How do I get the child div information one "news_li" at a time?

  2. How do I get <a href="aaa"> </a> and <a href="xxx"> </a> separately? (The difference is that the first one is wrapped in a parent div and the second one is on its own.)

Many thanks in advance.

Edit: To be specific, how can I extract information relative to a given parent/root node? I looked up XPath axes and tried 'descendant', but it does not work.

3 Answers
啃猪蹄的小仙女
#2 · 2019-07-23 09:06

Here's what you can try

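response.xpath("(//div[@class='news_li'])[1]").xpath(".//div[@class='a1']").extract()

Put the index directly in the XPath. XPath positions start at 1, so [1] selects the first "news_li" div and [0] would match nothing. The chained expression starts with .// so that it stays inside that div.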

Summer. ? 凉城
#3 · 2019-07-23 09:18

Try the following.

# first link
response.xpath("(//div[@class='news_li']//a)[1]").extract()
# second link
response.xpath("(//div[@class='news_li']//a)[2]").extract()

Edit 1:

# change the X value in the XPath below to get the first link
# (the one that wraps the a1 div)
//div[@class='news_li'][X]/descendant::div[@class='a1']/parent::a

# change the X value in the XPath below to get the second link
# (the direct link), identified by its child div
//div[@class='news_li'][X]/descendant::a[div[@class='b']]
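
As a rough illustration of telling the two links apart by where they sit in the tree, here is a minimal sketch. It uses Python's standard-library ElementTree as a stand-in for Scrapy's selectors (an assumption made so the snippet runs without Scrapy; ElementTree supports only a subset of XPath, so the paths are simpler than the ones above). The markup is one "news_li" block from the question.

```python
import xml.etree.ElementTree as ET

# One news_li block, following the markup in the question.
ITEM = """
<div class="news_li">
  <div class="a"><a href="aaa"><div class="a1"/></a></div>
  <a href="xxx"><div class="b"><div class="b1"/><div class="b2"/><div class="b3"/></div></a>
</div>
"""

item = ET.fromstring(ITEM)

# Link wrapped in <div class="a">: step through that div first.
wrapped_link = item.find("./div[@class='a']/a")
# Link that is a direct child of news_li: a single child step.
direct_link = item.find("./a")

print(wrapped_link.get("href"))  # aaa
print(direct_link.get("href"))   # xxx
```

In Scrapy, the corresponding relative expressions on a selected "news_li" selector should be ./div[@class='a']/a/@href and ./a/@href.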
Juvenile、少年°
#4 · 2019-07-23 09:26

It is very likely that when combining XPath expressions like so:

response.xpath("//div[@class='news_li']")[0].xpath("//div[@class='a1']").extract()

if the second expression starts with a double slash //, then elements are selected anywhere in the document, regardless of what was selected before. Put another way: even if the first expression:

//div[@class='news_li']

selects only div elements with a certain class attribute, the next one:

//div[@class='a1']

selects all div elements where @class='a1' in the entire document. That seems to be your problem.

Solution: Use a relative path

One possible solution is to use a relative path expression that does not start with //:

response.xpath("//div[@class='news_li']")[0].xpath(".//div[@class='a1']").extract()
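
To see the scoping effect in isolation, here is a small sketch that runs without Scrapy. It uses the standard-library ElementTree as a stand-in for Scrapy's selectors, and the sample page (two "news_li" blocks, the hrefs "bbb"/"yyy", and the text inside a1) is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: two news_li blocks with made-up text in a1.
PAGE = """
<body>
  <div class="news_li">
    <div class="a"><a href="aaa"><div class="a1">first a1</div></a></div>
    <a href="xxx"><div class="b"><div class="b1"/></div></a>
  </div>
  <div class="news_li">
    <div class="a"><a href="bbb"><div class="a1">second a1</div></a></div>
    <a href="yyy"><div class="b"><div class="b1"/></div></a>
  </div>
</body>
"""

root = ET.fromstring(PAGE)
items = root.findall(".//div[@class='news_li']")

# The leading "." anchors the search at items[0], so only its own a1 matches.
first_item_a1 = [d.text for d in items[0].findall(".//div[@class='a1']")]
# Searching from the document root instead matches every a1 on the page.
all_a1 = [d.text for d in root.findall(".//div[@class='a1']")]

print(first_item_a1)  # ['first a1']
print(all_a1)         # ['first a1', 'second a1']
```

The second result is what the original command effectively asked for: a search over the whole document instead of a search inside the selected div.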

General remarks

Depending on the structure of your actual documents and if you can make certain assumptions, better solutions may be possible.

Also, in general, to process results "one at a time", you should

  • write an XPath expression that selects all of the desired elements and returns them as a list
  • process each item in this list individually, for example with Python code
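
The two steps above can be sketched as follows. This is only an illustration under assumptions: it uses the standard-library ElementTree in place of Scrapy's selectors, the function name collect_items is hypothetical, and the sample page (second block, a1 text) is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the hrefs "bbb"/"yyy" and the a1 text are made up.
PAGE = """
<body>
  <div class="news_li">
    <div class="a"><a href="aaa"><div class="a1">one</div></a></div>
    <a href="xxx"><div class="b"/></a>
  </div>
  <div class="news_li">
    <div class="a"><a href="bbb"><div class="a1">two</div></a></div>
    <a href="yyy"><div class="b"/></a>
  </div>
</body>
"""


def collect_items(page):
    """Return one dict per news_li block, in document order."""
    root = ET.fromstring(page)
    rows = []
    # Step 1: select every news_li block as a list.
    for item in root.findall(".//div[@class='news_li']"):
        # Step 2: query relative to the current block only.
        a1 = item.find(".//div[@class='a1']")
        direct = item.find("./a")
        rows.append({
            "a1": a1.text if a1 is not None else None,
            "direct_href": direct.get("href") if direct is not None else None,
        })
    return rows


rows = collect_items(PAGE)
for row in rows:
    print(row)
```

In a Scrapy callback the same pattern would loop over response.xpath("//div[@class='news_li']") and call item.xpath(".//div[@class='a1']") on each selector in turn.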