The result from my Scrapy project looks like this:
<div class="news_li">...</div>
<div class="news_li">...</div>
<div class="news_li">...</div>
...
<div class="news_li">...</div>
Each "news_li" div looks like this:
<div class="news_li">
    <div class="a">
        <a href="aaa">
            <div class="a1"></div>
        </a>
    </div>
    <a href="xxx">
        <div class="b">
            <div class="b1"></div>
            <div class="b2"></div>
            <div class="b3"></div>
        </div>
    </a>
</div>
I am trying to extract the information one item at a time in the Scrapy shell with the following commands:
response.xpath("//div[@class='news_li']")[0].xpath("//div[@class='a1']").extract()
response.xpath("//div[@class='news_li']/descendant::div[@class='a1']").extract()
But these commands return the "a1" divs from all of the "news_li" divs, not just one.
I have two questions:
1. How do I get the child div information for one "news_li" div at a time?
2. How do I get
<a href="aaa"> </a> and <a href="xxx"> </a>
separately? (The difference is that the first one is wrapped in a parent div and the second one is not.)
Many thanks in advance.
Edit: To be specific, how can I extract information relative to a given parent/root node? I looked up XPath axes and tried 'descendant', but it did not work.
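Here's what you can try:
Put the index directly in the XPath expression instead of indexing the selector list. Note that XPath positions are 1-based, so the first match is [1], not [0].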
Edit 1:
When you chain XPath expressions like so:
response.xpath("//div[@class='news_li']")[0].xpath("//div[@class='a1']")
and the second expression starts with a double slash (//), elements are selected anywhere in the document, regardless of what was selected before. Put another way: even if the first expression, //div[@class='news_li'], selects only div elements with a certain class attribute, the next one, //div[@class='a1'], selects all div elements where @class='a1' in the entire document. That seems to be your problem.
Solution: Use a relative path
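One possible solution is to use a relative path expression that does not start with //, for instance one that starts with .// so that it is evaluated relative to the node already selected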
General remarks
Depending on the structure of your actual documents and on whether you can make certain assumptions about them, better solutions may be possible.
Also, in general, to process results "one at a time", you should