Struggling with XPath expression for Scrapy

Posted 2019-06-02 19:28

Below is part of an HTML page (all the parameter names are in Russian). It has a main class and two inner classes. The HTML in detail:

    <div class="obj-params">
            <div class="wrap">
                <div class="obj-params-col" style="min-width:50%;">
                      <p>
                         <b>Param1_name</b>" Param1_value"</p>
                      <p>
                         <strong>Param2_name</strong>" Param2_value</p>
                      <p>
                         <strong>Param3_name</strong>" Param3_value"</p>
                </div>
              </div>
            <div class="wrap">
                <div class="obj-params-col">
                    <p>
                       <b>Param4_name</b>Param4_value</p>
                <div class="inline-popup popup-hor left">
                   <b>Param5_name</b>
                      <a target="_blank" href="link">Param5_value</a></div></div>

I would like to extract each of the Param%d_value values. How can I do that with XPath?

I have tried the following expression, which correctly extracts the text of the link:

    //div[@class="inline-popup popup-hor left"]/a/text()

However, this expression gives me a single list of all the Param%d_value entries instead of organizing them by parameter:

    //div[@class="obj-params"]/div[@class="obj-params-col"]/p/text()

The question is: how can I construct a separate XPath expression for each param value? For example, when I use the following XPath expression

    //div[@class="obj-params"]//div[@class="obj-params-col"]/p/child::text()

I get every value in one list:

    ['Param1_value, Param2_value, Param3_value, Param1_value, Param2_value, Param3_value, Param1_value, Param2_value, Param3_value']

What I need to get is the following:

XPath_expression_to_extract_only_Param1_value:

    ['Param1_value, Param1_value, Param1_value, Param1_value, Param1_value, Param1_value, Param1_value, Param1_value, Param1_value']

XPath_expression_to_extract_only_Param2_value:

    ['Param2_value, Param2_value, Param2_value, Param2_value, Param2_value, Param2_value, Param2_value, Param2_value, Param2_value']

XPath_expression_to_extract_only_Param3_value:

    ['Param3_value, Param3_value, Param3_value, Param3_value, Param3_value, Param3_value, Param3_value, Param3_value, Param3_value']

2 answers
叼着烟拽天下
#2 · 2019-06-02 19:44

You can use child::text() to get the text nodes of the p elements inside the div with the obj-params-col class:

//div[@class="obj-params"]//div[@class="obj-params-col"]/p/child::text()

Demo (using xmllint):

$ xmllint index.html --xpath '//div[@class="obj-params"]//div[@class="obj-params-col"]/p/child::text()'
" Param1_value"
" Param2_value
" Param3_value"

UPDATE:

If you need to get a param value by its param name, use:

//*[text()="Param1_name"]/following-sibling::text()
小情绪 Triste *
#3 · 2019-06-02 20:01
sel.xpath('//*[contains(./text(),"Param1_name")]/following-sibling::text()').extract()
sel.xpath('//*[contains(./text(),"Param2_name")]/following-sibling::text()').extract()
sel.xpath('//*[contains(./text(),"Param3_name")]/following-sibling::text()').extract()