Using Scrapy to extract an XHR request?

Posted 2019-07-02 16:57

I'm trying to scrape social like counts that are generated with JavaScript. I am able to scrape the desired data if I hard-code the XHR URL, but the site I am trying to scrape dynamically generates these XMLHttpRequests with query string parameters that I do not know how to extract.

For example, you can see that the m, p, i, and g parameters, which are unique to each page, are used to construct the request URL.

[Screenshot: Query String Parameters]

Here is the assembled URL:

http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829

...which returns this JSON:

{"twitter":13325,"facebook":23481,"googleplusone":964,"disqus":272}

Using the following script, I am able to extract the desired data (in this case, the Twitter count) from the request URL I just mentioned, but only for that specific page.

import scrapy

from aeon.items import AeonItem
import json
from scrapy.http.request import Request

class AeonSpider(scrapy.Spider):
    name = "aeon"
    allowed_domains = ["aeon.co"]
    start_urls = [
        "http://aeon.co/magazine/technology"
    ]

    def parse(self, response):
        for sel in response.xpath('//*[@id="latestPosts"]/div/div/div'):
            item = AeonItem()
            item['title'] = sel.xpath('./a/p[1]/text()').extract()
            item['primary_url'] = sel.xpath('./a/@href').extract()
            item['word_count'] = sel.xpath('./a/div/span[2]/text()').extract()

            for each in item['primary_url']:
                # Hard-coded XHR URL -- this only works for the Musk interview page
                yield Request('http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829',
                              callback=self.parse_XHR_data, meta={'item': item})

    def parse_XHR_data(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        item = response.meta['item']
        item["tw_count"] = jsonresponse["twitter"]
        yield item
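
Rather than hard-coding the whole URL as above, the request could be assembled with urllib.urlencode once the parameters are known. A minimal sketch in Python 2 (to match the rest of the code); build_social_url is just an illustrative helper name, and the values passed in are the ones from the example page:

from urllib import urlencode

def build_social_url(page_url, m, p, i, g):
    # Assemble the social.php query string from the per-page parameters.
    # Note: urlencode percent-encodes the embedded URLs, unlike the raw URL shown above.
    qs = urlencode([('url', page_url), ('m', m), ('p', p), ('i', i), ('g', g)])
    return 'http://aeon.co/magazine/social/social.php?' + qs

# Example values taken from the request shown earlier
build_social_url('http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/',
                 '1385983411', '1412056831', '25829', 'http://aeon.co/magazine/?p=25829')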

So my question is: how can I extract the m, p, i, and g query parameters so that I can dynamically construct the request URL (rather than hard-coding it as shown above)?

1 Answer
Answered 2019-07-02 17:18

This is how you can extract the parameters from your URL:

import urlparse
url = 'http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829'

parsed_url = urlparse.parse_qs(urlparse.urlparse(url).query)

for p in parsed_url:
    print p + '=' + parsed_url[p][0]

and output:

>> python test.py
url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/
p=1412056831
m=1385983411
i=25829
g=http://aeon.co/magazine/?p=25829
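
This shows how to pull the parameters out of a URL you already have; to build the request dynamically you still need to find m, p, i, and g on each article page. A hedged sketch of one possible approach, assuming (not confirmed by the post) that each article page embeds the assembled social.php URL somewhere in its HTML, e.g. in an inline script; the regex and helper name are illustrative only:

import re
import urlparse

# Assumption: the article HTML contains an assembled social.php URL somewhere.
SOCIAL_RE = re.compile(r'social\.php\?[^"\'\s]+')

def extract_social_params(html):
    """Return the url, m, p, i, g query parameters as a dict, or None if not found."""
    match = SOCIAL_RE.search(html)
    if match is None:
        return None
    query = match.group(0).split('?', 1)[1]
    params = urlparse.parse_qs(query)
    return {k: v[0] for k, v in params.items()}

In the spider, a per-article callback could call this on response.body, pass the values to something like the build_social_url sketch above, and yield the Request with callback=self.parse_XHR_data as in the original code.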