I'm trying to scrape social "like" counts that are generated with JavaScript. I am able to scrape the desired data if I hard-code the XHR URL, but the site I am trying to scrape generates these XMLHttpRequests dynamically, with query string parameters that I do not know how to extract.
For example, the m, p, i, and g parameters, which are unique to each page, are used to construct the request URL.
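Here is the assembled URL:
http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829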
...which returns this JSON:
{"twitter":13325,"facebook":23481,"googleplusone":964,"disqus":272}
Using the following script, I am able to extract the desired data (in this case the Twitter count) from the request URL I just mentioned, but only for that specific page.
import scrapy
from aeon.items import AeonItem
import json
from scrapy.http.request import Request


class AeonSpider(scrapy.Spider):
    name = "aeon"
    allowed_domains = ["aeon.co"]
    start_urls = [
        "http://aeon.co/magazine/technology"
    ]

    def parse(self, response):
        for sel in response.xpath('//*[@id="latestPosts"]/div/div/div'):
            item = AeonItem()
            item['title'] = sel.xpath('./a/p[1]/text()').extract()
            item['primary_url'] = sel.xpath('./a/@href').extract()
            item['word_count'] = sel.xpath('./a/div/span[2]/text()').extract()
            for each in item['primary_url']:
                # Hard-coded XHR URL: only correct for this one article
                yield Request(
                    "http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829",
                    callback=self.parse_XHR_data,
                    meta={'item': item},
                )

    # Name must match the callback passed to Request above
    def parse_XHR_data(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        item = response.meta['item']
        item["tw_count"] = jsonresponse["twitter"]
        yield item
So my question is: how can I extract the m, p, i and g URL query parameters so that I can build the request URL dynamically (rather than hard-coding it as shown above)?
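For reference, once the four values are known (however they end up being read from the page), assembling the request URL is the easy part. A minimal sketch, assuming Python 3; `build_social_url` and `SOCIAL_ENDPOINT` are hypothetical names of my own, and I am assuming the server accepts the percent-encoded form that `urlencode` produces.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the hard-coded request URL in the spider
SOCIAL_ENDPOINT = "http://aeon.co/magazine/social/social.php"


def build_social_url(page_url, m, p, i, g):
    """Assemble the XHR URL from per-page values (hypothetical helper)."""
    # urlencode percent-encodes the nested URLs so they survive as single values
    query = urlencode({"url": page_url, "m": m, "p": p, "i": i, "g": g})
    return SOCIAL_ENDPOINT + "?" + query


xhr_url = build_social_url(
    "http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/",
    m=1385983411,
    p=1412056831,
    i=25829,
    g="http://aeon.co/magazine/?p=25829",
)
print(xhr_url)
```

In the spider, the result of `build_social_url(...)` would be passed to `Request` in place of the hard-coded string.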
This is how you can extract your URL:
and output: