Crawling dynamic content with scrapy

Posted 2019-07-15 04:21

Question:

I am trying to get the latest reviews from the Google Play store. I am following the approach from a linked question about fetching the latest reviews.

The method described in that question's answer works fine in the Scrapy shell, but when I try it in my crawler it is completely ignored.

Code snippet:

import re
import sys
import time
import urllib
import urlparse

from scrapy import Spider
from scrapy.spider import BaseSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

from play.items import PlayApp

class PlaySpider(CrawlSpider):
    name = "play"
    allowed_domains = ["play.google.com"]
    start_urls = [
            "https://play.google.com/store/apps"
        ]

    rules = (
        Rule(LxmlLinkExtractor(allow=('/store/apps$', )), callback='parseCategory',follow=True),
    )

    def parseCategory(self, response):
        """
            gets categories from store home page call parseLinks for each category
        """
        #something here......
        yield Request(categoryapps, callback=self.parseLinks)

    def parseLinks(self, response):

        '''
        get all the links from the category page and then 
        passes individual links to the parseApp function.
        '''    
        #something here

        yield Request(link, callback=self.parseApp)

    def parseApp(self, response):

        '''
        parses apps page to get info about the app
        '''

        #application page parsing ......        

        frmdata = {"id": "com.supercell.boombeach", "reviewType": '0', "reviewSortOrder": '0', "pageNum":'0'}
        url = "https://play.google.com/store/getreviews"
        yield FormRequest(url, callback=self.parse_data, formdata=frmdata)

        yield app

    def parse_data(self, response):
        # do stuff with data...
        print '\n\n---------------I am here------------------\n\n'

The parse_data function is never called. I asked about this on the #scrapy IRC channel and in a few other places but got no help. Please help me with this.

This is the DEBUG output in the terminal:

DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=isoft.studios.ncert.ncertbooks)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=af.hindi.stories.booktwo)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.frozenex.latestnewsms)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.aqua.apps.english.hindi.dictionary)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.merriamwebster)
2015-06-03 13:56:08+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=an.HindiTranslate)

So the POST request is indeed being sent, but the callback method is not called.

Answer 1:

It seems you are not changing the id in the form data.

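import json
import re

from scrapy.http import FormRequest
from scrapy.selector import Selector

def parseApp(self, response):
    # Collect the unique app links found on the page.
    apps = list(set(response.xpath('//a[@class="card-click-target"]/@href').extract()))
    url = "https://play.google.com/store/getreviews"
    for app in apps:
        # str.strip() removes a set of characters, not a prefix, so take
        # everything after "id=" instead.
        _id = app.split('id=')[-1]
        form_data = {"id": _id, "reviewType": '0', "reviewSortOrder": '0', "pageNum": '0'}
        # To throttle requests, set DOWNLOAD_DELAY in the settings rather
        # than calling sleep(), which would block the whole crawler.
        yield FormRequest(url=url, formdata=form_data, callback=self.parse_data)

def parse_data(self, response):
    # The name must match the callback passed to FormRequest above.
    response_data = re.findall(r"\[\[.*", response.body)
    if response_data:
        try:
            text = json.loads(response_data[0] + ']')
            sell = Selector(text=text[0][2])
        except (ValueError, IndexError):
            return
        # do whatever you want to extract using sell.xpath('YOUR_XPATH_HERE')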
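
One thing to watch when deriving the id from a details link: str.strip treats its argument as a set of characters to trim from both ends, not as a prefix, so some ids lose letters. The following is a minimal sketch of the difference, using one of the hrefs from the log in the question. The helper name package_id is hypothetical, and the sketch is written for Python 3 (on Python 2 the same functions live in the urlparse module).

```python
from urllib.parse import parse_qs, urlparse


def package_id(href):
    """Return the value of the ``id`` query parameter, or None if absent."""
    values = parse_qs(urlparse(href).query).get("id")
    return values[0] if values else None


href = "/store/apps/details?id=an.HindiTranslate"

# strip() trims any of the given characters from both ends, so it also
# removes letters that belong to the id itself.
stripped = href.strip("/store/apps/details?id=")

# Parsing the query string keeps the id intact.
parsed = package_id(href)

print(stripped)  # n.HindiTran
print(parsed)    # an.HindiTranslate
```

Reading the id from the query string also keeps working if the link carries extra parameters after the id.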

After cleaning the data, a sample review will look something like this:

<div class="single-review">
    <a href="/store/people/details?id=106726831005267540508">
        <img class="author-image" alt="Lorence Gerona avatar image" src="https://lh3.googleusercontent.com/uFp_tsTJboUY7kue5XAsGA=w48-c-h48">
    </a>
    <div class="review-header" data-expand-target="" data-reviewid="gp:AOqpTOHnsExa_P6JFRJD6HF5h71fpY91tNaEODjtfiTu-zPFki9ZnYsNp1HEcGFpGEfu9xqwJL_j-03Tx0e9lw">
        <div class="review-info">
            <span class="author-name">
                <a href="/store/people/details?id=106726831005267540508">Lorence Gerona</a>
            </span>
            <span class="review-date">3 June 2015</span>
            <a class="reviews-permalink" href="/store/apps/details?id=com.supercell.boombeach&amp;reviewId=Z3A6QU9xcFRPSG5zRXhhX1A2SkZSSkQ2SEY1aDcxZnBZOTF0TmFFT0RqdGZpVHUtelBGa2k5Wm5Zc05wMUhFY0dGcEdFZnU5eHF3Skxfai0wM1R4MGU5bHc" title="Link to this review"></a> <div class="review-source" style="display:none">

        </div>
        <div class="review-info-star-rating">
            <div class="tiny-star star-rating-non-editable-container" aria-label="Rated 5 stars out of five stars">
                <div class="current-rating" style="width: 100%;">

                </div>
            </div>
        </div>
    </div>
    <div class="rate-review-wrapper">
        <div class="play-button icon-button small rate-review" title="Spam" data-rating="SPAM">
            <div class="icon spam-flag"></div>
        </div>
        <div class="play-button icon-button small rate-review" title="Helpful" data-rating="HELPFUL">
            <div class="icon thumbs-up"></div>
        </div>
        <div class="play-button icon-button small rate-review" title="Unhelpful" data-rating="UNHELPFUL"> <div class="icon thumbs-down"></div>
    </div>
</div>
</div>
<div class="review-body">
<span class="review-title">Team BOOM BEACH</span>
Amazing game I can defeat hammerman
<div class="review-link" style="display:none">
    <a class="id-no-nav play-button tiny" href="#" target="_blank">Full Review</a>
</div>
</div>
</div>
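
As a rough illustration of what can be pulled out of markup like this, here is a minimal sketch that reads the author, date, title and rating from a trimmed copy of the sample. It uses the standard library's html.parser instead of the Scrapy Selector so that it runs on its own; the ReviewParser class and the SAMPLE constant are hypothetical names. Inside the spider, the same fields would more naturally be read with XPath on the selector, for example sell.xpath('//span[@class="review-date"]/text()').

```python
from html.parser import HTMLParser

# Trimmed copy of the sample review markup shown above.
SAMPLE = """
<div class="review-info">
    <span class="author-name">
        <a href="/store/people/details?id=106726831005267540508">Lorence Gerona</a>
    </span>
    <span class="review-date">3 June 2015</span>
</div>
<div class="tiny-star star-rating-non-editable-container" aria-label="Rated 5 stars out of five stars"></div>
<div class="review-body">
    <span class="review-title">Team BOOM BEACH</span>
    Amazing game I can defeat hammerman
</div>
"""


class ReviewParser(HTMLParser):
    """Collects the text of <span> elements whose class is of interest."""

    WANTED = {"author-name", "review-date", "review-title"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # class of the span currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span":
            for cls in classes:
                if cls in self.WANTED:
                    self._current = cls
                    self.fields[cls] = ""
        # The star rating is only exposed through the aria-label attribute.
        if "star-rating-non-editable-container" in classes:
            self.fields["rating"] = attrs.get("aria-label")

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data.strip()


parser = ReviewParser()
parser.feed(SAMPLE)
review = parser.fields
print(review)
```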