This is my first scraping project with Python/Scrapy. The site is http://pabigtrees.com/, with 78 pages and 20 items (trees) per page. Below is the full spider, with a few changes to make a minimal demonstration (it scrapes only one value per page):
import scrapy
from pabigtrees.items import Tree

class TreesSpider(scrapy.Spider):
    name = "trees"
    start_urls = ["http://pabigtrees.com/view_tree.aspx"]
    allowed_domains = ["pabigtrees.com"]
    download_delay = 2

    def parse(self, response):
        for page in [1, 11, 12]:
        #for page in range(1, 79):
            if page == 1:
                yield scrapy.FormRequest.from_response(
                    response,
                    #callback=self.parse_page
                    callback=self.parse_test
                )
            else:
                yield scrapy.FormRequest.from_response(
                    response,
                    formdata={
                        '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
                        '__EVENTARGUMENT': "Page$" + str(page),
                        'ctl00$ContentPlaceHolder1$genus_latin': '0',
                        'ctl00$ContentPlaceHolder1$genus_common': '0',
                        'ctl00$ContentPlaceHolder1$county': '0',
                        '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
                        '__VIEWSTATEGENERATOR': response.css('input#__VIEWSTATEGENERATOR::attr(value)').extract_first(),
                        '__SCROLLPOSITIONX': response.css('input#__SCROLLPOSITIONX::attr(value)').extract_first(),
                        '__SCROLLPOSITIONY': response.css('input#__SCROLLPOSITIONY::attr(value)').extract_first(),
                        '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first()
                    },
                    #callback=self.parse_page
                    callback=self.parse_test
                )

    def parse_test(self, response):
        yield {
            'county': response.xpath('//a[contains(@href,"Select$1")]/../../../td[5]/font/text()').extract_first()
        }

    def parse_page(self, response):
        for tree in range(0, 20):
            yield scrapy.FormRequest.from_response(
                response,
                formdata={
                    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
                    '__EVENTARGUMENT': "Select$" + str(tree)
                },
                # save the county from the list page because it is not
                # available on the detail page
                meta={'county': response.xpath('//a[contains(@href,"Select$' + str(tree) + '")]/../../../td[5]/font/text()').extract_first()},
                callback=self.parse_results
            )

    def parse_results(self, response):
        item = Tree()
        genus = response.css('span#ctl00_ContentPlaceHolder1_tree_genus::text').extract()
        species = response.css('span#ctl00_ContentPlaceHolder1_tree_species::text').extract()
        circumference = response.css('span#ctl00_ContentPlaceHolder1_lblcircum::text').extract()
        spread = response.css('span#ctl00_ContentPlaceHolder1_lblSpread::text').extract()
        height = response.css('span#ctl00_ContentPlaceHolder1_lblHeight::text').extract()
        points = response.css('span#ctl00_ContentPlaceHolder1_lblPoints::text').extract()
        address = response.css('span#ctl00_ContentPlaceHolder1_lblAddress::text').extract()
        crew = response.xpath('//td[text()="Measuring Crew: "]/following-sibling::td/text()').extract()
        nominator = response.xpath('//td[text()="Original Nominator: "]/following-sibling::td/text()').extract()
        comments = response.xpath('//td[text()="Comments: "]/following-sibling::td/text()').extract()
        gps = response.xpath('//td[text()="GPS Coordinates: "]/following-sibling::td/text()').extract()
        technique = response.css('span#ctl00_ContentPlaceHolder1_lblTech::text').extract()
        yearnominated = response.css('span#ctl00_ContentPlaceHolder1_lblNom::text').extract()
        yearlastmeasured = response.css('span#ctl00_ContentPlaceHolder1_lblMeasured::text').extract()
        item['a_county'] = response.meta['county']
        item['b_genus'] = genus
        item['c_species'] = species
        item['d_circumference'] = circumference
        item['e_spread'] = spread
        item['f_height'] = height
        item['g_points'] = points
        item['h_address'] = address
        item['i_crew'] = crew
        item['j_nominator'] = nominator
        item['k_comments'] = comments
        item['l_gps'] = gps
        item['m_technique'] = technique
        item['n_yearnominated'] = yearnominated
        item['o_yearlastmeasured'] = yearlastmeasured
        return item
The crawler works fine up through page 11. On page 12 and above, I get 500 errors. I believe it has something to do with the pagination, but as far as I can tell I am sending the correct __VIEWSTATE and related hidden fields. Here's the output:
(python3) Al-Green:pabigtrees Tony$ scrapy crawl trees -o trees.csv
2018-04-14 15:31:18 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: pabigtrees)
2018-04-14 15:31:18 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 05:52:31) - [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-04-14 15:31:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'pabigtrees', 'FEED_FORMAT': 'csv', 'FEED_URI': 'trees.csv', 'NEWSPIDER_MODULE': 'pabigtrees.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['pabigtrees.spiders']}
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-14 15:31:18 [scrapy.core.engine] INFO: Spider opened
2018-04-14 15:31:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-14 15:31:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-14 15:31:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://pabigtrees.com/robots.txt> (referer: None)
2018-04-14 15:31:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://pabigtrees.com/view_tree.aspx> (referer: None)
2018-04-14 15:31:30 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://pabigtrees.com/view_tree.aspx> (referer: http://pabigtrees.com/view_tree.aspx)
2018-04-14 15:31:30 [scrapy.core.scraper] DEBUG: Scraped from <200 http://pabigtrees.com/view_tree.aspx>
{'county': 'Dauphin'}
2018-04-14 15:31:33 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://pabigtrees.com/view_tree.aspx> (referer: http://pabigtrees.com/view_tree.aspx)
2018-04-14 15:31:33 [scrapy.core.scraper] DEBUG: Scraped from <200 http://pabigtrees.com/view_tree.aspx>
{'county': 'Delaware'}
2018-04-14 15:31:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://pabigtrees.com/view_tree.aspx> (failed 1 times): 500 Internal Server Error
2018-04-14 15:31:37 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://pabigtrees.com/view_tree.aspx> (failed 2 times): 500 Internal Server Error
2018-04-14 15:31:39 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <POST http://pabigtrees.com/view_tree.aspx> (failed 3 times): 500 Internal Server Error
2018-04-14 15:31:39 [scrapy.core.engine] DEBUG: Crawled (500) <POST http://pabigtrees.com/view_tree.aspx> (referer: http://pabigtrees.com/view_tree.aspx)
2018-04-14 15:31:39 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 http://pabigtrees.com/view_tree.aspx>: HTTP status code is not handled or not allowed
2018-04-14 15:31:39 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-14 15:31:39 [scrapy.extensions.feedexport] INFO: Stored csv feed (2 items) in: trees.csv
2018-04-14 15:31:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 134895,
'downloader/request_count': 7,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 5,
'downloader/response_bytes': 98019,
'downloader/response_count': 7,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/404': 1,
'downloader/response_status_count/500': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 4, 14, 19, 31, 39, 475017),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/500': 1,
'item_scraped_count': 2,
'log_count/DEBUG': 11,
'log_count/INFO': 9,
'memusage/max': 50180096,
'memusage/startup': 50176000,
'request_depth_max': 1,
'response_received_count': 5,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/500 Internal Server Error': 2,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6,
'start_time': datetime.datetime(2018, 4, 14, 19, 31, 18, 563326)}
2018-04-14 15:31:39 [scrapy.core.engine] INFO: Spider closed (finished)
I’m stumped, thanks!
The __VIEWSTATE is indeed what is causing you trouble. If you take a look at the pagination of the site you're trying to scrape, you'll see it only links to 10 other pages. Those are the only 10 links of this search you're allowed to access from the current page (with the current view state). The next 10 will be accessible from page 11 of the search.
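You can confirm this from within the spider. The GridView's pager links are javascript:__doPostBack(...) hrefs (the same place your own xpath finds the Select$N arguments), so a quick diagnostic like the following, dropped into any parse callback, lists which Page$ arguments a given response actually exposes:

    # which pages does this response's pager actually link to?
    pages = response.xpath('//a[contains(@href, "Page$")]/@href').re(r'Page\$\d+')
    self.logger.info("reachable pages: %s", pages)

On the first response that list will stop at Page$10 or so, which is why the request for page 12 (built from page 1's view state) gets rejected with a 500.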
One possible solution would be to check in parse_page() whether you're on page 11 (or 21, or 31, ...) and, if so, create the requests for the next 10 pages. Also, you only need to populate the formdata fields you want to change; FormRequest.from_response() will take care of the ones available in hidden input fields, such as __VIEWSTATE or __EVENTVALIDATION.
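For what it's worth, here is a minimal sketch of that chaining idea. The spider name and structure are mine and it is untested against the live site; the field names come from your question. Instead of batching ten requests per window, it simply requests each page from the response of the previous one, so every POST carries a view state that is valid for its target:

    import scrapy

    class ChainedTreesSpider(scrapy.Spider):
        # a minimal sketch, assuming the same form fields as the question
        name = "trees_chained"
        allowed_domains = ["pabigtrees.com"]
        start_urls = ["http://pabigtrees.com/view_tree.aspx"]
        download_delay = 2
        total_pages = 78  # per the question

        def parse(self, response):
            page = response.meta.get('page', 1)
            # ... scrape the 20 rows of the current page here ...

            next_page = page + 1
            if next_page <= self.total_pages:
                # from_response() copies __VIEWSTATE, __EVENTVALIDATION and
                # the other hidden inputs from THIS response, so only the
                # fields that change between pages appear in formdata
                yield scrapy.FormRequest.from_response(
                    response,
                    formdata={
                        '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
                        '__EVENTARGUMENT': "Page$" + str(next_page),
                    },
                    meta={'page': next_page},
                    callback=self.parse,
                )

The batch-of-10 variant is the same pattern, just yielding requests for pages 12-21 from page 11's response, pages 22-31 from page 21's, and so on.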