Adding Headers to Scrapy Spider

Posted 2020-07-22 16:10

For a project, I am running a large number of Scrapy requests for certain search terms. These requests use the same search terms but different time horizons, as shown by the dates in the URLs below.

Despite the different dates and pages the URLs refer to, I am receiving the same value as output for all requests. It appears that the script takes the first value obtained and assigns the same output to all subsequent requests.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2004%2Ccd_max%3A12%2F31%2F2004&tbm=nws',
                  'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2005%2Ccd_max%3A12%2F31%2F2005&tbm=nws',
                  'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2006%2Ccd_max%3A12%2F31%2F2006&tbm=nws',
    ]

    def parse(self, response):
        item = {
            'search_title': response.css('input#sbhost::attr(value)').get(),
            'results': response.css('#resultStats::text').get(),
            'url': response.url,
        }
        yield item

I have found a thread discussing a similar problem with BeautifulSoup. The solution there was to add headers to the request so that it uses a browser User-Agent:

import requests

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}
payload = {'as_epq': 'James Clark', 'tbs': 'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm': 'nws'}
r = requests.get("https://www.google.com/search", params=payload, headers=headers)

The approach to applying headers in Scrapy seems to be different, though. Does anyone know how best to include them in Scrapy, particularly with reference to start_urls, which contains several URLs at once?

Tags: python scrapy
2 Answers
Deceive · 2020-07-22 16:46

As per the Scrapy 1.7.3 documentation, your headers shouldn't be generic; they should match the headers the site you are scraping actually expects. You can find them in the Network tab of your browser's developer console.

Add them as shown below and print the response.

# -*- coding: utf-8 -*-
import scrapy
#import json  # uncomment if the response is JSON

class AaidSpider(scrapy.Spider):
    name = 'aaid'

    def start_requests(self):
        url = "https://www.eventscribe.com/2019/AAOMS-CSIOMS/ajaxcalls/PresenterInfo.asp?efp=SVNVS1VRTEo4MDMx&PresenterID=597498&rnd=0.8680339"

        # Set the headers here (copied from the browser's Network tab).
        headers = {
            'Accept': '*/*',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
            'Host': 'www.eventscribe.com',
            'Referer': 'https://www.eventscribe.com/2018/ADEA/speakers.asp?h=Browse%20By%20Speaker',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
            'X-Requested-With': 'XMLHttpRequest'
        }

        # Send the request with the custom headers; yield it so Scrapy schedules it.
        yield scrapy.Request(url, method='GET', headers=headers, dont_filter=False, callback=self.parse)

    def parse(self, response):
        print(response.body)  # If the response is HTML

        # If the response is JSON:
        # jsonresponse = json.loads(response.body_as_unicode())
        # print(jsonresponse)
疯言疯语 · 2020-07-22 16:56

You don't need to modify the full set of headers here. You only need to set the User-Agent, which Scrapy lets you do directly on the spider:

import scrapy

class QuotesSpider(scrapy.Spider):
    # ...
    user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'
    # ...

Now you'll get output like:

'results': 'About 357 results', ...
'results': 'About 215 results', ...
'results': 'About 870 results', ...
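
If you later need to send full custom headers (not just the User-Agent) with every URL in start_urls, you can override start_requests and attach the same headers to each request. Here is a minimal sketch reusing the spider from the question; the headers dict is illustrative, and only the User-Agent is needed for this particular case:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['google.com']
    start_urls = [
        # ... the same dated Google News search URLs as in the question ...
    ]

    # Illustrative headers; for this case only the User-Agent matters.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    }

    def start_requests(self):
        # Build one request per URL in start_urls, all carrying the same headers.
        for url in self.start_urls:
            yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        yield {
            'search_title': response.css('input#sbhost::attr(value)').get(),
            'results': response.css('#resultStats::text').get(),
            'url': response.url,
        }

Setting the user_agent attribute as shown above is enough here, though; Scrapy also has a project-wide USER_AGENT setting in settings.py if you prefer to configure it there.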