For a project, I am running a large number of Scrapy requests for certain search terms. The requests use the same search terms but different time horizons, as reflected in the dates in the URLs below.
Despite the different dates and the different pages the URLs point to, I receive the same value as output for every request. It appears that the script takes the first value obtained and assigns that same output to all subsequent requests.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['google.com']
    start_urls = [
        'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2004%2Ccd_max%3A12%2F31%2F2004&tbm=nws',
        'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2005%2Ccd_max%3A12%2F31%2F2005&tbm=nws',
        'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2006%2Ccd_max%3A12%2F31%2F2006&tbm=nws',
    ]

    def parse(self, response):
        item = {
            'search_title': response.css('input#sbhost::attr(value)').get(),
            'results': response.css('#resultStats::text').get(),
            'url': response.url,
        }
        yield item
I have found a thread discussing a similar problem with BeautifulSoup. The solution there was to add headers to the request so that it uses a browser User-Agent:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'
}
payload = {'as_epq': 'James Clark', 'tbs': 'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm': 'nws'}
r = requests.get("https://www.google.com/search", params=payload, headers=headers)
The approach to applying headers in Scrapy seems to be different, though. Does anyone know how best to include them in Scrapy, particularly with respect to start_urls, which contains several URLs at once?
As per the Scrapy 1.7.3 documentation, your headers shouldn't be generic; they should match the headers your browser sends to the site you are scraping. You can find those headers in your browser's developer console, on the Network tab.
Add them as shown in the sketch below and print the response.
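A minimal sketch of that idea, assuming headers copied from the browser's network tab (the header values and the extra Accept fields here are placeholders, not the answerer's original snippet):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['google.com']

    # Placeholder headers -- replace them with the ones your browser actually
    # sends to google.com (visible in the developer console's Network tab).
    custom_headers = {
        'User-Agent': ('Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
    }

    start_urls = [
        'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2004%2Ccd_max%3A12%2F31%2F2004&tbm=nws',
        # ... the other dated URLs from the question ...
    ]

    def start_requests(self):
        # Attach the headers to every request explicitly instead of relying
        # on Scrapy's default request headers.
        for url in self.start_urls:
            yield scrapy.Request(url, headers=self.custom_headers, callback=self.parse)

    def parse(self, response):
        print(response.text)  # print the raw response to confirm the page is served correctly
        yield {
            'search_title': response.css('input#sbhost::attr(value)').get(),
            'results': response.css('#resultStats::text').get(),
            'url': response.url,
        }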
You don't need to modify all the headers here; you only need to set the user agent, which Scrapy allows you to do directly.
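For example, one way to do that (a sketch, not the answerer's exact code) is through the spider's custom_settings, so every request the spider makes uses the same user agent:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['google.com']

    # USER_AGENT is a standard Scrapy setting; defining it here overrides the
    # default "Scrapy/x.y" agent for every request this spider sends.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'),
    }

    start_urls = [
        'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2004%2Ccd_max%3A12%2F31%2F2004&tbm=nws',
        # ... the other dated URLs from the question ...
    ]

    def parse(self, response):
        yield {
            'search_title': response.css('input#sbhost::attr(value)').get(),
            'results': response.css('#resultStats::text').get(),
            'url': response.url,
        }

The same USER_AGENT setting can also go in the project's settings.py if it should apply to every spider in the project.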
Now you'll get distinct output for each request instead of the same value repeated.