I'm starting to use Scrapy and XPath to scrape some pages. I'm just trying simple things in IPython, and I get a response on some sites such as IMDB, but when I try others like www.bbb.org I always get an empty list. This is what I'm doing:
scrapy shell 'http://www.bbb.org/central-western-massachusetts/business-reviews/auto-repair-and-service/toms-automotive-in-fitchburg-ma-211787'
The paragraph I want to extract is:
"BBB Accreditation
A BBB Accredited Business since 02/12/2010
BBB has determined that Tom's Automotive meets BBB accreditation standards, which include a commitment to......"
The XPath of this paragraph is:
'//*[@id="business-accreditation-content"]/p[2]'
So I use:
data = response.xpath('//*[@id="business-accreditation-content"]/p[2]').extract()
But data is an empty list. I'm getting the XPath from Chrome, and it works on other pages, but here I get nothing regardless of which part of the page I try.
The website actually checks for the User-Agent header. See what it returns if you don't specify it:
Yes, that's right: the response contains only 123 if the request comes with an unexpected user agent. The fix is to send the header yourself. In the shell, pass a browser-like user agent with the -s command-line argument (-s USER_AGENT='...').
That was an example from the shell. In a real Scrapy project, you would need to set the USER_AGENT project setting. Alternatively, you can rotate user agents with the help of the scrapy-fake-useragent middleware.
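As a rough sketch of the project-level configuration described above, a settings.py fragment might look like the following. The user agent string is only an illustrative assumption (any current browser string should do), and the middleware paths follow what the scrapy-fake-useragent README has documented, so check them against the version you install.

```python
# settings.py (sketch)

# Assumption: an example browser user agent string, not one taken from the
# original answer. Replace it with whatever browser string you prefer.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

# Optional: rotate user agents with scrapy-fake-useragent instead of using
# one fixed string. Assumption: these dotted paths match the installed
# versions of Scrapy and scrapy-fake-useragent.
DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in user agent middleware...
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # ...and let the rotating middleware set the header instead.
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
}
```

If you only need one fixed user agent, the USER_AGENT line alone should be enough, and the DOWNLOADER_MIDDLEWARES block can be left out.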