Trying to automatically grab the search results from a public search, but running into some trouble. The URL is of the form
http://www.website.com/search.aspx?keyword=#&&page=1&sort=Sorting
As I click through the pages after visiting that first page, the URL changes slightly to
http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page=2
The problem is that if I try to visit the second link directly, without first visiting the first link, I am redirected back to the first link. My current attempt at getting around this is to define a long list of start_urls in Scrapy:
from scrapy.spider import BaseSpider

class websiteSpider(BaseSpider):
    name = "website"
    allowed_domains = ["website.com"]
    baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
    # one start URL per results page
    start_urls = [(baseUrl + str(i)) for i in range(1, 1000)]
Currently this code just ends up visiting the first page over and over again. I feel like the fix is probably straightforward, but I can't quite see how to get around this.
UPDATE: I've made some progress investigating this and found that the site advances through the result pages by sending a POST request back to the previous page via __doPostBack(arg1, arg2). My question now is: how exactly do I mimic this POST request using Scrapy? I know how to make a POST request, but not how to pass it the arguments I want.
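For reference, from what I understand of ASP.NET postbacks, __doPostBack(arg1, arg2) just fills in two hidden inputs on the page's main form and submits it, so I'm assuming the POST body needs to look roughly like this (everything beyond the two event fields, and the placeholder values, are my guesses from looking at the page source):

# Rough shape of the postback payload as I understand it; everything other than
# __EVENTTARGET / __EVENTARGUMENT is a hidden field already present on the page.
postback_data = {
    '__EVENTTARGET': 'ctl00$empcnt$ucResults$pagination',  # arg1 of __doPostBack
    '__EVENTARGUMENT': '2',                                 # arg2: the page to move to
    '__VIEWSTATE': '...',        # placeholder, copied from the current page's hidden input
    '__EVENTVALIDATION': '...',  # placeholder, copied from the current page's hidden input
}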
SECOND UPDATE: I've been making a lot of progress! I think... I looked through examples and documentation and eventually slapped together this version of what I think should do the trick:
def start_requests(self):
    baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
    target = 'ctl00$empcnt$ucResults$pagination'
    requests = []
    for i in range(1, 5):
        url = baseUrl + str(i)
        argument = str(i + 1)
        # mimic the __doPostBack(target, argument) call the site makes for paging
        data = {'__EVENTTARGET': target, '__EVENTARGUMENT': argument}
        currentPage = FormRequest(url, data)
        requests.append(currentPage)
    return requests
The idea is that this treats the POST request just like a form submission and pages forward accordingly. However, when I actually try to run this I get the following tracebacks (condensed for brevity):
2013-03-22 04:03:03-0400 [guru] ERROR: Unhandled error on engine.crawl()
        dfd.addCallbacks(request.callback or spider.parse, request.errback)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 280, in addCallbacks
        assert callable(callback)
    exceptions.AssertionError:
2013-03-22 04:03:03-0400 [-] ERROR: Unhandled error in Deferred:
2013-03-22 04:03:03-0400 [-] Unhandled Error
    Traceback (most recent call last):
    Failure: scrapy.exceptions.IgnoreRequest: Skipped (request already seen)
I'm changing the question to be more directed at what this post has turned into.
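For what it's worth, rereading the FormRequest docs makes me suspect I'm passing the dict into the wrong argument (the second positional argument of Request is the callback, which would explain the assert callable(callback)). Something along these lines is what I'd try next, though I'm not sure it's right, and I'm only guessing that dont_filter is what gets past the "request already seen" filter:

def start_requests(self):
    baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
    target = 'ctl00$empcnt$ucResults$pagination'
    for i in range(1, 5):
        # pass the postback fields through formdata= and give Scrapy an explicit callback
        yield FormRequest(
            baseUrl + str(i),
            formdata={'__EVENTTARGET': target, '__EVENTARGUMENT': str(i + 1)},
            callback=self.parse,
            dont_filter=True)  # my guess at stopping the duplicate filter from skipping these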
Thoughts?
P.S. When the second error happens, Scrapy is unable to shut down cleanly and I have to send SIGINT twice to get things to actually wrap up.