Avoiding redirection

Posted 2019-02-26 00:06

I'm trying to parse a site (written in ASP), and the crawler gets redirected to the main page. What I'd like to do is parse the given URL, not the one it redirects to. Is there a way to do this? I tried adding "REDIRECT=False" to the settings.py file, without success.

Here's some output from the crawler:

2011-09-24 20:01:11-0300 [coto] DEBUG: Redirecting (302) to <GET http://www.cotodigital.com.ar/default.asp> from <GET http://www.cotodigital.com.ar/l.asp?cat=500&id=500>
2011-09-24 20:01:11-0300 [coto] DEBUG: Redirecting (302) to <GET http://www.cotodigital.com.ar/default.asp> from <GET http://www.cotodigital.com.ar/l.asp?cat=1513&id=1513>
2011-09-24 20:01:11-0300 [coto] DEBUG: Redirecting (302) to <GET http://www.cotodigital.com.ar/default.asp> from <GET http://www.cotodigital.com.ar/l.asp?cat=476&id=476>
2011-09-24 20:01:11-0300 [coto] DEBUG: Redirecting (302) to <GET http://www.cotodigital.com.ar/default.asp> from <GET http://www.cotodigital.com.ar/l.asp?cat=472&id=472>
2011-09-24 20:01:11-0300 [coto] DEBUG: Redirecting (302) to <GET http://www.cotodigital.com.ar/default.asp> from <GET http://www.cotodigital.com.ar/l.asp?cat=457&id=457>
2011-09-24 20:01:11-0300 [coto] DEBUG: Redirecting (302) to <GET http://www.cotodigital.com.ar/default.asp> from <GET http://www.cotodigital.com.ar/l.asp?cat=1097&id=1097>

Tags: python scrapy
2 answers
别忘想泡老子
Reply #2 · 2019-02-26 01:02

http://www.cotodigital.com.ar/l.asp?cat=1097&id=1097 redirects to http://www.cotodigital.com.ar/default.asp because the HTTP response says so. This happens because the ASP code is checking for some condition: a wrong page, cookies, the user agent, or the referrer. Check each of those.

UPDATE: I just checked in my browser. The browser is also redirected to the main page, where I click 'Skip ads'. After that it works OK.

This means the site sets some cookies, and without them it redirects to the main page.

See also Scrapy - how to manage cookies/sessions
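
To make the cookie explanation concrete, here is a minimal standard-library sketch rather than Scrapy code. It runs a toy local server that imitates the behaviour described above, then compares a request without cookies to one made after visiting the landing page with a cookie jar. The cookie name `seen_ads`, the handler, and the assumption that simply loading `/default.asp` sets the cookie are all hypothetical; the real site may need whatever request the 'Skip ads' link performs.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class GateHandler(BaseHTTPRequestHandler):
    """Toy stand-in for the site; paths and cookie name are hypothetical."""

    def do_GET(self):
        if self.path.startswith("/default.asp"):
            # Landing page hands out the cookie (assumed behaviour).
            self.send_response(200)
            self.send_header("Set-Cookie", "seen_ads=1; Path=/")
            body = b"main page"
        elif "seen_ads=1" in self.headers.get("Cookie", ""):
            self.send_response(200)
            body = b"category page"
        else:
            # No cookie yet: bounce the client to the landing page.
            self.send_response(302)
            self.send_header("Location", "/default.asp")
            body = b""
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def make_opener(*handlers):
    # An empty ProxyHandler ignores any system proxy for this local demo.
    return urllib.request.build_opener(urllib.request.ProxyHandler({}), *handlers)


def fetch_with_session(base_url, path):
    """Visit the landing page first, then request `path` with the same cookies."""
    jar = http.cookiejar.CookieJar()
    opener = make_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.open(base_url + "/default.asp").read()  # collects the cookie
    with opener.open(base_url + path) as resp:
        return resp.geturl(), resp.read()


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), GateHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_address[1]
    path = "/l.asp?cat=500&id=500"
    try:
        # Without cookies the 302 is followed and we land on the main page.
        with make_opener().open(base + path) as resp:
            without_cookies = resp.geturl()
        with_cookies = fetch_with_session(base, path)
    finally:
        server.shutdown()
        server.server_close()
    return without_cookies, with_cookies
```

Scrapy keeps cookies between requests by default, so the same idea there would be to request the main page first and only then yield the category URLs from its callback.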

欢心
Reply #3 · 2019-02-26 01:03

The original URL has nothing to scrape. The server answered it with a 302, so the response carries no useful body, only a Location header saying where to go instead. You need to figure out how to access the URL without being redirected, perhaps by authenticating.
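
A small standard-library sketch of what such a response looks like when the redirect is not followed. The toy server and the `peek` helper are hypothetical and only imitate the log in the question; `http.client` is used because it never follows redirects on its own.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectHandler(BaseHTTPRequestHandler):
    """Toy server that answers every GET with a 302, like the log in the question."""

    def do_GET(self):
        self.send_response(302)
        self.send_header("Location", "/default.asp")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging


def peek(host, port, path):
    """Fetch `path` once and return (status, Location header, body)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location"), resp.read()
    finally:
        conn.close()


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return peek("127.0.0.1", server.server_address[1], "/l.asp?cat=1097&id=1097")
    finally:
        server.shutdown()
        server.server_close()
```

In Scrapy itself, the setting that turns the redirect middleware off is `REDIRECT_ENABLED = False` (not `REDIRECT`), and a single request can opt out with `meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}`. Either way the spider would only receive this same empty 302, so disabling redirects alone is unlikely to yield the page content.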
