Script always gets a 302 response when pulling ran

I can pull a any page from wikipedia with

import httplib
conn = httplib.HTTPConnection("en.wikipedia.org")
conn.debuglevel = 1
conn.request("GET","/wiki/Normal_Distribution",headers={'User-Agent':'Python httplib'})
r1 = conn.getresponse()
r1.read()

The normal response will be

reply: 'HTTP/1.0 200 OK\r\n'
header: Date: Sun, 03 Apr 2011 23:49:36 GMT
header: Server: Apache
header: Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
header: Content-Language: en
header: Vary: Accept-Encoding,Cookie
header: Last-Modified: Sun, 03 Apr 2011 17:23:50 GMT
header: Content-Length: 263638
header: Content-Type: text/html; charset=UTF-8
header: Age: 1280309
header: X-Cache: HIT from sq77.wikimedia.org
header: X-Cache-Lookup: HIT from sq77.wikimedia.org:3128
header: X-Cache: MISS from sq66.wikimedia.org
header: X-Cache-Lookup: MISS from sq66.wikimedia.org:80
header: Connection: close

But if I try to pull a random page with /wiki/Special:Random I get a 302 response and an empty page

reply: 'HTTP/1.0 302 Moved Temporarily\r\n'
header: Date: Mon, 18 Apr 2011 19:25:52 GMT
header: Server: Apache
header: Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
header: Vary: Accept-Encoding,Cookie
header: Expires: Thu, 01 Jan 1970 00:00:00 GMT
header: Location: http://en.wikipedia.org/wiki/Tuticorin_Port_Trust
header: Content-Length: 0
header: Content-Type: text/html; charset=utf-8
header: X-Cache: MISS from sq60.wikimedia.org
header: X-Cache-Lookup: MISS from sq60.wikimedia.org:3128
header: X-Cache: MISS from sq62.wikimedia.org
header: X-Cache-Lookup: MISS from sq62.wikimedia.org:80
header: Connection: close

How do I get a non-empty random page?

标签： python http wikipedia

4条回答

不美不萌又怎样

2楼-- · 2019-07-08 22:22

HTTP code 302 means you are being redirected. If you look at the Location header, you will see where you should make your new request. Then you can make the request to that URL and you'll hopefully get a 200 on that page.

To clarify: You are being requested to retry the request elsewhere. That's why your client needs to make another request when it receives a 302. Wikipedia's random page apparently works by choosing a random page in its database, then returning a 302 response with the new page as the Location field. If you look at other 302 responses, I'm sure you'll see a different page in the Location field.

0人赞添加讨论(0) 举报

The star\"

3楼-- · 2019-07-08 22:22

Look at the location header:

header: Location: http://en.wikipedia.org/wiki/Tuticorin_Port_Trust

It says you should redirected to that page. Read that header and do another request to that page.

0人赞添加讨论(0) 举报

爷的心禁止访问

4楼-- · 2019-07-08 22:28

When you're redirected the response object is going to have a code of 302 and the geturl() method will report the redirect URL. Python's standard HTTP libraries make it non-trivial to handled redirects by default. Do yourself a favor, don't hassle with this stuff and use the 3rd party mechanize library, which is a drop-in replacement for urllib2.

Using mechanize, your code would look like this:

import httplib
import mechanize

host = 'en.wikipedia.org'
path = '/wiki/Special:Random'
url = 'http://' + host + path # We have to pass a http:// url

# It still uses httplib.HTTPConnection, so we can debug
httplib.HTTPConnection.debuglevel = 1

request = mechanize.Request(url, headers={'User-Agent': 'Python-mechanize'}) 
response = mechanize.urlopen(request)

print response.code
# => 200
print response.geturl()
# => 'http://en.wikipedia.org/wiki/Faliszowice,_Lesser_Poland_Voivodeship'
data = response.read()

0人赞添加讨论(0) 举报

在下西门庆

5楼-- · 2019-07-08 22:30

The 302 is a redirect. It's telling you where to go in the following line:

header: Location: http://en.wikipedia.org/wiki/tuticorin_port_trust

You just need to follow the redirect.

0人赞添加讨论(0) 举报

Script always gets a 302 response when pulling ran

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间