How to use Python to retrieve xml page that requir

2019-06-01 10:16发布

When I access a page on an IIS server to retrieve xml, using a query parameter through the browser (using the http in the below example) I get a pop-up login dialog for username and password (appears to be a system standard dialog/form). and once submitted the data arrives. as an xml page.

How do I handle this with urllib? when I do the following, I never get prompted for a uid/psw.. I just get a traceback indicating the server (correctly ) id's me as not authorized. Using python 2.7 in Ipython notebook

f = urllib.urlopen("http://www.nalmls.com/SERetsHuntsville/Search.aspx?SearchType=Property&Class=RES&StandardNames=0&Format=COMPACT&Query=(DATE_MODIFIED=2012-09-28T00:00:00%2B)&Limit=10")
s = f.read()
f.close()

Pointers to doc also appreciated! did not find this exact use case.

I plan to parse the xml to csv if that makes a difference.

3条回答
霸刀☆藐视天下
2楼-- · 2019-06-01 10:30

There are many ways to do it but i suggest you start with urllib2 and it's batteries included.

import urllib2, base64

req = urllib2.Request("http://webpage.com//user")
b64str = base64.encodestring('%s:%s' % (username, password)).replace('\n', '')
request.add_header("Authorization", "Basic %s" % b64str)   
result = urllib2.urlopen(req)

You can use requests, beautifulsoup,mechanize or selenium if your task gets harder. Googling will give you enough examples for each one of these,

查看更多
迷人小祖宗
3楼-- · 2019-06-01 10:39

You are dealing with http authentication. I've always found it tricky to get working quickly with the urllib library. The requests python package makes it super simple.

url = "http://www.nalmls.com/SERetsHuntsville/Search.aspx?SearchType=Property&Class=RES&StandardNames=0&Format=COMPACT&Query=(DATE_MODIFIED=2012-09-28T00:00:00%2B)&Limit=10"
r = requests.get(url, auth=('user', 'pass'))
page = r.text

If you look at the headers for that url you can see that it is using digest authentication:

{'content-length': '1893', 'x-powered-by': 'ASP.NET', 'x-aspnet-version': '4.0.30319', 'server': 'Microsoft-IIS/7.5', 'cache-control': 'private', 'date': 'Fri, 05 Oct 2012 18:20:54 GMT', 'content-type': 'text/html; charset=utf-8', 'www-authenticate': 'Digest realm="Solid Earth", nonce="MTAvNS8yMDEyIDE6MjE6MjUgUE0", opaque="0000000000000000", stale=false, algorithm=MD5, qop="auth"'}

So you will need:

from requests.auth import HTTPDigestAuth
r = requests.get(url, auth=HTTPDigestAuth('user', 'pass'))
查看更多
该账号已被封号
4楼-- · 2019-06-01 10:44

This can be done in a couple of ways:

  1. Use urllib/urllib2 and requests as others have suggested
  2. Use Mechanize to simulate manual form-filling and get back the response
查看更多
登录 后发表回答