I am trying to read some utf-8 files from the addresses in the code below. It works for most of them, but for some files urllib2 (and urllib) is unable to read them.
The obvious answer here is that the second file is corrupt, but the strange thing is that IE reads both of them with no problem at all. The code has been tested on both XP and Linux, with identical results. Any suggestions?
import urllib2
#This works:
f=urllib2.urlopen("http://www.gutenberg.org/cache/epub/145/pg145.txt")
line=f.readline()
print "this works: %s" %(line)
line=unicode(line,'utf-8') #... works fine
#This doesn't
f=urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
line=f.readline()
print "this doesn't: %s" %(line)
line=unicode(line,'utf-8')#...causes an exception:
>>> f=urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
>>> f.headers.dict
{'content-length': '304513', ..., 'content-location': 'pg144.txt.utf8.gzip', 'content-encoding': 'gzip', ..., 'content-type': 'text/plain; charset=utf-8'}
Either set a header that prevents the site from sending a gzip-encoded response, or decode the response yourself.
The URL you're asking for seems to refer to a private cache. Try http://www.gutenberg.org/files/144/144-0.txt instead (found at http://www.gutenberg.org/ebooks/144).
If you really want to use the /cache/ URL: the server is sending you gzipped data, not unicode. urllib2 does not ask for gzipped data and does not decode it, which is correct behavior.
See this question for how to uncompress it.
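A minimal sketch of that decoding step, using the standard-library gzip module. A locally gzipped byte string stands in for the server response here (the sample text is made up for illustration); in the question's code it would be the result of f.read() from urllib2.urlopen(...):

```python
import gzip
import io

# Stand-in for the gzip-encoded body the /cache/ URL returns.
# (The sample text is a placeholder, not the real file contents.)
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(u"some utf-8 text from the ebook".encode("utf-8"))
compressed_body = buf.getvalue()

# The decoding step: unwrap the gzip layer, then decode as utf-8.
raw = gzip.GzipFile(fileobj=io.BytesIO(compressed_body)).read()
text = raw.decode("utf-8")  # same result as unicode(raw, 'utf-8')
assert text == u"some utf-8 text from the ebook"

# Alternatively, ask the server not to gzip at all:
# req = urllib2.Request(url, headers={"Accept-Encoding": "identity"})
# f = urllib2.urlopen(req)
```

The Accept-Encoding: identity header at the end is the "prevent gzip" alternative mentioned above; it is commented out since it requires a live connection.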
I know it's not a direct solution, but you should look at the Requests library (http://pypi.python.org/pypi/requests). Even if you still want to use urllib, you can read the source code of Requests to understand how it works with utf-8 strings.