Python urllib.request and utf8 decoding question

Posted 2019-04-01 21:32

I'm writing a simple Python CGI script that grabs a webpage and displays the HTML file in the web browser (acting like a proxy). Here is the script:

#!/usr/bin/env python3.0

import urllib.request

site = "http://reddit.com/"
site = urllib.request.urlopen(site)
site = site.read()
site = site.decode('utf8')

print("Content-type: text/html\n\n")
print(site)

This script works fine when run from the command line, but when I view it through a web browser it shows a blank page. Here is the error I get in Apache's error_log:

Traceback (most recent call last):
  File "/home/public/projects/proxy/script.cgi", line 11, in <module>
    print(site)
  File "/usr/local/lib/python3.0/io.py", line 1491, in write
    b = encoder.encode(s)
  File "/usr/local/lib/python3.0/encodings/ascii.py", line 22, in encode
    return codecs.ascii_encode(input, self.errors)[0]
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 33777: ordinal not in range(128)

3 answers
放我归山
#2 · 2019-04-01 21:46

Rather than wrestling with the sys.stdout internals, it is much more straightforward to have the web server (1) set the CGI environment variable PYTHONIOENCODING (2) to UTF8.

For Apache2, you'll have to enable the loading of mod_env.so. In a Debian installation, that equates to creating a symlink in /etc/apache2/mods-enabled to /etc/apache2/mods-available/env.load, and creating a configuration /etc/apache2/conf-available/env.conf, and a symlink in /etc/apache2/conf-enabled to that, if you wish to keep the structure the same as with all the other module loader and configs.

The contents of the env_mod.conf file I created are:

<IfModule mod_env.c>
  SetEnv PYTHONIOENCODING UTF8
</IfModule>

Before I did this, my script reported that sys.stdout.encoding was "ANSI ..." and errored out when trying to print a string containing non-ASCII characters. Afterwards, it reported "UTF8" and correctly sent the desired UTF-8 to the browser.
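One way to check the effect of the variable outside Apache is to start a child interpreter with PYTHONIOENCODING set and read back its sys.stdout.encoding. This is only a sketch: the helper name child_stdout_encoding is made up for illustration, and because the exact spelling Python reports (for example "UTF8" versus "utf-8") may differ between versions, the name is normalized with codecs.lookup.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(value=None):
    """Return the normalized stdout encoding of a child interpreter.

    Hypothetical helper: `value` is what PYTHONIOENCODING is set to in the
    child's environment, or None to leave the variable unset.
    """
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)
    if value is not None:
        env["PYTHONIOENCODING"] = value
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    reported = result.stdout.decode("ascii").strip()
    # Normalize spellings such as "UTF8", "utf8" and "utf-8" to one codec name.
    return codecs.lookup(reported).name
```

Calling child_stdout_encoding("UTF8") should give the codec name for UTF-8, which mirrors what SetEnv does for the CGI process.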

(1) http://httpd.apache.org/docs/2.2/howto/cgi.html#env

(2) http://docs.python.org/3.3/library/sys.html#sys.stdin

chillily
#3 · 2019-04-01 21:52

It could be that the site you are trying to open is not UTF-8 encoded. Try passing "iso-8859-1" to the decode method.
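Instead of guessing, the charset can usually be read from the Content-Type header of the response. The sketch below assumes you already have the raw bytes and the header value; decode_body and its fallback parameter are hypothetical names. With urllib.request, response.headers.get_content_charset() gives the same information directly.

```python
from email.message import Message


def decode_body(raw, content_type, fallback="iso-8859-1"):
    """Decode `raw` bytes using the charset named in a Content-Type value.

    Hypothetical helper: falls back to `fallback` when the header names no
    charset, names an unknown one, or the bytes do not match it.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # iso-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode(fallback, errors="replace")
```

Note that a wrong fallback still produces a string, just with the wrong characters, so this only helps when the server's header is accurate.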

Animai°情兽
#4 · 2019-04-01 22:05

When you print it at the command line, you print a Unicode string to the terminal. The terminal has an encoding, so Python will encode your Unicode string to that encoding. This will work fine.

When you use it in CGI, you end up printing to stdout, which does not have an encoding. Python therefore tries to encode the string with ASCII. This fails, as ASCII doesn't contain all the characters you try to print, so you get the above error.
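The failure can be reproduced without Apache. In this sketch an io.TextIOWrapper over an in-memory buffer stands in for the CGI process's stdout, and try_write is a made-up helper name. With the ASCII codec the write fails on '\u2019', as in the traceback; with UTF-8 it succeeds.

```python
import io


def try_write(text, encoding):
    """Write `text` to a text stream that uses `encoding`.

    Hypothetical helper: returns the encoded bytes, or None if the codec
    cannot represent some character in `text`.
    """
    buf = io.BytesIO()
    stream = io.TextIOWrapper(buf, encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError:
        return None
    return buf.getvalue()


ascii_result = try_write("it\u2019s", "ascii")   # None: ASCII has no U+2019
utf8_result = try_write("it\u2019s", "utf-8")    # the UTF-8 encoded bytes
```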

The fix for this is to encode your string into some sort of encoding (why not UTF8?) and also say so in the header.

So something like this:

import sys
sys.stdout.buffer.write(b"Content-Type: text/html; charset=UTF-8\n\n")  # the header parameter is named charset
sys.stdout.buffer.write(site.encode('UTF8'))

Under Python 2, this would work as well:

print("Content-Type: text/html; charset=UTF-8\n")  # the header parameter is named charset; print adds the second newline
print(site.encode('UTF8'))

But under Python 3 the encoded data is a bytes object, so print would output its repr (b'...') rather than the raw bytes.

Of course you'll notice that you now first decode from UTF8 and then re-encode it. You don't need to do that, strictly speaking. But if you want to modify the HTML in between, it may actually be a good idea to do so, and keep all modifications in Unicode.
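The decode, modify, re-encode flow might look roughly like the following. This is a sketch under assumptions: build_response is a hypothetical name, the fetched page is assumed to be UTF-8, and the title tweak is just a placeholder for whatever modification you want to make.

```python
import sys

HEADER = b"Content-Type: text/html; charset=UTF-8\n\n"


def build_response(raw_page):
    """Turn fetched page bytes into a complete CGI response (header + body).

    Hypothetical helper: assumes `raw_page` is UTF-8 encoded.
    """
    html = raw_page.decode("utf-8")
    # Do all modifications on the str, not on bytes (placeholder example).
    html = html.replace("<title>", "<title>[proxied] ", 1)
    return HEADER + html.encode("utf-8")


def send(raw_page, out=None):
    """Write the response as bytes, bypassing the text layer of stdout."""
    out = out if out is not None else sys.stdout.buffer
    out.write(build_response(raw_page))
    out.flush()
```

In the CGI script, this would be called as send(urllib.request.urlopen(site).read()), which never depends on the encoding of sys.stdout.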
