I'm writing a simple Python CGI script that grabs a webpage and displays the HTML file in the web browser (acting like a proxy). Here is the script:
#!/usr/bin/env python3.0
import urllib.request
site = "http://reddit.com/"
site = urllib.request.urlopen(site)
site = site.read()
site = site.decode('utf8')
print("Content-type: text/html\n\n")
print(site)
This script works fine when run from the command line, but when I view it in a web browser, it shows a blank page. Here is the error I get in Apache's error_log:
Traceback (most recent call last):
File "/home/public/projects/proxy/script.cgi", line 11, in <module>
print(site)
File "/usr/local/lib/python3.0/io.py", line 1491, in write
b = encoder.encode(s)
File "/usr/local/lib/python3.0/encodings/ascii.py", line 22, in encode
return codecs.ascii_encode(input, self.errors)[0]
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 33777: ordinal not in range(128)
When you print it at the command line, you print a Unicode string to the terminal. The terminal has an encoding, so Python will encode your Unicode string to that encoding. This will work fine.
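When you use it in CGI, you end up printing to a stdout that is not attached to a terminal, and the web server typically starts the script without any locale settings, so Python has no encoding to pick up from the environment. It therefore falls back to ASCII. This fails, as ASCII doesn't contain all the characters you try to print (here '\u2019', a right single quotation mark), so you get the above error.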
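To make the failure concrete, here is a minimal sketch that reproduces it without a web server. The sample string and the helper name `try_encode` are made up for illustration; only the character U+2019 comes from the traceback above.

```python
# Sample text containing U+2019, the character named in the traceback.
text = "It\u2019s"


def try_encode(s, encoding):
    """Return s encoded with the given codec, or None if the codec cannot represent it."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError:
        return None


# ASCII has no code for U+2019, so this is what fails inside print() under CGI.
print(try_encode(text, "ascii"))
# UTF-8 can represent every Unicode character, so this succeeds.
print(try_encode(text, "utf-8"))
```

The first call returns None and the second returns the UTF-8 bytes, which is the same distinction Python hits when it has to pick an encoding for stdout.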
The fix for this is to encode your string into some sort of encoding (why not UTF8?) and also say so in the header.
So something like this:
import sys
sys.stdout.buffer.write(b"Content-Type: text/html; charset=UTF-8\n\n") # The parameter is spelled "charset", not "encoding".
sys.stdout.buffer.write(site.encode('UTF8'))
Under Python 2, this would work as well:
print("Content-Type: text/html; charset=UTF-8\n\n") # The parameter is spelled "charset", not "encoding".
print(site.encode('UTF8'))
But under Python 3 the encoded data is a bytes object, so print() would output its repr (b'...') instead of the raw bytes.
Of course you'll notice that you now first decode from UTF8 and then re-encode it. You don't need to do that, strictly speaking. But if you want to modify the HTML in between, it may actually be a good idea to do so, and keep all modifications in Unicode.
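As a rough sketch of that decode, modify, re-encode flow: the function name `build_cgi_response` and the title rewrite are hypothetical, chosen only to show a modification being done on Unicode text, with a single encode at the point where bytes are written out.

```python
import sys


def build_cgi_response(raw_html, source_encoding="utf-8"):
    """Decode fetched bytes, edit them as str, and return the whole CGI response as bytes."""
    # Decode once, right after fetching.
    html = raw_html.decode(source_encoding)
    # Hypothetical modification, performed on Unicode text rather than on bytes.
    html = html.replace("<title>", "<title>[proxied] ", 1)
    # Encode once, right before output, and declare the same charset in the header.
    header = b"Content-Type: text/html; charset=UTF-8\n\n"
    return header + html.encode("utf-8")


if __name__ == "__main__":
    # Stand-in for the bytes returned by urlopen(...).read().
    sample = "<title>caf\u00e9</title>".encode("utf-8")
    sys.stdout.buffer.write(build_cgi_response(sample))
```

If you don't modify the page at all, you could skip both steps and write the fetched bytes straight to sys.stdout.buffer, as long as the charset in the header matches what the site actually sent.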
It could be that the site you are trying to open is not UTF-8 encoded. Try passing "iso-8859-1" to the decode method.
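One way to avoid guessing is to try the charset the server declares first and only then fall back. This is a sketch under assumptions: `decode_body` is a made-up helper, and the order of attempts (declared charset, then UTF-8, then iso-8859-1) is just one reasonable choice. With urllib.request, the declared charset should be available as `response.headers.get_content_charset()`, which returns None when the header doesn't name one.

```python
def decode_body(raw, declared_charset=None, fallback="iso-8859-1"):
    """Decode raw bytes using the declared charset, then UTF-8, then a fallback."""
    for encoding in (declared_charset, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: move on to the next candidate.
            continue
    # iso-8859-1 maps every byte to a character, so this decode cannot fail,
    # although the result may be wrong if the page used some other encoding.
    return raw.decode(fallback)


# Bytes that are valid iso-8859-1 but not valid UTF-8.
print(decode_body(b"caf\xe9"))
```

In the script from the question, this would replace the hard-coded `site.decode('utf8')` call.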
Rather than wrestling with the sys.stdout internals, it is much more straightforward to have the web server (1) set the CGI environment variable PYTHONIOENCODING (2) to UTF8.
For Apache2, you'll have to enable the loading of mod_env.so. In a Debian installation, that means creating a symlink in /etc/apache2/mods-enabled to /etc/apache2/mods-available/env.load, creating a configuration file /etc/apache2/conf-available/env.conf, and adding a symlink to it in /etc/apache2/conf-enabled, if you wish to keep the same structure as all the other module loaders and configs.
The contents of the env_mod.conf file I created are:
<IfModule mod_env.c>
SetEnv PYTHONIOENCODING UTF8
</IfModule>
Before I did this, my script was reporting that sys.stdout.encoding was "ANSI ..." and erroring out when trying to print a string containing Unicode characters. Afterwards, it was "UTF8", and the script was correctly sending the desired UTF-8 to the browser.
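You can check the effect of the variable without touching Apache by starting a child interpreter with PYTHONIOENCODING set, which is roughly what the web server does for a CGI script. This is only a sketch: `run_child` is a hypothetical helper, and the exact spelling Python reports for the encoding may vary between versions, so the name is normalized with codecs.lookup.

```python
import codecs
import os
import subprocess
import sys


def run_child(code, io_encoding):
    """Run code in a child interpreter with PYTHONIOENCODING set; return its stdout bytes."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.strip()


# What the child reports as its stdout encoding.
reported = run_child("import sys; print(sys.stdout.encoding)", "UTF8")
# The raw bytes the child emits when it prints the character from the traceback.
printed = run_child(r"print('\u2019')", "UTF8")

print(codecs.lookup(reported.decode("ascii")).name)
print(printed)
```

The child's stdout is a pipe rather than a terminal, which is the same situation a CGI script is in.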
(1) http://httpd.apache.org/docs/2.2/howto/cgi.html#env
(2) http://docs.python.org/3.3/library/sys.html#sys.stdin