I'm pulling data out of a Google doc, processing it, and writing it to a file (that eventually I will paste into a Wordpress page).
It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source?
Currently I'm converting everything to Unicode on the way in, joining it all together in a Python string, then doing:
```python
import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))
```
There is an encoding error on the last line:
```
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 12286: ordinal not in range(128)
```
Partial solution:
This Python runs without an error:
```python
row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')
f.write(all_html.encode("utf-8"))
```
But then if I open the actual text file, I see lots of symbols like:
```
Qur’an
```
Maybe I need to write to something other than a text file?
Deal exclusively with unicode objects as much as possible by decoding things to unicode objects when you first get them and encoding them as necessary on the way out.
If your string is actually a unicode object, you'll need to convert it to an encoded string object (a bytestring) before writing it to a file:
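```python
# a minimal sketch: foo stands in for your unicode object
foo = u'Qur\u2019an'           # u'Qur’an' -- contains U+2019, RIGHT SINGLE QUOTATION MARK
f = open('test.txt', 'w')
f.write(foo.encode('utf8'))    # encode to a UTF-8 bytestring on the way out
f.close()
```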
When you read that file again, you'll get an encoded bytestring that you can decode to a unicode object:
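```python
# reading it back: decode the bytestring to recover the unicode object
f = open('test.txt', 'r')
data = f.read()                # a UTF-8 encoded bytestring (str)
text = data.decode('utf8')     # back to a unicode object
f.close()
```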
Unicode string handling is standardized in Python 3.
You only need to open the file in UTF-8:
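```python
import io

# a sketch reusing the question's names; all_html must be a unicode object
with io.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(all_html)
```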
How to print unicode characters into a file:
Save this to a file called foo.py:
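```python
# -*- coding: utf-8 -*-
# a minimal sketch; the exact string printed is only an example
import codecs
import sys

# wrap stdout so unicode strings are encoded as UTF-8 on the way out;
# without this, piping to a file can fail because stdout defaults to ascii
sys.stdout = codecs.getwriter('utf8')(sys.stdout)

print(u'e with accent: \xe9')   # \xe9 is é, LATIN SMALL LETTER E WITH ACUTE
```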
Run it and pipe the output to a file:
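```
python foo.py > tmp.txt
```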
Open tmp.txt and look inside, you see this:
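```
e with accent: é
```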
Thus you have saved a Unicode e with an accent mark on it to a file.
The file opened by codecs.open is a file that takes unicode data, encodes it in ISO-8859-1 and writes it to the file. However, what you try to write isn't unicode; you take unicode and encode it in ISO-8859-1 yourself. That's what the unicode.encode method does, and the result of encoding a unicode string is a bytestring (a str type).

You should either use normal open() and encode the unicode yourself, or (usually a better idea) use codecs.open() and not encode the data yourself.
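For example, a sketch of both options, using the file name and variable from the question:

```python
import codecs

# Option 1: let codecs.open do the encoding -- pass the unicode object straight through
f = codecs.open('out.txt', mode='w', encoding='iso-8859-1')
f.write(all_html)    # all_html is a unicode object; don't encode it yourself
f.close()

# Option 2: plain open() -- encode to a bytestring yourself
f = open('out.txt', 'w')
f.write(all_html.encode('iso-8859-1', 'replace'))
f.close()
```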
Preface: will your viewer work?
Make sure your viewer/editor/terminal (however you are interacting with your UTF-8 encoded file) can read the file. This is frequently an issue on Windows, for example, with Notepad.
In Python 2, use open from the io module (this is the same as the builtin open in Python 3):
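```python
from io import open   # io.open is the same function as the builtin open in Python 3
```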
Best practice: in general, use UTF-8 for writing to files (with UTF-8 we don't even have to worry about byte order). UTF-8 is the most modern and universally usable encoding; it works in all web browsers, most text editors (check your settings if you have issues), and most terminals/shells.
On Windows, you might try utf-16le if you're limited to viewing output in Notepad (or another limited viewer). Then just open the file with the context manager and write your unicode characters out:
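```python
import io

# a sketch: the file name and sample text are placeholders
with io.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(u'Qur\u2019an\n')   # any unicode object works here
```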
Example using many Unicode characters
Here's an example that attempts to map every possible character up to three bytes wide (4 bytes is the max, but that would be going a bit far) from its integer representation to an encoded printable output, along with its name, if possible (put this into a file called uni.py):
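```python
# a sketch of uni.py; the output file name unidata.txt is a placeholder
from __future__ import print_function
import io
from collections import Counter
from unicodedata import category, name

try:                        # Python 2
    unicode_chr = unichr
    range = xrange
except NameError:           # Python 3
    unicode_chr = chr

counts = Counter()
with io.open('unidata.txt', 'w', encoding='utf-8') as f:
    for x in range(2 ** 24):            # every code point up to three bytes wide
        try:
            char = unicode_chr(x)
        except ValueError:
            continue                    # past what this build supports
        cat = category(char)
        counts.update((cat,))
        if cat == 'Cs':
            continue                    # lone surrogates can't be encoded as UTF-8
        try:
            char_name = name(char)
            symbol = char
        except ValueError:              # no name: probably a control character
            char_name = ''
            symbol = u' '               # don't write raw control characters
        f.write(u'{0:>6x} {1} {2} {3}\n'.format(x, cat, symbol, char_name))
print(counts)
```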
This should run in the order of about a minute, and then you can view the data file; if your file viewer can display Unicode, you'll see the symbols. Information about the categories can be found here. Based on the counts, we can probably improve our results by excluding the Cn and Co categories, which have no symbols associated with them.
It will display the hexadecimal code point, the category, the symbol (unless it can't get the name, in which case it's probably a control character), and the name of the symbol. To view the file, I recommend less on Unix or Cygwin (don't print/cat the entire file to your output). It will display lines similar to the following, which I sampled from it using Python 2 (Unicode 5.2):
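```
    41 Lu A LATIN CAPITAL LETTER A
   3b1 Ll α GREEK SMALL LETTER ALPHA
  262f So ☯ YIN YANG
```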
My Python 3.5 from Anaconda has Unicode 8.0; I would presume most Python 3 builds do.
In Python 2.6+, you could use io.open(), which is the default (builtin open()) on Python 3:
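```python
import io

# a sketch reusing names from the text; unicode_text is a unicode object
# and character_encoding is e.g. 'utf-8'
with io.open('out.txt', 'w', encoding=character_encoding) as f:
    f.write(unicode_text)
```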
It might be more convenient if you need to write the text incrementally (you don't need to call unicode_text.encode(character_encoding) multiple times). Unlike the codecs module, the io module has proper universal newlines support.