Is there a Python library that translates multi-byte non-ASCII characters into some reasonable form of 7-bit displayable ASCII? The intent is to avoid hard-coding the charmap as given in the answer to Translating multi-byte characters into 7-bit ASCII in Python.
EDIT: I am currently using Python 2.7.11 or greater, and not yet Python 3, but answers giving Python 3 solutions will be considered and found helpful.
The reason is this: as I do the translation manually, I will miss some characters.
My script is:
#!/usr/bin/env python
# -*- mode: python; -*-
import os
import re
import requests
url = "https://system76.com/laptops/kudu"
#
# Load the text from request as a true unicode string:
#
r = requests.get(url)
r.encoding = "UTF-8"
data = r.text # ok, data is a true unicode string
# translate offending characters in unicode:
charmap = {
0x2014: u'-', # em dash
0x201D: u'"', # right double quotation mark
# etc.
}
data = data.translate(charmap)
tdata = data.encode('ascii')
The error I get is:
./simple_wget.py
Traceback (most recent call last):
File "./simple_wget.py", line 25, in <module>
tdata = data.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 10166: ordinal not in range(128)
This will be a never-ending battle to update the charmap for newly discovered characters. Is there a Python library that provides this charmap so I don't have to hard-code it in this manner?
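To make the failure mode concrete, here is a minimal reproduction (a sketch in Python 3 syntax, with the requests call replaced by a hard-coded sample string): the charmap handles the em dash, but the en dash (U+2013) is not in the map, so the ASCII encode still fails.

```python
# Sketch (Python 3): str.translate() accepts a dict mapping ordinals to
# replacement strings, but any character absent from the map passes through
# untouched and still breaks the subsequent ASCII encode.
charmap = {
    0x2014: u'-',  # em dash
    0x201D: u'"',  # right double quotation mark
}

data = u'em dash \u2014 ok, but en dash \u2013 is unmapped'
data = data.translate(charmap)  # em dash replaced; en dash survives

failed = False
try:
    data.encode('ascii')
except UnicodeEncodeError:
    failed = True  # the same error shown in the traceback below

print(data)
print(failed)
```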
str.encode() has an optional errors parameter that can replace unencodable characters instead of raising an error. Is that what you are looking for? See https://docs.python.org/3/howto/unicode.html#converting-to-bytes
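A short sketch of the standard error handlers (Python 3 syntax; the sample string is made up for illustration):

```python
# The errors argument to str.encode() controls how unencodable characters
# are handled instead of raising UnicodeEncodeError (the default, 'strict').
text = u'em dash \u2014 and right quote \u201d'

strict_error = None
try:
    text.encode('ascii')  # errors='strict' by default: raises
except UnicodeEncodeError as exc:
    strict_error = exc

replaced = text.encode('ascii', 'replace')           # offending chars -> b'?'
ignored = text.encode('ascii', 'ignore')             # offending chars dropped
escaped = text.encode('ascii', 'backslashreplace')   # kept as \uXXXX escapes

print(replaced)
print(ignored)
print(escaped)
```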
You may consider the unicodedata module from the Python standard library. One method you may find interesting is normalize (see also the example of its usage given by peterbe.com).

(Note: This answer pertains to Python 2.7.11+.)
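A minimal sketch of the normalize-then-encode approach (Python 3 syntax; the helper name to_ascii and the sample strings are illustrative, not from the original answer):

```python
# NFKD normalization decomposes accented characters into a base character
# plus combining marks, so the base ASCII letter survives an ASCII encode.
# Characters with no decomposition (em dashes, curly quotes) are dropped.
import unicodedata


def to_ascii(text):
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii(u'Mot\u00f6rhead'))        # accented o decomposes to 'o'
print(to_ascii(u'caf\u00e9 \u2014 bar'))  # em dash has no decomposition
```

Note that this handles accented letters well but silently discards punctuation like dashes and curly quotes, which is exactly the limitation the question is concerned with.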
The answer at https://stackoverflow.com/a/1701378/257924 refers to the Unidecode package, which is what I was looking for. In using that package, I also discovered the ultimate source of my confusion, which is elaborated in depth at https://pythonhosted.org/kitchen/unicode-frustrations.html#frustration-3-inconsistent-treatment-of-output, and specifically in this section:
The following is my demonstration script to use it. The characters listed in the names variable are the characters I do need translated into something readable, and not removed, for the types of web pages I am analyzing.

Sample output of the above script is:
The following more elaborate script loads several web pages containing many Unicode characters. See the comments in the script below:
The gist showing the output of the above script shows the execution of the Linux diff command on the "old" and "new" HTML files, so as to see the translations. There will be mistranslation of languages such as German, but that is fine for my purposes of getting a lossy translation of single- and double-quote characters and dash-like characters.
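The core of the Unidecode approach can be sketched as follows (a minimal illustration, not the original demonstration script; it assumes the third-party unidecode package, installable with pip install unidecode, and falls back gracefully when it is absent):

```python
# Unidecode transliterates arbitrary Unicode text into 7-bit ASCII,
# replacing the hand-maintained charmap from the question entirely.
try:
    from unidecode import unidecode  # third-party: pip install unidecode
except ImportError:
    unidecode = None  # allow the sketch to run without the package

sample = u'em dash \u2014, right quote \u201d, en dash \u2013'

if unidecode is not None:
    ascii_text = unidecode(sample)
    print(ascii_text)
    # Every resulting character is guaranteed to be 7-bit ASCII:
    assert all(ord(c) < 128 for c in ascii_text)
else:
    print('unidecode not installed; skipping transliteration')
```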