There are seemingly a million questions involving Python Unicode errors where the `ordinal [is] not in range(128)`. Seemingly, the vast majority involve Python 2.x.
I know about these errors because I am currently in encoding/decoding hell. For a side project, I scrape web pages and attempt to normalize that text data so that it doesn't appear on our site with crazy characters. To normalize the data, I rely on `HTMLParser`'s `HTMLParser()` and `entitydefs`, as well as decoding the text from whatever its original form was (`string.decode('[original encoding]', 'ignore')`) and encoding it as UTF-8 (`string.encode('utf-8', 'ignore')`).
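In outline, that normalization pipeline looks something like this (sketched in Python 3 syntax, where `html.unescape` plays the role of the old `HTMLParser`/`entitydefs` machinery; the sample bytes and the Latin-1 encoding are made up for illustration):

```python
import html

# Raw bytes as fetched from a page (made-up example; assume Latin-1 content)
page_bytes = b'Caf\xe9 &amp; bar'

decoded = page_bytes.decode('latin-1', 'ignore')  # bytes -> text
text = html.unescape(decoded)                     # resolve entities like &amp;
utf8_bytes = text.encode('utf-8', 'ignore')       # text -> UTF-8 bytes

print(text)  # Café & bar
```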
Yet, seemingly, there's always a site on which my best efforts fail, raising the same old `UnicodeError: ASCII decoding error...ordinal not in range(128)`. It's so annoying.
I've read (here and here) that in Python 3 all text is Unicode. While I've read a lot about Unicode, I'm not a software engineer, so I don't know whether Unicode is objectively better (i.e., has a lower failure rate) than 2.x's default `ascii` encoding. I have to think anything would be better, but I'd like it if someone more expert and experienced could lend some perspective.
I'd like to know whether I should migrate to Python 3 for its (improved) handling of text scraped from the web. I'm hoping that someone here can explain (or suggest resources that explain) the pros and cons of Python 3's approach to text processing. Is it better? Has anyone who's dealt with this same problem already migrated to Python 3, and would they recommend that I start using it, if the `2to3` migration weren't an issue?
Thank you in advance for any assistance. I sure need it.
I'll speak from the point of view of a Python 2.7 user.
It's true that Python 3 introduces some big changes in the Unicode department. I won't say it is easier to work with encodings in Python 3, but it is indeed more reasonable for doing i18n work.

Like I said, I use Python 2.7, and so far I've been able to handle every encoding problem I've run into. You just have to understand what's going on under the hood and have a reasonable background in what encodings are all about. This is the best article there is for understanding encodings. In that article, Joel makes a point you need to keep in mind every time you find yourself in an encoding situation: there is no such thing as plain text; a string only makes sense together with the encoding it uses.

Having said that, my suggestion for approaching your problem with Python 2.7 would be something like this:

1. Detect the encoding the web page is using. You can sense this by looking at the response headers, or in a `<meta>` field in the HTML (`BeautifulSoup` can help you get at the latter).
2. `.decode()` the retrieved string using the encoding you figured out.
3. Once you `decode`, you don't have a `str` object anymore; you have a `unicode` object.
4. `unicode` is just an internal representation, not a real encoding, so if you want to output the content somewhere, you'll have to `.encode()` it, and I suggest you use `utf-8`.
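A minimal sketch of those steps (the header value and page bytes below are assumptions for illustration, and the `charset=` parsing is deliberately simplified):

```python
# Step 1: detect the encoding, e.g. from the Content-Type response header
# (header value and page bytes are made up for illustration).
content_type = 'text/html; charset=ISO-8859-1'
charset = 'utf-8'  # fallback when no charset is declared
if 'charset=' in content_type:
    charset = content_type.split('charset=')[-1].strip()

raw = b'\xa1Hola, se\xf1or!'

# Step 2: .decode() with the detected encoding; the result is no longer
# a byte string but a text (unicode) object.
text = raw.decode(charset, 'ignore')

# Step 3: to output it somewhere, .encode() it to a real encoding -- utf-8.
out = text.encode('utf-8')
```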
Now, some points have to be understood. Maybe the web page you're scraping is not encoding-aware: it says it uses some encoding but doesn't stick to it. This is an error made by the webmaster, but you have to do something to work around it. You have three choices, matching Python's codec error handlers:

- `ignore` characters that can be problematic; just quietly pass over them.
- `replace` problematic characters with a placeholder character.
- Be `strict` (the default) and raise an error when the encoding is malformed.
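A quick illustration of those three choices (shown in Python 3 syntax), using a byte sequence that is invalid as UTF-8:

```python
bad = b'caf\xe9'  # 0xe9 on its own is not valid UTF-8

print(bad.decode('utf-8', 'ignore'))   # quietly drops the bad byte: 'caf'
print(bad.decode('utf-8', 'replace'))  # substitutes U+FFFD for the bad byte
try:
    bad.decode('utf-8')                # 'strict' is the default: it raises
except UnicodeDecodeError as exc:
    print('malformed input:', exc.reason)
```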
To get encodings right, some amount of discipline is needed from both the source and the client. You have to write your program correctly, but you also need the declared encoding and the real encoding at the source to match.

Python 3 improves its `unicode` handling, but if you don't understand what is going on, it will probably be of little help. The best thing you can do is understand encodings (it ain't that hard; again, read Joel!), and once you understand them, you'll be able to process text with Python 2.7, Python 3.3, and even PHP ;)

Hope this helps!