Is Python 3.3 better than 2.7 for Decoding and Re-

Posted 2019-06-17 07:53

There are seemingly a million questions involving Python Unicode errors of the form "...ordinal [is] not in range(128)". The vast majority seem to involve Python 2.x.

I know about these errors because I am currently in encoding, decoding hell. For a side-project, I scrape web pages and attempt to normalize that text data, so that it doesn't appear on our site with crazy characters. To normalize the data, I rely on HTMLParser's HTMLParser() and entitydefs, as well as decoding the text from whatever its original form was (string.decode('[original encoding]', 'ignore')) and encoding it as UTF-8 (string.encode('utf-8', 'ignore')).

Yet, seemingly, there's always a site on which my best efforts fail, raising the same old UnicodeError: ASCII decoding error...ordinal not in range(128). It's so annoying.

I've read (here and here) that in Python 3 all text is Unicode. While I've read a lot about Unicode, I'm not a software engineer, so I don't know whether Unicode is objectively better (i.e., has a lower failure rate) than 2.x's default ASCII encoding. I have to think anything would be better, but I'd like it if someone more expert and experienced could lend some perspective.

I'd like to know whether I should migrate to Python 3 for its (improved) processing of text scraped from the web. I am hoping that someone here can explain (or suggest resources that explain) the pros and cons of Python 3's approach to text processing. Is it better? Has anyone who has dealt with this same problem already migrated to Python 3? Would they recommend that I start using Python 3, if the 2to3 migration weren't an issue?

Thank you in advance for any assistance. I sure need it.

1 answer
不美不萌又怎样
#2 · 2019-06-17 08:42

I'll speak from the point of view of a Python 2.7 user.

It's true that Python 3 introduces some big changes in how Unicode is handled. I won't say it is easier to work with encodings in Python 3, but its model is indeed more reasonable for i18n work.

Like I said, I use Python 2.7, and so far I've been able to handle every encoding problem I've run into. You just have to understand what's going on under the hood and have a reasonable background in what encodings are all about. The best article there is for understanding encodings is Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

In that article, Joel says something that you need to keep in mind every time you find yourself in an encoding situation:

It does not make sense to have a string without knowing what encoding it uses.
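To make Joel's point concrete, here is a tiny sketch (written for Python 3; the byte values are just an example I picked) showing the same two bytes turning into different text depending on which encoding you assume:

```python
# The same two bytes, read under two different encodings.
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # one character: U+00E9 (e with acute)
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3 followed by U+00A9

print(repr(as_utf8), len(as_utf8))
print(repr(as_latin1), len(as_latin1))
```

Neither call raises an error, so nothing tells you which reading is the right one. Only knowing the encoding does.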

Having said that, my suggestion to approach your problem with Python 2.7 would be something like this:

  1. Read Joel's article, of course (it's a great read and will take 30 minutes or less).
  2. Figure out what encoding the web page is using (you can find this by looking at the response headers or in a field in BeautifulSoup).
  3. .decode() the retrieved string using the encoding you figured out.
  4. Once you decode, you don't have a str object anymore; you have a unicode object.
  5. unicode is just an internal representation, not a real encoding, so if you want to output the content somewhere, you'll have to .encode() it, and I suggest using utf-8, of course.
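The steps above can be sketched roughly like this. It is only an illustration under a few assumptions: it is written for Python 3 (where the decoded object is str and the encoded one is bytes, rather than unicode and str as in 2.7), the helper names charset_from_content_type and to_utf8 are made up for this example, and falling back to utf-8 when the header declares nothing is my own choice, not a rule.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset is declared.
    return msg.get_content_charset() or default


def to_utf8(raw_bytes, content_type):
    """Decode bytes using the declared charset, then re-encode as UTF-8."""
    encoding = charset_from_content_type(content_type)
    text = raw_bytes.decode(encoding)  # bytes -> Unicode text
    return text.encode("utf-8")        # Unicode text -> UTF-8 bytes


# Hypothetical response: the word "cafe" with an accented e, sent as ISO-8859-1.
body = b"caf\xe9"
header = "text/html; charset=ISO-8859-1"
utf8_body = to_utf8(body, header)
print(utf8_body)
```

Note that the decode here is strict, so it will raise if the page lies about its encoding. That is deliberate: you want to find out, not silently store garbage.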

Now, some points have to be understood. Maybe the web page you're scraping is not encoding aware: it says it uses some encoding but doesn't stick to it. This is an error made by the webmaster, but you still have to do something about it. You have three choices:

  1. Ignore characters that can be problematic. Just quietly skip them.
  2. There are good Python libraries that try to figure out what encoding a string is using. They are very accurate but, of course, not a silver bullet. They can fail to guess, especially when the encoded data is malformed.
  3. Get angry and drop the project ;) (I really don't recommend this one)
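Here is a rough sketch of the first two options, again for Python 3 and using only the standard library. The function names are hypothetical, and guess_decode is a crude stand-in I wrote for illustration: it just tries a short list of candidate encodings in order, which is not how the real detection libraries work.

```python
def decode_lenient(raw_bytes, encoding, errors="replace"):
    """Decode without raising; bad bytes are replaced or dropped."""
    return raw_bytes.decode(encoding, errors)


def guess_decode(raw_bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Try each candidate strictly and return the first that succeeds."""
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached when every candidate fails (latin-1 never does,
    # so this matters only for a custom candidate list).
    return raw_bytes.decode(candidates[0], "replace"), candidates[0]


bad = b"caf\xe9 ok"  # not valid UTF-8: \xe9 is an accented e in Latin-1

dropped = decode_lenient(bad, "utf-8", "ignore")   # bad byte silently removed
marked = decode_lenient(bad, "utf-8", "replace")   # bad byte becomes U+FFFD
text, used = guess_decode(bad)
print(repr(dropped), repr(marked), repr(text), used)
```

The order of the candidates matters: a strict decoder such as utf-8 should come first, because single-byte encodings will happily accept almost anything.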

To get encodings right, some amount of discipline is needed from both the source and the client. You have to write your program correctly, but you also need the declared encoding and the real encoding at the source to match.
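One cheap way to notice a mismatch between the declared and the real encoding is to attempt a strict decode and see whether it fails. A minimal sketch for Python 3, with a helper name (declared_encoding_holds) that I made up:

```python
def declared_encoding_holds(raw_bytes, declared):
    """Return True if the bytes decode strictly under the declared encoding."""
    try:
        raw_bytes.decode(declared)  # errors="strict" is the default
    except (UnicodeDecodeError, LookupError):
        # LookupError covers a declared encoding name Python does not know.
        return False
    return True


honest_page = b"caf\xc3\xa9"  # really UTF-8
lying_page = b"caf\xe9"       # really Latin-1, but suppose it claims UTF-8

print(declared_encoding_holds(honest_page, "utf-8"))
print(declared_encoding_holds(lying_page, "utf-8"))
```

This check only catches lies that produce invalid byte sequences. An encoding like latin-1 accepts every possible byte, so a page that wrongly claims latin-1 will pass the check and still come out garbled.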

Python 3 improves its Unicode handling, but if you don't understand what is going on, that will probably be useless to you. The best thing you can do is understand encodings (it isn't that hard; again, read Joel!). Once you understand them, you'll be able to process text with Python 2.7, Python 3.3 and even PHP ;)
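For what it's worth, the main practical difference is that Python 3 keeps bytes and text as separate types and refuses to mix them, whereas Python 2 silently tries to decode a str with the ascii codec when you combine it with a unicode object (which is where many of the "ordinal not in range(128)" errors come from). A small Python 3 sketch with made-up variable names:

```python
data = b"caf\xc3\xa9"  # bytes, e.g. as read from the network
label = "word: "       # text (Unicode)

try:
    combined = label + data  # Python 3 refuses to mix the two types
except TypeError:
    # No implicit ASCII decode happens; you have to decode explicitly.
    combined = label + data.decode("utf-8")

print(combined)
```

So the error shows up at the exact line where bytes and text meet, regardless of what characters the data contains, instead of appearing only when a non-ASCII byte happens to come along.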

Hope this helps!
