Convert Unicode to ASCII without errors in Python

Posted 2019-05-09 04:36


贼婆χ


My code just scrapes a web page, then converts it to Unicode.

html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

But I get a UnicodeDecodeError:

Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

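I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever bytes are causing the problem instead of getting an error?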

Answer 1:

2018 update:

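As of February 2018, compression such as gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and the Stack Exchange Network sites).
If you do a simple decode, as in the original answer, on a gzipped response, you will get an error like this one or similar to it: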

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte

To decode a gzipped response you need to import the following modules (in Python 3):

import gzip
import io
from urllib.request import urlopen  # Python 3; in Python 2 use urllib2.urlopen

Note: in Python 2 you would use StringIO instead of io.

Then you can parse the content out like this:

response = urlopen("https://example.com/gzipped-ressource")
buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource

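This code reads the response and places the bytes in a buffer. The gzip module then reads the buffer through GzipFile. After that, the gzipped file can be read back into bytes and finally decoded into normally readable text.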
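Not every response is compressed, so it can help to check before decompressing. The following is a minimal Python 3 sketch, assuming the body has already been read into bytes. The helper name `decode_body` is hypothetical and not part of any library, and utf-8 is only an assumed default. It looks for the two gzip magic bytes (0x1f 0x8b) and decompresses only when they are present.

```python
import gzip


def decode_body(raw, encoding="utf-8"):
    """Decode response bytes to text, gunzipping first if needed.

    `decode_body` is a hypothetical helper; `encoding` should be the
    real encoding of the page (utf-8 is only an assumption here).
    """
    # Every gzip stream starts with the magic bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Simulate a gzipped and a plain response body without touching the network.
plain = "caf\u00e9".encode("utf-8")
compressed = gzip.compress(plain)

text_from_gzip = decode_body(compressed)
text_from_plain = decode_body(plain)
```

The 0x8b at position 1 of the compressed bytes is the same byte that the error message above complains about, which is a quick hint that a body is gzipped rather than badly encoded.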

Original answer from 2010:

Can we get the actual value used for link?

In addition, we usually run into this problem when we try to .encode() a byte string that is already encoded. So you might try decoding it first, as in

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")

As an example:

html = '\xa0'
encoded_str = html.encode("utf8")

fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

while:

html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")

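succeeds without error. Do note that "windows-1252" is just something I used as an example. I got it from chardet, which had 0.5 confidence that it is right! (Well, given a string that is 1 character long, what do you expect?) You should change it to whatever encoding actually applies to the byte string returned from .urlopen().read() for the content you retrieved.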
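In Python 3 the same idea is more explicit, because bytes objects have no .encode() method at all. A minimal sketch, assuming the source encoding is known (windows-1252 is only the example value used above, and `transcode` is a hypothetical helper name):

```python
def transcode(raw, source_encoding, target_encoding="utf-8"):
    """Decode bytes from source_encoding, then re-encode to target_encoding."""
    text = raw.decode(source_encoding)      # bytes -> str
    return text.encode(target_encoding)     # str -> bytes


html = b"\xa0"                               # a single windows-1252 byte
encoded = transcode(html, "windows-1252")    # NO-BREAK SPACE as UTF-8

# bytes cannot be encoded directly in Python 3.
has_encode = hasattr(html, "encode")
```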

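Another problem I see is that the .encode() string method returns the modified string and does not modify the source in place. So calling self.response.out.write(html) is rather pointless, because html is not the encoded string returned by html.encode (if that is what you were originally aiming for).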
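The point about the return value can be checked directly. A small Python 3 sketch (the variable names are illustrative only):

```python
html = "caf\u00e9"                 # a str (text) object

html.encode("utf-8")               # result is discarded; html is unchanged
still_text = isinstance(html, str)

encoded = html.encode("utf-8")     # keep the returned bytes instead
```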

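As Ignacio suggested, check the source web page for the actual encoding of the string returned from read(). It is either in one of the meta tags or in the Content-Type header of the response. Then use that as the parameter for .decode().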

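Do note, however, that you should not assume other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before.)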
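One way to act on both points is to read the charset from the Content-Type value but decode defensively in case the declaration is wrong. This is a sketch in Python 3 under stated assumptions: `decode_with_declared_charset` is a hypothetical helper, and the windows-1252 fallback is an arbitrary choice. It uses `email.message.Message.get_content_charset()` from the standard library; the headers object returned by `urllib.request.urlopen` in Python 3 is a subclass of that class.

```python
from email.message import Message


def decode_with_declared_charset(raw, content_type, fallback="windows-1252"):
    """Decode raw bytes using the charset named in a Content-Type value.

    Hypothetical helper: the fallback encoding is an assumption, and
    errors="replace" guards against a declaration that does not match
    the actual bytes.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or fallback
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The header named an encoding Python does not know.
        return raw.decode(fallback, errors="replace")


text = decode_with_declared_charset(b"caf\xe9", "text/html; charset=ISO-8859-1")
no_charset = decode_with_declared_charset(b"\xa0", "text/html")
wrong_label = decode_with_declared_charset(b"caf\xe9", "text/html; charset=utf-8")
```

With a mismatched declaration the undecodable byte becomes U+FFFD instead of raising, which makes the mismatch visible in the output without crashing the handler.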

Answer 2:

>>> u'aあä'.encode('ascii', 'ignore')
'a'

Edit:

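Decode the string you get back, using either the charset in the appropriate meta tag in the response or the one in the Content-Type header, then encode it.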

The encode() method accepts values other than 'ignore'. For example: 'replace', 'xmlcharrefreplace', 'backslashreplace'. See https://docs.python.org/3/library/stdtypes.html#str.encode
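To compare these error handlers side by side, here is a short Python 3 sketch on the same sample string (in Python 3 the results are bytes objects rather than str):

```python
text = "a\u3042\u00e4"   # 'aあä'

results = {
    handler: text.encode("ascii", handler)
    for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
}
# results["ignore"]            -> b'a'
# results["replace"]           -> b'a??'
# results["xmlcharrefreplace"] -> b'a&#12354;&#228;'
# results["backslashreplace"]  -> b'a\\u3042\\xe4'
```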

Answer 3:

As an extension to Ignacio Vazquez-Abrams' answer

>>> u'aあä'.encode('ascii', 'ignore')
'a'

It is sometimes desirable to remove accents from characters and print the base form. This can be accomplished with

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'

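You may also want to translate other characters (such as punctuation) to their nearest equivalents. For instance, the RIGHT SINGLE QUOTATION MARK Unicode character does not get converted to an ASCII APOSTROPHE when encoding.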

>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"

There are more efficient ways to accomplish this, though. For more details, see the question "Where is Python's 'best ASCII for this Unicode' database?"
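One way to avoid chaining a .replace() call per character is a translation table applied before normalization. The following Python 3 sketch assumes a deliberately small, hypothetical mapping (`PUNCTUATION`) and a hypothetical helper name (`to_ascii`); neither comes from a library.

```python
import unicodedata

# Hypothetical, deliberately small mapping; extend it for your own data.
PUNCTUATION = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})


def to_ascii(text):
    """Map common punctuation to ASCII, strip accents, drop the rest."""
    text = text.translate(PUNCTUATION)
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")


sample = to_ascii("It\u2019s a \u201cna\u00efve\u201d caf\u00e9")
```

Characters with no mapping and no ASCII base form (such as あ) are still dropped silently, so this remains a lossy conversion.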

Answer 4:

Use unidecode. It converts even weird characters to ASCII instantly, and even converts Chinese to phonetic ASCII.

$ pip install unidecode

Then:

>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'

Answer 5:

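I use this helper function throughout all of my projects. If it can't convert the unicode, it ignores it. This ties into a Django library, but with a little research you could bypass it.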

from django.utils import encoding

def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')

I no longer get any Unicode errors after using this.

Answer 6:

For broken consoles like cmd.exe and for HTML output you can always use:

my_unicode_string.encode('ascii','xmlcharrefreplace')

This preserves all the non-ASCII characters while making them printable in pure ASCII and in HTML.

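Warning: if you use this in production code to avoid errors, then most likely there is something wrong in your code. The only valid use cases for this are printing to a non-Unicode console or easy conversion to HTML entities in an HTML context.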

And finally, if you are on Windows and use cmd.exe, you can type chcp 65001 to enable UTF-8 output (works with the Lucida Console font). You might need to add myUnicodeString.encode('utf8').
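To illustrate why nothing is lost with 'xmlcharrefreplace', here is a Python 3 sketch that round-trips a string through numeric character references. The sample text is made up; `html.unescape` is the standard-library function that turns the references back into characters.

```python
import html

original = "Sm\u00f8rrebr\u00f8d \u5317\u4eac"          # 'Smørrebrød 北京'
ascii_bytes = original.encode("ascii", "xmlcharrefreplace")
restored = html.unescape(ascii_bytes.decode("ascii"))
```

Here `ascii_bytes` contains only ASCII, yet `restored` equals `original`, which is the difference from 'ignore', where the dropped characters cannot be recovered.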

Answer 7:

You wrote """I assume that means the HTML contains some wrongly-formed attempt at unicode somewhere."""

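The HTML is NOT expected to contain any kind of "attempt at unicode", well-formed or not. It must of necessity contain Unicode characters encoded in some encoding, which is usually declared up front ... look for "charset".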

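You appear to be assuming that the charset is UTF-8 ... on what grounds? The "\xA0" byte shown in your error message indicates that you may have a single-byte charset, e.g. cp1252.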

If you can't get any sense out of the declaration at the start of the HTML, try using chardet to find out what the likely encoding is.
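When no detector is available, a cruder fallback is to try a short list of candidate encodings in order. This Python 3 sketch is not what chardet does; `guess_decode` and its candidate list are assumptions made for illustration, and a wrong guess can still decode without error and produce the wrong characters.

```python
def guess_decode(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) using the first candidate that decodes cleanly.

    Hypothetical helper; the candidate list is an assumption, not a
    substitute for the declared charset or a real detector.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never raises.
    return raw.decode("latin-1"), "latin-1"


text, used = guess_decode(b"price:\xa0100")
utf8_text, utf8_used = guess_decode("caf\u00e9".encode("utf-8"))
```

UTF-8 goes first because it is strict: a lone 0xa0 byte is not valid UTF-8, so the first example falls through to cp1252.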

Why have you tagged your question with "regex"?

Update after you replaced your whole question with a non-question:

html = urllib.urlopen(link).read()
# html refers to a str object. To get unicode, you need to find out
# how it is encoded, and decode it.

html.encode("utf8","ignore")
# problem 1: will fail because html is a str object;
# encode works on unicode objects so Python tries to decode it using 
# 'ascii' and fails
# problem 2: even if it worked, the result will be ignored; it doesn't 
# update html in situ, it returns a function result.
# problem 3: "ignore" with UTF-n: any valid unicode object 
# should be encodable in UTF-n; error implies end of the world,
# don't try to ignore it. Don't just whack in "ignore" willy-nilly,
# put it in only with a comment explaining your very cogent reasons for doing so.
# "ignore" with most other encodings: error implies that you are mistaken
# in your choice of encoding -- same advice as for UTF-n :-)
# "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
# Irrespective of error or not, you are probably mistaken
# (needing e.g. cp1252 or even cp850 instead) ;-)

Answer 8:

If you have a string line, you can use the .encode([encoding], [errors='strict']) string method to convert between encodings.

line = 'my big string'

line.encode('ascii', 'ignore')

For more information about handling ASCII and Unicode in Python, this is a really useful site: https://docs.python.org/2/howto/unicode.html

Answer 9:

I think the answer is there, but only in bits and pieces, which makes it difficult to quickly fix a problem such as

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

Let's take an example. Suppose I have a file with some data in the following form (containing ASCII and non-ASCII characters)

1/10/17, 21:36 - Land : Welcome ��

and we want to ignore the rest and keep only the ASCII characters.

This code will do it:

import unicodedata
fp  = open(<FILENAME>)
for line in fp:
    rline = line.strip()
    rline = unicode(rline, "utf-8")
    rline = unicodedata.normalize('NFKD', rline).encode('ascii','ignore')
    if len(rline) != 0:
        print rline

and type(rline) will give you

>>> type(rline)
<type 'str'>
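The snippet above is Python 2 (it relies on `unicode()` and the `print` statement). A rough Python 3 equivalent might look like the sketch below. It assumes the file is UTF-8; `ascii_lines` is a hypothetical helper name, the sample data is made up, and `io.BytesIO` stands in for a real file opened in binary mode.

```python
import io
import unicodedata


def ascii_lines(fp, encoding="utf-8"):
    """Yield the non-empty, ASCII-only form of each line in a binary file."""
    for raw in fp:
        # errors="replace" keeps going even if some bytes are not valid UTF-8.
        text = raw.decode(encoding, errors="replace")
        text = unicodedata.normalize("NFKD", text)
        cleaned = text.encode("ascii", "ignore").decode("ascii").strip()
        if cleaned:
            yield cleaned


# io.BytesIO stands in for open(<FILENAME>, "rb") so the sketch is self-contained.
data = "1/10/17, 21:36 - Land : Welcome \U0001f600\n\u00e9t\u00e9\n\n".encode("utf-8")
lines = list(ascii_lines(io.BytesIO(data)))
```

Each yielded item is a Python 3 `str` that contains only ASCII characters.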

Answer 10:

unicodestring = '\xa0'

decoded_str = unicodestring.decode("windows-1252")
encoded_str = decoded_str.encode('ascii', 'ignore')

Works for me

Answer 11:

Looks like you are using Python 2.x. Python 2.x defaults to ASCII and it doesn't know about Unicode. Hence the exception.

Just paste the line below after the shebang, and it will work

# -*- coding: utf-8 -*-

Source: Convert Unicode to ASCII without errors in Python

Tags: python unicode utf-8 character-encoding ascii
