How do I treat an ASCII string as unicode and unes

For example, if I have a unicode string, I can encode it as an ASCII string like so:

>>> u'\u003cfoo/\u003e'.encode('ascii')
'<foo/>'

However, I have e.g. this ASCII string:

'\u003cfoo/\u003e'

... that I want to turn into the same ASCII string as in my first example above:

'<foo/>'

Tags: python unicode ascii

5 answers

在下西门庆

#2 · 2019-01-07 11:54

At some point you will run into problems when the string you want to decode contains special characters such as Chinese characters or emoticons, i.e. errors that look like this:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 109-123: ordinal not in range(128)

In my case (Twitter data processing), I decoded as follows, which let me see all characters without errors:

>>> s = '\u003cfoo\u003e'
>>> s.decode( 'unicode-escape' ).encode( 'utf-8' )
'<foo>'
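
On Python 3, `str` has no `.decode` method, so the session above needs a small change. The sketch below is one possible equivalent and is not part of the original answer; the helper name `unescape` is made up for illustration. It also assumes the escapes may have come from JSON (the answer does not say so): in that case `json.loads` joins a surrogate pair such as `\ud83d\ude00` into a single character, whereas `unicode_escape` leaves two separate surrogates that cannot be encoded as UTF-8.

```python
import json

def unescape(s):
    # Hypothetical helper: turn literal \uXXXX sequences in an
    # ASCII-only str into the characters they name (Python 3).
    return s.encode('ascii').decode('unicode_escape')

s = r'\u003cfoo\u003e'
plain = unescape(s)              # a str on Python 3
as_utf8 = plain.encode('utf-8')  # bytes, safe for non-ASCII characters too

# Assumption: the text is a JSON string body, so wrap it in quotes
# and let the JSON parser handle the escapes, including surrogate pairs.
emoji = json.loads('"' + r'\ud83d\ude00' + '"')
```

If the data really is JSON, parsing the whole payload with `json.loads` is probably simpler than unescaping individual fields by hand.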


Explosion°爆炸

#3 · 2019-01-07 12:08

On Python 2.5 the correct encoding name is "unicode_escape", not "unicode-escape" (note the underscore).

I'm not sure whether newer versions of Python changed the codec name, but for me it only worked with the underscore.

Anyway, this is it.
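
For what it's worth, on Python 3 both spellings resolve to the same codec, because codec lookup treats hyphens and underscores alike. A quick check, offered as a sketch for Python 3 and not as a statement about Python 2.5:

```python
import codecs

# Look the codec up under both spellings and compare the canonical names.
hyphen = codecs.lookup('unicode-escape')
underscore = codecs.lookup('unicode_escape')
same_codec = hyphen.name == underscore.name

# Decode the same escaped bytes with each spelling.
data = b'\\u003cfoo\\u003e'
decoded_hyphen = data.decode('unicode-escape')
decoded_underscore = data.decode('unicode_escape')
```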


\"骚年 ilove

#4 · 2019-01-07 12:14

It took me a while to figure this one out, but this page had the best answer:

>>> s = '\u003cfoo/\u003e'
>>> s.decode( 'unicode-escape' )
u'<foo/>'
>>> s.decode( 'unicode-escape' ).encode( 'ascii' )
'<foo/>'

There's also a 'raw-unicode-escape' codec to handle the other way to specify Unicode strings -- check the "Unicode Constructors" section of the linked page for more details (since I'm not that Unicode-savvy).

EDIT: See also Python Standard Encodings.
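
To illustrate the difference between the two codecs, here is a small Python 3 sketch (an addition, not taken from the linked page). `unicode_escape` interprets every backslash escape, while `raw_unicode_escape` only interprets `\uXXXX` and `\UXXXXXXXX` and leaves others such as `\n` untouched.

```python
# Bytes holding literal backslash sequences, ending in backslash + 'n'.
data = rb'\u003cfoo/\u003e\n'

# unicode_escape: \u003c -> '<', \u003e -> '>', and \n -> a real newline.
full = data.decode('unicode_escape')

# raw_unicode_escape: only the \uXXXX parts are converted;
# the trailing backslash + 'n' stays as two characters.
raw = data.decode('raw_unicode_escape')
```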


我只想做你的唯一

#5 · 2019-01-07 12:15

Ned Batchelder said:

It's a little dangerous depending on where the string is coming from, but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'

Actually this method can be made safe like so:

>>> s = '\u003cfoo\u003e'
>>> s_unescaped = eval('u"""'+s.replace('"', r'\"')+'-"""')[:-1]

Mind the triple-quoted string and the dash right before the closing triple quotes.

Using a triple-quoted string ensures that if the user enters ' \\" ' (spaces added for visual clarity) in the string, it will not disrupt the evaluator. The dash at the end is a failsafe in case the user's string ends with a ' \" '. Before assigning the result, we slice off the inserted dash with [:-1].

So there is no need to worry about what users enter, as long as it is captured in raw format.
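
Note that `eval` still executes whatever expression it is handed, so the quoting has to be airtight. If building a literal is preferred over the codec approach, `ast.literal_eval` from the standard library is one way to keep the same trick while refusing anything that is not a plain literal. This is a Python 3 sketch under that assumption, not the answerer's method; `unescape_literal` is a hypothetical name.

```python
import ast

def unescape_literal(s):
    # Same construction as above: triple quotes plus a trailing dash,
    # which is sliced off again after evaluation.
    quoted = '"""' + s.replace('"', r'\"') + '-"""'
    # literal_eval only accepts literals (strings, numbers, tuples, ...).
    return ast.literal_eval(quoted)[:-1]

result = unescape_literal(r'\u003cfoo\u003e')

# Anything that is not a literal is rejected instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```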


ゆ、 Hurt°

#6 · 2019-01-07 12:16

It's a little dangerous depending on where the string is coming from, but how about:

>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'
