How do I treat an ASCII string as unicode and unes

Posted 2019-01-07 11:33

For example, if I have a unicode string, I can encode it as an ASCII string like so:

>>> u'\u003cfoo/\u003e'.encode('ascii')
'<foo/>'

However, I have e.g. this ASCII string:

'\u003cfoo/\u003e'

... that I want to turn into the same ASCII string as in my first example above:

'<foo/>'

5 Answers
在下西门庆
Reply #2 · 2019-01-07 11:54

At some point you will run into errors when the string you want to decode contains special characters such as Chinese characters or emoticons, i.e. errors that look like this:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 109-123: ordinal not in range(128)

In my case (Twitter data processing), I decoded as follows, which let me see all characters without errors:

>>> s = '\u003cfoo\u003e'
>>> s.decode( 'unicode-escape' ).encode( 'utf-8' )
'<foo>'
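The transcript above is Python 2. As a rough Python 3 sketch of the same idea (the helper name `unescape_to_utf8` and the sample string are made up for illustration, and the input is assumed to be pure ASCII with literal `\uXXXX` sequences), something like this should work:

```python
def unescape_to_utf8(escaped):
    # Assumption: `escaped` is a str containing only ASCII characters,
    # with non-ASCII text written as literal \uXXXX escape sequences.
    text = escaped.encode("ascii").decode("unicode_escape")
    # Encoding to UTF-8 (rather than ASCII) avoids UnicodeEncodeError
    # for characters outside the ASCII range.
    return text.encode("utf-8")


sample = "\\u003cfoo\\u003e \\u4e2d\\u6587"  # hypothetical input
utf8_bytes = unescape_to_utf8(sample)
print(utf8_bytes)  # b'<foo> \xe4\xb8\xad\xe6\x96\x87'
```

Encoding the same decoded text with 'ascii' instead would raise the UnicodeEncodeError shown above, because the two Chinese characters have no ASCII representation.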
Explosion°爆炸
Reply #3 · 2019-01-07 12:08

On Python 2.5, the encoding name that worked for me was "unicode_escape", not "unicode-escape" (note the underscore).

I'm not sure whether newer versions of Python changed the codec name, but here it only worked with the underscore.

Either way, that is the fix.
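If you want to check which spelling your interpreter accepts, a quick sketch like the following may help. It assumes a Python 3 interpreter, where as far as I know both spellings resolve to the same codec; it says nothing about what Python 2.5 does.

```python
import codecs

# Look the codec up under both spellings; codecs.lookup raises
# LookupError if a name is not recognised by this interpreter.
hyphen = codecs.lookup("unicode-escape")
underscore = codecs.lookup("unicode_escape")
same_codec = hyphen.name == underscore.name

# Decode the example from the question with the underscore spelling.
decoded = b"\\u003cfoo/\\u003e".decode("unicode_escape")
print(same_codec, decoded)
```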

\"骚年 ilove
Reply #4 · 2019-01-07 12:14

It took me a while to figure this one out, but this page had the best answer:

>>> s = '\u003cfoo/\u003e'
>>> s.decode( 'unicode-escape' )
u'<foo/>'
>>> s.decode( 'unicode-escape' ).encode( 'ascii' )
'<foo/>'

There's also a 'raw-unicode-escape' codec to handle the other way to specify Unicode strings. Check the "Unicode Constructors" section of the linked page for more details (since I'm not that Unicode-savvy).

EDIT: See also Python Standard Encodings.
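To illustrate how the two codecs differ, here is a small Python 3 sketch (the sample bytes are made up; in Python 3 the escaped text has to be a bytes object before it can be decoded). As I understand it, 'unicode_escape' interprets all the usual backslash escapes, while 'raw_unicode_escape' only interprets `\uXXXX` and `\UXXXXXXXX` and leaves other backslashes alone:

```python
# Hypothetical input: literal \u escapes followed by a literal backslash-n.
raw = b"\\u003cfoo/\\u003e\\n"

# unicode_escape turns \n into a real newline as well.
with_unicode_escape = raw.decode("unicode_escape")

# raw_unicode_escape leaves the backslash and the 'n' untouched.
with_raw_unicode_escape = raw.decode("raw_unicode_escape")

print(repr(with_unicode_escape))      # '<foo/>\n'
print(repr(with_raw_unicode_escape))  # '<foo/>\\n'
```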

我只想做你的唯一
Reply #5 · 2019-01-07 12:15

Ned Batchelder said:

It's a little dangerous depending on where the string is coming from, but how about:

>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'

Actually this method can be made safe like so:

>>> s = '\u003cfoo\u003e'
>>> s_unescaped = eval('u"""'+s.replace('"', r'\"')+'-"""')[:-1]

Note the triple-quoted string and the dash right before the closing triple quote.

  1. Using a triple-quoted string ensures that if the user enters ' \\" ' (spaces added for visual clarity) in the string, it will not disrupt the evaluator;
  2. The dash at the end is a failsafe in case the user's string ends with ' \" '. Before assigning the result, we slice off the inserted dash with [:-1].

So there should be no need to worry about what users enter, as long as it is captured in raw format.
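If you would rather not call eval at all, one alternative (not part of the answer above, and swapped in here only as a sketch) is to replace just the `\uXXXX` sequences with a regular expression. The name `unescape_u` is hypothetical, the code targets Python 3, and it deliberately ignores other escapes, escaped backslashes, and surrogate pairs:

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_u(text):
    # Replace each \uXXXX sequence with the character it names.
    # Nothing is evaluated, so quotes in the input need no special handling.
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(unescape_u("\\u003cfoo/\\u003e"))  # <foo/>
```

Because the input is never parsed as Python source, there is nothing for the user's text to break out of, at the cost of supporting fewer escape forms than the 'unicode-escape' codec.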

ゆ 、 Hurt°
Reply #6 · 2019-01-07 12:16

It's a little dangerous depending on where the string is coming from, but how about:

>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'