How do I properly work with unicode characters in

2019-05-24 03:12发布

I'm working on a python plugin for Google Quick Search Box, and it's doing some odd things with non-ascii characters. It seems like the code works fine up until I try constructing a string containing the non-ascii characters (ü has been my test character). I am using the following code snippet for the construction, with new_task as the variable that is being input from GQSB.

the_sig = ("%sapi_key%sauth_token%smethod%sname%sparse%stimeline%s" %
           (api_secret, api_key, the_token, method, new_task, doParse, timeline))

It's giving me this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

I am understanding correctly, this is because I am trying to string together a unicode character inside an ascii string. Everything I could find told me to declare the encoding at the top with this:

# -*- coding: iso-8859-15 -*-

Which I have. And when I pull the code snippet that constructs the string into a new script, it works just fine. But for some reason, int he context of the rest of the code, it fails, every time. The only thing I can think of is that it is because it's inside it's own class, but that doesn't make any sense to me.

The full code can be found on GitHub here

Thanks in advance for any help. I am stumped on this one.

3条回答
唯我独甜
2楼-- · 2019-05-24 03:17

This is a bit beyond my expertise, but I think # -*- coding: iso-8859-15 -*- at the top declares the text encoding that your Python source file is saved in.

Is it really saved in iso-8859-15?

查看更多
Melony?
3楼-- · 2019-05-24 03:19

There are a few things you should do to fix this.

  1. Convert all string literal that contain non-ASCII characters to Unicode literals. Example: u'über'.

  2. Do intermediate processing on Unicode. In other words, if you receive an encoded string (no matter the encoding), decode it to Unicode before working on it. Example:

    s = utf8_string.decode('utf8') + latin1_string.decode('latin1')
    
  3. When outputting the string or sending it somewhere, encode it with an encoding that your receiver understands. Example: send(s.encode('utf8')).

Complete example:

input1 = get_possibly_nonascii_input().decode('iso-8859-1')
input2 = get_possibly_nonascii_input().decode('iso-8859-1')
input3 = u'üvw'

s =  u'%s -> %s' % (input3, (input1 + input2).upper())

send_output(s.encode('utf8'))
查看更多
家丑人穷心不美
4楼-- · 2019-05-24 03:32

I guess you're using Python 2.x.

The file encoding declaration specifies how string literals are read by the interpreter.

You should handle all strings as unicode values, not str ones. If you read a str from the outside world, you should decode it to unicode explicitely. The same applies to outputting strings.

# -*- coding: utf-8 -*-
u_dia_str = '\xc3\xbc'   # str
lambda_unicode = u'λ'    # unicode

# input value
u_dia = u_dia_str.decode('utf-8')

sig_unicode = u'%s%s' % (u_dia, lambda_unicode)
# => u'üλ'

# output value
sig_str = sig_unicode.encode('utf-8')
# => '\xc3\xbc\xce\xbb'
查看更多
登录 后发表回答