Python raw_input odd behavior with input containing accents

I'm writing a program that asks the user for input that contains accents. The user input string is tested to see if it matches a string declared in the program. As you can see below, my code is not working:

code

# -*- coding: utf-8 -*-

testList = ['má']
myInput = raw_input('enter something here: ')

print myInput, repr(myInput)
print testList[0], repr(testList[0])
print myInput in testList

output in eclipse with pydev

enter something here: má
m√° 'm\xe2\x88\x9a\xc2\xb0'
má 'm\xc3\xa1'
False

output in IDLE

enter something here: má
má u'm\xe1'
má 'm\xc3\xa1'

Warning (from warnings module):
  File "/Users/ryanculkin/Desktop/delete.py", line 8
    print myInput in testList
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False

How can I get my code to print True when comparing the two strings?

Additionally, I note that the result of running this code on the same input is different depending on whether I use eclipse or IDLE. Why is this? My eventual goal is to put my program on the web; is there anything that I need to be aware of, since the result seems to be so volatile?

Tags: python unicode diacritics raw-input

3 Answers

爷、活的狠高调

#2 · 2019-05-11 05:53

Just to note, you see a difference between IDLE and PyDev because PyDev sets PYTHONIOENCODING to the encoding chosen in your launch configuration > common > encoding. It also calls sys.setdefaultencoding with that encoding (it has a custom sitecustomize.py).
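The garbled Eclipse output in the question looks like a codec mismatch of exactly this kind. As a rough sketch, assuming the console sent UTF-8 bytes and something on the way decoded them as Mac Roman (that codec is my guess, chosen because it reproduces the bytes shown in the question), the round trip can be replayed like this:

```python
# Replay the Eclipse output from the question.
# Assumption: UTF-8 bytes were misread as Mac Roman somewhere in the console setup.

typed = u'm\xe1'                            # the text the user typed (m + a-acute)
utf8_bytes = typed.encode('utf-8')          # m, 0xC3, 0xA1
misread = utf8_bytes.decode('mac_roman')    # wrong codec: every byte becomes one character
round_trip = misread.encode('utf-8')        # the bytes raw_input ended up returning

print(repr(utf8_bytes))    # same bytes as the literal in testList
print(repr(round_trip))    # same bytes as the first repr in the Eclipse output
```

If that is what happened, setting the launch configuration encoding to UTF-8 should make the two byte strings identical.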


别忘想泡老子

#3 · 2019-05-11 05:58

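What you're running into is a comparison between a byte string and a Unicode string. raw_input normally gives you a byte string, although in your IDLE run it returned a Unicode string (u'm\xe1') while the literal stayed a byte string ('m\xc3\xa1'). Either way, Python 2 tries to convert them to a common type to compare by decoding the byte string with its default ASCII codec, and this fails because the bytes are not ASCII and Python can't guess their real encoding - so, your solution is to do the conversion explicitly.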

As a rule, you should keep all strings in your program floating around as unicode strings - anything that you read in as bytes convert to unicode straight away; anything you have as a literal in your program, make it a unicode literal unless it explicitly needs to be a bytestring for some reason. This results in the unicode sandwich, which will generally make your life easier.
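A minimal sketch of that sandwich, assuming UTF-8 at both boundaries. The helper names (to_text, to_bytes, is_known) and the KNOWN_WORDS list are made up for illustration:

```python
KNOWN_WORDS = [u'm\xe1']   # literals kept as unicode text (m + a-acute)

def to_text(raw, encoding='utf-8'):
    """Input boundary: decode bytes once, pass text through unchanged."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw

def to_bytes(text, encoding='utf-8'):
    """Output boundary: encode only when the data leaves the program."""
    return text.encode(encoding)

def is_known(raw, encoding='utf-8'):
    """Middle of the sandwich: compare text with text, never bytes with text."""
    return to_text(raw, encoding) in KNOWN_WORDS

print(is_known(b'm\xc3\xa1'))   # UTF-8 bytes, as a terminal would deliver them
print(is_known(u'm\xe1'))       # already text, as IDLE delivers it
```

Both calls take the same path after the first line of is_known, which is the point of decoding at the edge.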

For the literals, you either want to declare your strings as u'má', or have:

from __future__ import unicode_literals

near the top of your script to make un-prefixed strings unicode. Note that the repr of your literal, 'm\xc3\xa1', has no u prefix, so in the code you posted it is still a byte string.

To read a unicode string in, you need to realise that raw_input gives you a bytestring - so, you need to convert it using its .decode method. You need to pass .decode the encoding of your STDIN - which is available as sys.stdin.encoding (don't just assume that this is UTF8 - it often will be, but not always) - so, the whole line will be:

import sys
string = raw_input(...).decode(sys.stdin.encoding)
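If you want something a bit more defensive, here is a sketch of a wrapper. The name decode_input and the UTF-8 fallback are my own choices, not anything standard; the fallback covers the case where sys.stdin.encoding is None (which can happen when stdin is a pipe), and the isinstance check covers environments such as IDLE that already hand back unicode:

```python
import sys

def decode_input(raw, encoding=None, fallback='utf-8'):
    """Return unicode text for whatever raw_input() (or input()) handed back."""
    if not isinstance(raw, bytes):
        return raw                      # already text, nothing to decode
    # Prefer an explicit encoding, then the console's, then the fallback.
    encoding = encoding or getattr(sys.stdin, 'encoding', None) or fallback
    return raw.decode(encoding)

# Usage with bytes standing in for what a terminal would deliver:
print(decode_input(b'm\xc3\xa1', encoding='utf-8') == u'm\xe1')
```

In the script from the question this would wrap the raw_input call, e.g. decode_input(raw_input('enter something here: ')).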

But by far the easiest way around this is to upgrade to Python 3 if you can - there, input() (which behaves like the Py2 raw_input otherwise) gives you a unicode string (it calls .decode for you so you don't have to remember it), and unprefixed strings are unicode strings by default. Which all makes for a much easier time working with accented characters - it essentially implies that the logic you were trying would just work in Py3, since it does the right thing.

Note, however, that the error you're seeing would still manifest in Py3 - but since it does the right thing by default, you have to work hard to run into it. But if you did, the comparison would just be False, with no warning - Py3 doesn't ever try to implicitly convert between byte and unicode strings, so any byte string will always compare unequal to any unicode string, and trying to order them will throw an exception.
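To see that behaviour for yourself, here is a small Python 3 only sketch (the variable names are just for illustration):

```python
data = b'm\xc3\xa1'    # bytes: UTF-8 encoding of m + a-acute
text = 'm\xe1'         # str: the same word as text

equal_raw = (data == text)                      # bytes vs str: never equal
equal_decoded = (data.decode('utf-8') == text)  # text vs text

try:
    data < text
    ordering_failed = False
except TypeError:
    ordering_failed = True   # Python 3 refuses to order bytes against str

print(equal_raw, equal_decoded, ordering_failed)
```

By default the equality test is silent; running the interpreter with the -b flag makes it emit a BytesWarning instead, which can help track down a stray byte string.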


霸刀☆藐视天下

#4 · 2019-05-11 06:16

One option is to strip the accents from the characters, as done in "What is the best way to remove accents in a python unicode string?". After reading further elsewhere, I found that you can put # -*- coding: utf-8 -*- right after the #!/usr/bin/python line. This declares the encoding of the source file (it does not by itself turn plain literals into unicode objects), which may help. In that case you may need to run s = raw_input().decode('utf8') to get the correct unicode.
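For the accent-stripping route, a minimal sketch using the standard unicodedata module could look like this. The function name strip_accents is mine, and the approach (decompose, then drop combining marks) is one common way to do it, not necessarily the one from the linked question:

```python
import unicodedata

def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents(u'm\xe1') == u'ma')
```

Bear in mind this makes 'ma' and the accented form compare equal, which may or may not be what you want, and the input still has to be decoded to unicode first.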
