I'm writing a program that asks the user for input that contains accents. The user input string is tested to see if it matches a string declared in the program. As you can see below, my code is not working:
code
# -*- coding: utf-8 -*-
testList = ['má']
myInput = raw_input('enter something here: ')
print myInput, repr(myInput)
print testList[0], repr(testList[0])
print myInput in testList
output in eclipse with pydev
enter something here: má
m√° 'm\xe2\x88\x9a\xc2\xb0'
má 'm\xc3\xa1'
False
output in IDLE
enter something here: má
má u'm\xe1'
má 'm\xc3\xa1'
Warning (from warnings module):
File "/Users/ryanculkin/Desktop/delete.py", line 8
print myInput in testList
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False
How can I get my code to print True when comparing the two strings?
Additionally, I note that the result of running this code on the same input is different depending on whether I use eclipse or IDLE. Why is this? My eventual goal is to put my program on the web; is there anything that I need to be aware of, since the result seems to be so volatile?
Just to note, you see a difference between IDLE and PyDev because PyDev sets PYTHONIOENCODING to the encoding in your launch configuration > common > encoding. It also calls sys.setdefaultencoding with that encoding (it has a custom sitecustomize.py).
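To make the Eclipse/PyDev output a little less mysterious, here is a small sketch of one way those exact bytes could arise. It assumes (this is a guess based on the printed repr, not something stated in the question) that the UTF-8 bytes from the console were read as Mac OS Roman and then re-encoded as UTF-8. The variable names are made up for illustration.

```python
# Sketch: reproduce the bytes from the Eclipse/PyDev run by decoding with the
# wrong codec. The Mac OS Roman step is an assumption, not a confirmed diagnosis.
typed = u'm\xe1'                            # the text the user typed: "m" + a-acute

utf8_bytes = typed.encode('utf-8')          # same bytes as the literal in the script
misread = utf8_bytes.decode('mac_roman')    # wrong codec: every byte becomes a character
seen_by_program = misread.encode('utf-8')   # re-encoded on the way into raw_input

print(repr(utf8_bytes))
print(repr(seen_by_program))
print(seen_by_program == utf8_bytes)        # the two byte strings no longer match
```

If that reading is right, the comparison fails in Eclipse because the input was mis-decoded before the script ever saw it, and not because of any type mismatch. Setting the launch configuration encoding to match the console should then be enough to fix that run.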
What you're running into is that
raw_input
gives you a byte string, but the string you're comparing against is a Unicode string. Python 2 tries to convert them to a common type to compare, but this fails because it can't guess the encoding of the byte string, so your solution is to do the conversion explicitly.
As a rule, you should keep all strings in your program as unicode strings: anything that you read in as bytes, convert to unicode straight away; anything you have as a literal in your program, make a unicode literal unless it explicitly needs to be a bytestring for some reason. This results in the unicode sandwich (bytes only at the edges, unicode everywhere in between), which will generally make your life easier.
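A minimal sketch of the unicode sandwich might look like the following. The helper names (decode_input, encode_output) and the INPUT_ENCODING constant are hypothetical, and the input bytes are hard-coded so the example does not depend on any particular console. It is written to run on Python 3 and should also parse on Python 2.

```python
# Unicode sandwich sketch: bytes only at the edges, unicode text in the middle.
# INPUT_ENCODING is an assumed value; in a real program it would come from
# wherever the bytes originate (for a console, sys.stdin.encoding).
INPUT_ENCODING = 'utf-8'

test_list = [u'm\xe1']                    # unicode literal for "m" + a-acute


def decode_input(raw_bytes, encoding=INPUT_ENCODING):
    """Input edge: turn incoming bytes into unicode text straight away."""
    return raw_bytes.decode(encoding)


def encode_output(text, encoding=INPUT_ENCODING):
    """Output edge: turn unicode text back into bytes only when writing."""
    return text.encode(encoding)


raw = b'm\xc3\xa1'                        # what a UTF-8 terminal sends for that text
text = decode_input(raw)                  # unicode from here on
is_known = text in test_list              # unicode compared with unicode
reply = encode_output(text.upper())       # bytes again, only at the output edge

print(is_known)
print(repr(reply))
```

The point of the design is that the membership test never sees a byte string, so Python has nothing to guess about.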
For the literals, you either want to declare your strings with a u prefix, as in
u'má'
or have
from __future__ import unicode_literals
near the top of your script to make un-prefixed strings unicode. The error you're getting implies you've already done this bit.
To read a unicode string in, you need to realise that
raw_input
gives you a bytestring, so you need to convert it using its .decode method. You need to pass .decode the encoding of your STDIN, which is available as sys.stdin.encoding (don't just assume that this is UTF-8; it often will be, but not always). So, after import sys, the whole line will be:
myInput = raw_input('enter something here: ').decode(sys.stdin.encoding)
But by far the easiest way around this is to upgrade to Python 3 if you can - there,
input()
(which otherwise behaves like the Py2 raw_input) gives you a unicode string (it calls .decode for you so you don't have to remember it), and unprefixed strings are unicode strings by default. All of this makes for a much easier time working with accented characters: the logic you were trying would essentially just work in Py3, since it does the right thing.
Note, however, that the error you're seeing could still manifest in Py3, but since it does the right thing by default, you have to work hard to run into it. If you did, the comparison would just be False, with no warning: Py3 never tries to implicitly convert between byte and unicode strings, so any byte string always compares unequal to any unicode string, and trying to order them throws a TypeError.
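The Python 3 behaviour described above can be checked with a short sketch. This block targets Python 3 only; the variable names are illustrative, and the bytes are assumed to be UTF-8 so that the example does not depend on the console.

```python
# Python 3 sketch: bytes and str are never converted into each other implicitly.
typed_bytes = b'm\xc3\xa1'   # UTF-8 bytes for "m" + a-acute
wanted = 'm\xe1'             # str (unicode text) for the same characters

# Equality across the two types is simply False.
mixed_equal = (typed_bytes == wanted)

# Ordering across the two types raises TypeError.
try:
    typed_bytes < wanted
    ordering_error = None
except TypeError as exc:
    ordering_error = exc

# Decoding explicitly (here with an assumed UTF-8 encoding) makes the
# comparison meaningful again.
decoded_equal = (typed_bytes.decode('utf-8') == wanted)

print(mixed_equal)
print(type(ordering_error).__name__)
print(decoded_equal)
```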
One option is to strip away the characters' accents, as done in: What is the best way to remove accents in a python unicode string? After reading in other places I found that you can put the line
# -*- coding: utf-8 -*-
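right after the #!/usr/bin/python line, so that Python reads the source file as UTF-8, which may help (it declares the file's encoding; it does not by itself turn plain string literals into unicode). In that case you may need to run
s = raw_input().decode('utf8')
to get the correct unicode.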
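For the accent-stripping idea, a minimal sketch using the standard unicodedata module could look like this. The function names strip_accents and loosely_equal are hypothetical, and the approach (decompose with NFD, then drop combining marks) is one common technique, not necessarily the one used in the linked question. It expects unicode text, so any input bytes need decoding first.

```python
import unicodedata


def strip_accents(text):
    """Return unicode text with combining marks (accents) removed."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def loosely_equal(a, b):
    """Compare two unicode strings while ignoring accents."""
    return strip_accents(a) == strip_accents(b)


print(strip_accents(u'm\xe1'))              # "m" + a-acute with the accent dropped
print(loosely_equal(u'm\xe1', u'ma'))
```

Note that this deliberately treats accented and unaccented spellings as the same word, which may or may not be what you want. If the accents matter, decoding the input correctly and comparing unicode with unicode is the safer route.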