How do I get Cyrillic instead of u'...'?
The code looks like this:
import codecs

def openfile(filename):
    # Decode the file as UTF-8 while reading, so raw is a unicode object.
    with codecs.open(filename, encoding="utf-8") as F:
        raw = F.read()
    # do stuff...
    print some_text
It prints:
>>>[u'.', u',', u':', u'\u0432', u'<', u'>', u'(', u')', u'\u0437', u'\u0456']
It looks like some_text is a list of unicode objects. When you print such a list, it prints the reprs of the elements inside the list. So instead try:
print(u''.join(some_text))
The join method concatenates the elements of some_text, using the empty string u'' as the separator between the elements. The result is one unicode object.
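As a minimal sketch of the join described above, assuming some_text holds exactly the list shown in the question's output:

```python
# The list of one-character unicode strings from the question's output.
some_text = [u'.', u',', u':', u'\u0432', u'<', u'>', u'(', u')',
             u'\u0437', u'\u0456']

# Joining on the empty string yields a single unicode object, so printing
# it no longer goes through the repr of each list element.
joined = u''.join(some_text)
```

Printing joined should then show .,:в<>()зі, provided the console's encoding can represent those characters.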
It's not clear to me where some_text comes from (you cut out that bit of your code), so I have no idea why it prints as a list of characters rather than a string.
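But you should be aware that Python 2 has to encode unicode strings when you print them to the terminal, and it falls back to ASCII when it cannot determine the terminal's encoding (for example, when output is piped or the locale is not set). If you want them to be encoded in some other encoding, you can do that explicitly: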
>>> text = u'\u0410\u0430\u0411\u0431'
>>> print text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
>>> print text.encode('utf8')
АаБб
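The explicit encoding step can be wrapped in a small helper. This is only a sketch: encode_for_output is a hypothetical name, and falling back to UTF-8 and replacing unencodable characters with "?" are assumptions, not something the question specifies.

```python
import sys

def encode_for_output(text, encoding=None, fallback='utf-8'):
    """Encode a unicode string to bytes before writing it out.

    If no encoding is given, use the one reported by sys.stdout; that
    attribute can be None (e.g. when output is piped), so fall back to
    the assumed default in that case.
    """
    if encoding is None:
        encoding = getattr(sys.stdout, 'encoding', None) or fallback
    # 'replace' substitutes '?' for characters the target encoding lacks,
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, 'replace')
```

In Python 2, print encode_for_output(text) writes the resulting bytes as-is; whether they display as Cyrillic still depends on the terminal interpreting them with the same encoding.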
u'\uNNNN' is the ASCII-safe escaped form of a unicode string literal. For example, u'\u0437' is the same string as u'з':
>>> print u'\u0437'
з
However, this will only display correctly if your console supports the character you are trying to print. Trying the above in the console on a Western European Windows install fails:
>>> print u'\u0437'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0437' in position 0: character maps to <undefined>
Because getting the Windows console to output Unicode is tricky, Python 2's repr function always opts for the ASCII-safe literal version.
Your print statement is outputting the repr version and not printing characters directly because you've got them inside a list of characters instead of a string. If you did print on each of the members of the list, you'd get the characters output directly and not represented as u'...' string literals.
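To look at the ASCII-safe escaped form without depending on what a particular console can display, one option is the standard unicode_escape codec. This is a sketch only: the helper names are hypothetical, and the codec is not the same thing as repr, although it produces the same \uNNNN escapes for characters like these.

```python
def ascii_safe(text):
    """Return the backslash-escaped, ASCII-only bytes for a unicode string."""
    return text.encode('unicode_escape')

def from_ascii_safe(escaped):
    """Turn the escaped bytes back into the original unicode string."""
    return escaped.decode('unicode_escape')
```

Plain ASCII characters pass through unchanged, while a character such as u'\u0437' comes out as the six ASCII characters \u0437, and decoding those gives back the original one-character string.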