I am using argparse
to read in arguments for my Python code. One of those inputs is the title of a file [title
] which can contain Unicode characters. I have been using 22少女時代22
as a test string.
I need to write the value of the input title
to a file, but when I try to convert the string to UTF-8
it always throws an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position 2: ordinal not in range(128)
I have been looking around and see I need my string to be in the form u"foo"
to call .encode()
on it.
When I run type()
on my input from argparse
I see:
<type 'str'>
I am looking to get a response of:
<type 'unicode'>
How can I get it in the right form?
Idea:
Modify argparse
to take in a str
but store it as a unicode string u"foo"
:
parser.add_argument(u'title', metavar='T', type=unicode, help='this will be unicode encoded.')
This approach is not working at all. Thoughts?
Edit 1:
Some sample code where title
is 22少女時代22
:
inputs = vars(parser.parse_args())
title = inputs["title"]
print type(title)
print type(u'foo')
title = title.encode('utf8') # This line throws the error
print title
It looks like your input data is in SJIS encoding (a legacy encoding for Japanese), which produces the byte 0x8f at position 2 in the bytestring:
>>> '22少女時代22'.encode('sjis')
b'22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322'
(At Python 3 prompt)
Now, I'm guessing that to "convert the string to UTF-8", you used something like
title.encode('utf8')
The problem is that title
is actually a bytestring containing the SJIS-encoded string. Due to a design flaw in Python 2, you can call encode()
directly on a bytestring, and Python first decodes it implicitly, assuming it is ASCII-encoded. So what you have is conceptually equivalent to
title.decode('ascii').encode('utf8')
and of course the decode
call fails.
You should instead explicitly decode from SJIS to a Unicode string, before encoding to UTF-8:
title.decode('sjis').encode('utf8')
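To make the failure and the fix concrete, here is a small sketch that spells out both steps explicitly. The SJIS bytes are built from a Unicode literal purely for illustration (on a real system they would arrive from the command line), the variable names are my own, and the snippet is written so that it should run on both Python 2 and Python 3.

```python
# Stand-in for the bytes received from an SJIS console (assumption: built here
# from Unicode escapes so the example does not depend on any terminal).
raw = u'22\u5c11\u5973\u6642\u4ee322'.encode('shift_jis')

# What Python 2's str.encode() does implicitly first: decode as ASCII.
try:
    raw.decode('ascii')
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first non-ASCII byte

# The explicit route: decode with the real source encoding, then encode.
utf8_bytes = raw.decode('shift_jis').encode('utf-8')

print(failed_at)  # the error in the question reports position 2
```

The same two-step pattern applies to any legacy encoding; only the codec name passed to decode() changes.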
As Mark Tolonen pointed out, you're probably typing the characters into your console, and your console encoding is a non-Unicode encoding.
So it turns out your sys.stdin.encoding
is cp932
, which is Microsoft's variant of SJIS. For this, use
title.decode('cp932').encode('utf8')
You really should set your console encoding to the standard UTF-8, but I'm not sure if that's possible on Windows. If you do, you can skip the decoding/encoding step and just write your input bytestring to the file.
Setting type=unicode
is like using unicode(arg)
which defaults to decoding with ascii
on Python 2.X. If running from the console, sys.stdin.encoding
is the encoding used for input, so something like this should work:
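import sys  # needed for sys.stdin.encoding

inputs = vars(parser.parse_args())
title = inputs["title"]
print type(title)   # <type 'str'>
print type(u'foo')  # <type 'unicode'>
title = title.decode(sys.stdin.encoding)  # decode the console bytes to unicode
print title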
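If you would rather have argparse store the decoded value directly, one possible approach is to pass a small conversion function as type= instead of unicode. The sketch below is only an illustration: text_argument is a hypothetical helper (not part of argparse), the fallback to UTF-8 when sys.stdin.encoding is unavailable is my assumption, and the argument list is passed explicitly so the example does not depend on a console. On Python 3, argparse already hands over text, so the helper returns it unchanged.

```python
import argparse
import sys


def text_argument(raw, encoding=None):
    """Hypothetical helper: decode a byte string argument to text."""
    if not isinstance(raw, bytes):
        return raw  # already text (always the case on Python 3)
    if encoding is None:
        # Assumption: fall back to UTF-8 if stdin reports no encoding.
        encoding = getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return raw.decode(encoding)


parser = argparse.ArgumentParser()
parser.add_argument('title', metavar='T', type=text_argument,
                    help='title, decoded to text')

# Explicit argument list for demonstration; normally omit it to use sys.argv.
opts = parser.parse_args([u'22\u5c11\u5973\u6642\u4ee322'])
print(type(opts.title))
```

Because the encoding is a parameter, the same helper could be bound to a different codec (for example with functools.partial) if sys.stdin.encoding turns out not to match the encoding of the command-line arguments.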
Something that should work no matter the encoding on Windows is the mbcs
encoding, which represents the current encoding used by non-Unicode Windows programs. That seems to be what argparse
is using, because sys.stdin.encoding
is the OEM console encoding which isn't always the same as the Windows encoding. On US Windows, cp437
is the console OEM encoding and cp1252
is the Windows encoding:
import argparse
import codecs
parser = argparse.ArgumentParser()
parser.add_argument(u'title', metavar='T', type=str, help='this will be unicode encoded.')
opts = parser.parse_args()
title = opts.title.decode('mbcs')
with codecs.open('out.txt', 'w', encoding='utf-8-sig') as f:
    f.write(title)
out.txt
should show the original input in Notepad.
The utf-8-sig
encoding writes the so-called byte order mark (BOM) that Windows likes at the beginning of UTF-8 files. utf-8
can be used if that is not desired, but Notepad likes it.
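As a quick way to see what the two codecs actually produce, the following sketch encodes the same title both ways in memory and checks for the BOM bytes. The variable names are mine, and the snippet should run unchanged on Python 2 and Python 3.

```python
import codecs

title = u'22\u5c11\u5973\u6642\u4ee322'

with_bom = title.encode('utf-8-sig')  # BOM followed by the UTF-8 bytes
without_bom = title.encode('utf-8')   # plain UTF-8 bytes

print(with_bom.startswith(codecs.BOM_UTF8))            # True
print(without_bom.startswith(codecs.BOM_UTF8))         # False
print(with_bom[len(codecs.BOM_UTF8):] == without_bom)  # True

# Decoding with 'utf-8-sig' strips the BOM again.
round_trip = with_bom.decode('utf-8-sig')
```

Apart from the three leading BOM bytes, the two outputs are identical, so the choice only affects how other programs detect the file's encoding.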
So, this actually works for me:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument(u'title', metavar='T', type=str, help='this will be unicode encoded.')
opts = parser.parse_args()
print opts.title.decode('utf8')
My terminal emulator (OS X Terminal.app) uses UTF-8. If your terminal is not configured for UTF-8 operation, then it won't work (and then it's a terminal problem, not a Python issue).