All my scripts use Unicode literals throughout, with

    from __future__ import unicode_literals

but this creates a problem whenever a function might be called with a bytestring, and I'm wondering what the best approach is for handling this and producing clear, helpful errors.
I gather that one common approach, which I've adopted, is simply to make the requirement explicit when it applies, with something like

    def my_func(somearg):
        """The 'somearg' argument must be Unicode."""
        if not isinstance(somearg, unicode):
            raise TypeError("Parameter 'somearg' should be a Unicode string")
        # ...
for all arguments that need to be Unicode (and might be bytestrings). However, even if I do this, my argparse command-line script runs into problems whenever a supplied parameter corresponds to such an argument, and I wonder what the best approach here is. It seems that I could simply check the encoding of such arguments and decode them using that encoding, for example:
    if __name__ == '__main__':
        parser = argparse.ArgumentParser(...)
        parser.add_argument('somearg', ...)
        # ...
        args = parser.parse_args()
        some_arg = args.somearg
        if not isinstance(some_arg, unicode):
            some_arg = some_arg.decode(sys.getfilesystemencoding())
        # ...
        my_func(some_arg, ...)
Is this combination of approaches a common design pattern for Unicode modules that may receive bytestring inputs? Specifically:

- can I reliably decode command-line arguments in this way,
- will sys.getfilesystemencoding() give me the correct encoding for command-line arguments, or
- does argparse provide some built-in facility for accomplishing this that I've missed?
sys.getfilesystemencoding() is the correct encoding (but see the examples below) for OS data such as filenames, environment variables, and command-line arguments. You can see the logic behind the choice: sys.argv[0] may be the path to the script (a filename), so it is natural to assume that it uses the same encoding as other filenames, and that the other items in the argv list use the same character encoding as sys.argv[0]; likewise, os.environ['PATH'] contains paths, so it is natural that environment variables use the same encoding.

Note: sys.argv[0] is the script filename whatever other command-line arguments you might have.

The "best way" depends on your specific use case; e.g., on Windows, you should probably use the Unicode API directly (CommandLineToArgvW()). On POSIX, if all you need is to pass some argv items back to OS functions (such as os.listdir()), then you could leave them as bytes: a command-line argument can be an arbitrary byte sequence; see PEP 0383 -- Non-decodable Bytes in System Character Interfaces. POSIX allows any bytes (except zero) to be passed, as the sketch below shows:
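Since the demonstration itself didn't survive here, a minimal sketch of the point, assuming a POSIX system with a python2 executable on PATH:

    # Pass a byte sequence that is not valid in any common text encoding;
    # execve() accepts it, because POSIX argv is just bytes (minus NUL).
    import subprocess

    out = subprocess.check_output(
        ['python2', '-c', 'import sys; print(repr(sys.argv[1]))', '\xff\xfe'])
    print(out)  # -> '\xff\xfe' -- the bytes arrive unchanged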
Obviously, you can also misconfigure your environment:
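The shell session that belongs here is missing; a hedged reconstruction of the kind of demo meant (it assumes a UTF-8 terminal, a POSIX system, and python2 on PATH):

    # Simulate a UTF-8 terminal sending the euro sign while the locale and
    # PYTHONIOENCODING claim something else entirely.
    import os
    import subprocess

    env = dict(os.environ, LANG='C', PYTHONIOENCODING='latin-1')
    out = subprocess.check_output(
        ['python2', '-c', 'import sys; print(repr(sys.argv[1]))',
         u'\u20ac'.encode('utf-8')],  # the bytes a UTF-8 terminal would send
        env=env)
    print(out)  # -> '\xe2\x82\xac' -- still UTF-8, despite LANG/PYTHONIOENCODING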
The output shows that € is encoded using UTF-8 even though both the locale and PYTHONIOENCODING are configured differently.

The examples demonstrate that sys.argv may be encoded using a character encoding that does not correspond to any standard encoding, or it may even contain arbitrary binary data on POSIX (anything except a zero byte; no character encoding at all). On Windows, I guess, you could paste a Unicode string that can't be encoded using the ANSI or OEM Windows encodings, but you might still get the correct value through the Unicode API (Python 2 probably drops data here).

Python 3 uses a Unicode sys.argv, so it shouldn't lose data on Windows (the Unicode API is used there), and it makes it possible to demonstrate that sys.getfilesystemencoding(), not sys.stdin.encoding, is used to decode sys.argv on Linux (where sys.getfilesystemencoding() is derived from the locale):
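The session that belongs here is also missing; a hedged sketch of what it likely showed, assuming python3 on PATH and an en_US.ISO-8859-1 locale generated on the system:

    # Run Python 3 under a Latin-1 locale while the (UTF-8) terminal sends
    # the euro sign: sys.argv is decoded with the locale encoding, so the
    # three UTF-8 bytes come out as three Latin-1 characters.
    import os
    import subprocess

    env = dict(os.environ, LANG='en_US.ISO-8859-1')
    out = subprocess.check_output(
        ['python3', '-c', 'import sys; print(ascii(sys.argv[1]))',
         u'\u20ac'.encode('utf-8')],
        env=env)
    print(out)  # -> '\xe2\x82\xac' -- each UTF-8 byte decoded as Latin-1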
The output shows that LANG, which defines the locale and therefore, in this case, sys.getfilesystemencoding() on Linux, is what is used to decode the command-line arguments.

I don't think getfilesystemencoding will necessarily get the right encoding for the shell; it depends on the shell (and can be customised by the shell, independently of the filesystem). The filesystem encoding is only concerned with how non-ASCII filenames are stored. Instead, you should probably be looking at sys.stdin.encoding, which will give you the encoding for standard input.

Additionally, you might consider using the type keyword argument when you add an argument.
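Demo (a hedged reconstruction, since the original snippet is missing; it assumes Python 2 and that stdin is attached to a terminal, otherwise sys.stdin.encoding can be None):

    # Decode each argument to unicode as argparse parses it, using the
    # terminal's encoding rather than the filesystem encoding.
    import sys
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('somearg',
                        type=lambda s: s.decode(sys.stdin.encoding))
    args = parser.parse_args()
    print(repr(args.somearg))  # e.g. u'\xe7' when a Latin-1 terminal sends ç

Note that UnicodeDecodeError is a subclass of ValueError, so if decoding fails inside the type callable, argparse reports it as an invalid value and exits rather than showing a traceback.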
If you have to work with non-ASCII data a lot, I would highly recommend upgrading to Python 3. Everything is a lot easier there; for example, parsed arguments will already be Unicode on Python 3.
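For illustration, a tiny sketch of the Python 3 behaviour (run under python3; the argument value here is made up):

    # Python 3: sys.argv is already str (Unicode), so parsed arguments
    # need no manual decoding.
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('somearg')
    args = parser.parse_args(['caf\xe9'])  # 'café' -- already text, not bytes
    assert isinstance(args.somearg, str)
    print(args.somearg)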
Since there is conflicting information around about the command-line argument encoding, I decided to test it by changing my shell encoding to Latin-1 whilst leaving the filesystem encoding as UTF-8. For my tests I use the c-cedilla character (ç), which has a different encoding in these two: the single byte 0xE7 in Latin-1, versus the two bytes 0xC3 0xA7 in UTF-8.

Now I create an example script:
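The script itself didn't survive the formatting here; the following is a hypothetical reconstruction consistent with the output described below (the name test.py and the three-argument layout are my assumptions):

    # test.py -- decode the same character three ways (Python 2):
    # argv[1] is shown raw, argv[2] is decoded with the filesystem
    # encoding, and argv[3] with the terminal's stdin encoding.
    import sys

    def show(label, func):
        try:
            print('%s: %r' % (label, func()))
        except UnicodeDecodeError as e:
            print('%s: failed to decode (%s)' % (label, e))

    show('raw bytes     ', lambda: sys.argv[1])
    show('fs encoding   ', lambda: sys.argv[2].decode(sys.getfilesystemencoding()))
    show('stdin encoding', lambda: sys.argv[3].decode(sys.stdin.encoding))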
Then I change my shell encoding to ISO/IEC 8859-1 (Latin-1) and call the script:
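A hedged sketch of the resulting session (the prompt and exact error text are illustrative; the terminal now sends ç as the single Latin-1 byte 0xE7):

    $ python test.py ç ç ç
    raw bytes     : '\xe7'
    fs encoding   : failed to decode ('utf8' codec can't decode byte 0xe7 in position 0: unexpected end of data)
    stdin encoding: u'\xe7'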
As you can see, the command-line arguments were encoded in Latin-1, so the second command-line argument (decoded using sys.getfilesystemencoding()) fails to decode, while the third command-line argument (decoded using sys.stdin.encoding) decodes correctly.