When my application is launched from the right-click context menu, Windows passes the file path as a raw (byte) string.
For example:
path = 'C:\\MyDir\\\x99\x8c\x85\x8d.mp3'
Many external packages in my application expect unicode strings, so I have to convert the path into unicode.
That would be easy if I knew the raw string's encoding beforehand (in the example, it is cp1255). However, I can't know which encoding will be used locally on each computer around the world.
How can I convert the string into unicode? Perhaps using win32api is needed?
No idea why you might be getting the DOS code page (862) instead of ANSI (1255) - how is the right-click option set up?
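The code-page observation can be checked directly. The following is only a small sketch: try_decode is a hypothetical helper name, and the bytes are the file-name part of the path from the question. Decoded as cp862 they come out as the Hebrew letters U+05E9 U+05DC U+05D5 U+05DD, while cp1255 either rejects them or yields something other than Hebrew letters.

```python
# -*- coding: utf-8 -*-
RAW_NAME = b'\x99\x8c\x85\x8d'  # file-name bytes taken from the question's path


def try_decode(raw, encoding):
    """Return raw decoded with the given encoding, or None if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


as_dos = try_decode(RAW_NAME, 'cp862')    # OEM (DOS) Hebrew code page
as_ansi = try_decode(RAW_NAME, 'cp1255')  # ANSI (Windows) Hebrew code page

print(repr(as_dos))
print(repr(as_ansi))
```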
Either way - if you need to accept any arbitrary Unicode character in your arguments you can't do it from Python 2's sys.argv. This list is populated from the bytes returned by the non-Unicode version of the Win32 API (GetCommandLineA), and that encoding is never Unicode-safe.
Many other languages including Java and Ruby are in the same boat; the limitation comes from the Microsoft C runtime's implementations of the C standard library functions. To fix it, one would call the Unicode version (GetCommandLineW) on Windows instead of relying on the cross-platform standard library. Python 3 does this.
In the meantime, for Python 2, you can do it by calling GetCommandLineW yourself, but it's not especially pretty. You can also use CommandLineToArgvW if you want Windows-style parameter splitting. You can do this with the win32 extensions or just plain ctypes.
Example (though the step of encoding the Unicode string back to UTF-8 bytes is best skipped).
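For reference, here is a minimal ctypes sketch of that approach. It is not the linked example; get_unicode_argv is a hypothetical name, and the trailing-slice step assumes that the last len(sys.argv) entries of the full command line line up with sys.argv, which may not hold for every way of launching the interpreter. On non-Windows platforms it just returns sys.argv.

```python
import ctypes
import sys


def get_unicode_argv():
    """Return the command-line arguments as Unicode strings.

    On Windows the wide-character command line is fetched with
    GetCommandLineW and split with CommandLineToArgvW. Elsewhere
    sys.argv is returned unchanged.
    """
    if sys.platform != "win32":
        return list(sys.argv)

    from ctypes import wintypes

    GetCommandLineW = ctypes.windll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = wintypes.LPCWSTR

    CommandLineToArgvW = ctypes.windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [wintypes.LPCWSTR,
                                   ctypes.POINTER(ctypes.c_int)]
    CommandLineToArgvW.restype = ctypes.POINTER(wintypes.LPWSTR)

    LocalFree = ctypes.windll.kernel32.LocalFree
    LocalFree.argtypes = [ctypes.c_void_p]
    LocalFree.restype = ctypes.c_void_p

    argc = ctypes.c_int(0)
    argv = CommandLineToArgvW(GetCommandLineW(), ctypes.byref(argc))
    if not argv:
        raise ctypes.WinError()
    try:
        args = [argv[i] for i in range(argc.value)]
    finally:
        # CommandLineToArgvW allocates the array; release it with LocalFree.
        LocalFree(ctypes.cast(argv, ctypes.c_void_p))

    # The full command line also contains the interpreter and its options
    # (for example "python.exe -u script.py file.mp3"), so keep only the
    # trailing entries that correspond to sys.argv (an assumption, see above).
    return args[len(args) - len(sys.argv):]
```

Calling get_unicode_argv() at startup and using its result in place of sys.argv keeps the path as a unicode string from the start, so no guessing of code pages is needed.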
I usually use my own utility function for safe conversion from the usual code pages to unicode. To read the default OS encoding, the locale.getpreferredencoding function could probably help (http://docs.python.org/2/library/locale.html#locale.getpreferredencoding).
Example of a utility function that tries to convert to unicode by iterating over some predefined encodings:
# coding: utf-8
def to_unicode(s):
    if isinstance(s, unicode): return s
    from locale import getpreferredencoding
    for cp in (getpreferredencoding(), "cp1255", "cp1250"):
        try:
            return unicode(s, cp)
        except UnicodeDecodeError:
            pass
    raise Exception("Conversion to unicode failed")
    # or fallback like:
    # return unicode(s, getpreferredencoding(), "replace")

print (to_unicode("addđšđč枎ŠĐ"))
The fallback can be enabled by passing errors="replace" to the unicode function. Reference: http://docs.python.org/2/library/functions.html#unicode
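To illustrate what the "replace" error handler does, here is a short sketch (decode_or_replace is a hypothetical helper, written with bytes.decode so it runs on both Python 2 and Python 3). Every byte that is invalid in the chosen encoding becomes U+FFFD, so the call never raises, but the original characters are lost.

```python
def decode_or_replace(raw, encoding):
    """Decode raw bytes, substituting U+FFFD for undecodable bytes."""
    if not isinstance(raw, bytes):
        return raw  # already a unicode string, nothing to do
    return raw.decode(encoding, "replace")


# 0xFF is not valid ASCII, so it is replaced instead of raising an error.
print(repr(decode_or_replace(b"abc\xff", "ascii")))
```

This is acceptable for display purposes, but a path decoded this way may no longer match the file on disk.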
For converting back to some codepage you can check this.