Convert raw byte string to Unicode without knowing

Posted 2020-04-19 19:32

Question:

When my script is launched from the right-click context menu, Windows passes the file path as a raw (byte) string.

For example:

path = 'C:\\MyDir\\\x99\x8c\x85\x8d.mp3'

Many external packages in my application expect unicode strings, so I have to convert the path to unicode.

That would be easy if we knew the raw string's encoding beforehand (in the example, it is cp1255). However, I can't know which encoding will be used locally on each computer around the world.

How can I convert the string into unicode? Perhaps using win32api is needed?

Answer 1:

No idea why you might be getting the DOS code page (862) instead of ANSI (1255) - how is the right-click option set up?
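To see why the bytes look like the DOS code page rather than the ANSI one, the file name from the question can be decoded both ways. This is only an illustrative sketch: `raw_name` and `try_decode` are names made up here, and the snippet assumes the standard cp862 and cp1255 codecs that ship with Python.

```python
# -*- coding: utf-8 -*-
# File-name bytes copied from the path in the question.
raw_name = b"\x99\x8c\x85\x8d"

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# cp862 (DOS/OEM Hebrew) maps 0x80-0x9A onto the Hebrew alphabet,
# so these four bytes become four Hebrew letters.
as_oem = try_decode(raw_name, "cp862")

# cp1255 (ANSI Hebrew) assigns different meanings to the same byte values.
as_ansi = try_decode(raw_name, "cp1255")

print(repr(as_oem))
print(repr(as_ansi))
```

Under cp862 the bytes decode to U+05E9 U+05DC U+05D5 U+05DD, which is a real Hebrew word; comparing that with the cp1255 result shows which code page actually produced the path.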

Either way - if you need to accept any arbitrary Unicode character in your arguments you can't do it from Python 2's sys.argv. This list is populated from the bytes returned by the non-Unicode version of the Win32 API (GetCommandLineA), and that encoding is never Unicode-safe.

Many other languages including Java and Ruby are in the same boat; the limitation comes from the Microsoft C runtime's implementations of the C standard library functions. To fix it, one would call the Unicode version (GetCommandLineW) on Windows instead of relying on the cross-platform standard library. Python 3 does this.

In the meantime, for Python 2, you can do it by calling GetCommandLineW yourself, but it's not especially pretty. You can also use CommandLineToArgvW if you want Windows-style parameter splitting. You can do this with the win32 extensions or with plain ctypes.

Example (though the step of encoding the Unicode string back to UTF-8 bytes is best skipped).
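For reference, a minimal ctypes sketch of the GetCommandLineW / CommandLineToArgvW approach follows. It is not the example referred to above: the function name `get_unicode_argv` is hypothetical, the non-Windows fallback is an addition so the snippet can be imported anywhere, and the final slice assumes that the tail of the Windows command line lines up with `sys.argv`, which may not hold for every way of launching the interpreter.

```python
import sys

def get_unicode_argv():
    """Return the script's arguments as Unicode strings.

    On Windows the arguments are re-read from the wide-character command
    line; on other platforms sys.argv is returned unchanged.
    """
    if sys.platform != "win32":
        return list(sys.argv)

    import ctypes
    from ctypes import wintypes

    GetCommandLineW = ctypes.windll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = wintypes.LPCWSTR

    CommandLineToArgvW = ctypes.windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [wintypes.LPCWSTR,
                                   ctypes.POINTER(ctypes.c_int)]
    CommandLineToArgvW.restype = ctypes.POINTER(wintypes.LPWSTR)

    LocalFree = ctypes.windll.kernel32.LocalFree
    LocalFree.argtypes = [ctypes.c_void_p]
    LocalFree.restype = ctypes.c_void_p

    argc = ctypes.c_int(0)
    argv = CommandLineToArgvW(GetCommandLineW(), ctypes.byref(argc))
    if not argv:
        raise ctypes.WinError()
    try:
        # Copy the strings out before the Windows-allocated array is freed.
        args = [argv[i] for i in range(argc.value)]
    finally:
        LocalFree(argv)

    # The raw command line also holds the interpreter and its options
    # (e.g. "python.exe -u script.py ..."); keep only the part that
    # corresponds to sys.argv.
    return args[len(args) - len(sys.argv):]
```

Called at start-up, `get_unicode_argv()[1:]` would then stand in for `sys.argv[1:]`, with the strings passed on as unicode rather than re-encoded to bytes.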



Answer 2:

I usually use my own utility function for safe conversion from common code pages to unicode. To read the default OS encoding, the locale.getpreferredencoding function can probably help (http://docs.python.org/2/library/locale.html#locale.getpreferredencoding).

Example of a utility function that tries to convert to unicode by iterating over some predefined encodings:

# coding: utf-8
from locale import getpreferredencoding

def to_unicode(s):
    # Already unicode: nothing to convert.
    if isinstance(s, unicode):
        return s

    # Try the OS default encoding first, then a few predefined code pages.
    for cp in (getpreferredencoding(), "cp1255", "cp1250"):
        try:
            return unicode(s, cp)
        except UnicodeDecodeError:
            pass
    raise Exception("Conversion to unicode failed")
    # Alternatively, instead of raising, fall back to a lossy decode:
    # return unicode(s, getpreferredencoding(), "replace")

print (to_unicode("addđšđč枎ŠĐ"))

A fallback can be enabled by passing errors="replace" to the unicode function. Reference: http://docs.python.org/2/library/functions.html#unicode
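A small sketch of what the "replace" fallback does in practice. The helper name `decode_lossy` and the sample bytes are made up for illustration; `bytes.decode` is used instead of `unicode()` so that the snippet behaves the same on Python 2 and Python 3.

```python
def decode_lossy(raw, encoding):
    """Decode raw bytes, substituting U+FFFD for anything that cannot be decoded."""
    return raw.decode(encoding, "replace")

# 0xE9 is not valid ASCII, so it is replaced rather than raising an error.
text = decode_lossy(b"caf\xe9.mp3", "ascii")
print(repr(text))
```

The undecodable byte comes back as U+FFFD, so the result is always a unicode string, but it no longer matches the original file name. For file paths that usually means the fallback is suitable for display only, not for opening the file.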

For converting back to some codepage you can check this.