Convert python filenames to unicode

Posted 2020-02-10 03:04

Question:

I am on Python 2.6 for Windows.

I use os.walk to read a file tree. Files may have non-7-bit characters (German "ae" for example) in their filenames. These are encoded in Python's internal string representation.

I am processing these filenames with Python library functions, and that fails because of the wrong encoding.

How can I convert these filenames to proper (unicode?) Python strings?

I have a file "d:\utest\ü.txt". Passing the path as unicode does not work:

>>> list(os.walk('d:\\utest'))
[('d:\\utest', [], ['\xfc.txt'])]
>>> list(os.walk(u'd:\\utest'))
[(u'd:\\utest', [], [u'\xfc.txt'])]

Answer 1:

If you pass a Unicode string to os.walk(), you'll get Unicode results:

>>> list(os.walk(r'C:\example'))          # Passing an ASCII string
[('C:\\example', [], ['file.txt'])]
>>> 
>>> list(os.walk(ur'C:\example'))        # Passing a Unicode string
[(u'C:\\example', [], [u'file.txt'])]
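For anyone reading this on Python 3, the same "type in, type out" rule seems to hold, only with str and bytes instead of unicode and str. Here is a minimal sketch under that assumption; walk_names and demo_dir are names made up for the illustration, and the temporary directory just stands in for a real tree:

```python
import os
import shutil
import tempfile

def walk_names(top):
    """Collect every file name under top, in whatever type os.walk yields."""
    names = []
    for dirpath, dirnames, filenames in os.walk(top):
        names.extend(filenames)
    return names

# Throwaway tree holding one file with a non-ASCII name (u-umlaut).
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "\xfc.txt"), "w") as handle:
    handle.write("demo")

text_names = walk_names(demo_dir)               # str path in, str names out
byte_names = walk_names(os.fsencode(demo_dir))  # bytes path in, bytes names out

shutil.rmtree(demo_dir)
```

The type of the argument decides the type of every name that comes back, so it is usually easiest to pick one type at the top of the program and stick with it.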


Answer 2:

I was looking for a solution for Python 3.0+. I'll put it up here in case someone else needs it.

import os
import sys

rootdir = 'D:\\COUNTRY\\ROADS\\'  # a raw string cannot end in a single backslash
fs_enc = sys.getfilesystemencoding()
for root, dirnames, filenames in os.walk(rootdir.encode(fs_enc)):
    # do your stuff here, but remember that root, dirnames and
    # filenames now hold bytes objects rather than str
    pass
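If you walk with a bytes path like this and later need text again, os.fsdecode (available since Python 3.2) should convert each name back using the file system encoding. A rough sketch; walk_as_text is a hypothetical helper, and the temporary directory with ROADS.txt is only there so the example has something to walk:

```python
import os
import tempfile

def walk_as_text(rootdir_bytes):
    """Walk a bytes path and yield (root, filename) pairs decoded to str."""
    for root, dirnames, filenames in os.walk(rootdir_bytes):
        for name in filenames:
            # os.fsdecode applies the file system encoding and its error handler.
            yield os.fsdecode(root), os.fsdecode(name)

# Small stand-in tree so the sketch can be run anywhere.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "ROADS.txt"), "w").close()
    found = list(walk_as_text(os.fsencode(tmp)))
```

Decoding at the edge like this keeps the rest of the code working with ordinary str values.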


Answer 3:

os.walk(unicode(root_dir, 'utf-8'))


Answer 4:

A more direct way might be to find your file system's encoding and then decode the filename from that encoding to unicode. For example, assuming the encoding is UTF-8:

unicode_name = unicode(filename, "utf-8", errors="ignore")

To go the other way:

unicode_name.encode("utf-8")
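One caveat worth seeing in action: errors="ignore" silently drops bytes that do not fit the encoding you guessed. The sketch below uses the '\xfc.txt' name from the question and assumes, purely for illustration, that the asker's Windows system uses cp1252; it is written for Python 3:

```python
raw_name = b"\xfc.txt"  # the byte-string name shown in the question

# Wrong guess (UTF-8): the byte 0xFC is not valid there.
wrong_ignored = raw_name.decode("utf-8", "ignore")    # the umlaut is dropped
wrong_replaced = raw_name.decode("utf-8", "replace")  # U+FFFD marks the bad byte

# Assumed correct encoding for a Western European Windows system.
right = raw_name.decode("cp1252")

# Encoding with the same codec gives back the original bytes.
round_trip = right.encode("cp1252")
```

With "ignore" the result is just '.txt', which no longer names the file on disk, so "replace" or a strict decode is probably the safer default while debugging.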


Answer 5:

The documentation doesn't say that os.walk always uses os.listdir, nor does it say how Unicode is handled. However, the os.listdir documentation does say:

Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a Unicode object, the result will be a list of Unicode objects. Undecodable filenames will still be returned as string objects.

Does simply using a Unicode argument work for you?

for dirpath, dirnames, filenames in os.walk(u"."):
  print dirpath
  for fn in filenames:
    print "   ", fn


Answer 6:

No, they are not encoded in Python's internal string representation; there is no such thing. They are encoded in the encoding of the operating system/file system. Passing in unicode works for os.walk, though.

I don't know how os.walk behaves when filenames can't be decoded, but I assume that you'll get a string back, like with os.listdir(). In that case you'll again have problems later. Also, not all of the Python 2.x standard library will accept unicode parameters properly, so you may need to encode them as strings anyway. So the problem may in fact be somewhere else, but you'll notice if that is the case. ;-)

If you need more control over the decoding, you can always pass in a string and then just decode each name yourself with filename = filename.decode() as usual.
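Since a listing may come back as a mix of already-decoded names and raw byte strings, a small normalizing helper can take care of the "decode it yourself" step. This is only a sketch: ensure_text is a hypothetical name, the "replace" error policy is my own choice, and cp1252 is assumed for the sample data:

```python
import sys

def ensure_text(name, encoding=None, errors="replace"):
    """Return name as text, decoding it only if it is still a byte string."""
    if isinstance(name, bytes):
        # Fall back to the file system encoding when none is given.
        return name.decode(encoding or sys.getfilesystemencoding(), errors)
    return name

# A mixed listing: one name already decoded, one still raw bytes.
mixed = ["file.txt", b"\xfc.txt"]
cleaned = [ensure_text(n, encoding="cp1252") for n in mixed]
```

Names that are already text pass through untouched, so the helper can be applied to every entry without checking types at the call site.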