Filesystem independent way of using glob.glob and regular expressions with Unicode filenames

Posted 2019-02-19 10:03

Question:

I am working on a library which I want to keep platform-, filesystem- and Python 2.x/3.x-independent. However, I don't know how to glob for files and match the filenames against regular expressions in a platform/filesystem-independent way.

E.g. (on Mac, using IPython, Python 2.7):

   In[7]: from glob import glob
   In[8]: !touch 'ü-0.é' # Create the file in the current folder

   In[9]: glob(u'ü-*.é')
  Out[9]: []

   In[10]: import unicodedata as U

   In[11]: glob(U.normalize('NFD', u'ü-*.é'))
  Out[11]: [u'u\u0308-0.e\u0301']

However, this doesn't work on Linux or Windows, where I would need U.normalize('NFC', u'ü-*.é'). The same problem arises when I try to match the filename against a regular expression: only an NFD-normalized Unicode regular expression matches the filename on Mac, whereas only an NFC-normalized one matches filenames read on Linux/Windows (I use the re.UNICODE flag in both cases).
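
To make the mismatch concrete, it can be reproduced without touching the filesystem at all (the filename below is hard-coded in the NFD form that HFS+ returns):

import re
import unicodedata as U

name_from_hfs = u'u\u0308-0.e\u0301'   # u'ü-0.é' as HFS+ stores it (NFD)

# The same pattern, once normalized to NFC and once to NFD.
nfc_regex = re.compile(U.normalize('NFC', u'ü-([0-9]+)\\.é$'), re.UNICODE)
nfd_regex = re.compile(U.normalize('NFD', u'ü-([0-9]+)\\.é$'), re.UNICODE)

print(nfc_regex.match(name_from_hfs))   # None
print(nfd_regex.match(name_from_hfs))   # matches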

Is there a standard way of handling this problem?

My hope was that, just as sys.getfilesystemencoding() returns the encoding used by the filesystem, there would be a function that returns the Unicode normalization form used by the underlying filesystem.

However, I could find neither such a function nor a safe/standard way to feature-test for it.
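
The closest thing to a feature test I can think of is probing the filesystem at runtime, along these lines (a rough sketch; detect_fs_normalization is a made-up helper, not a standard API):

import os
import tempfile
import unicodedata

def detect_fs_normalization(path=None):
    """Create a file whose name is passed in NFC and report which
    normalization form os.listdir() hands back ('NFC', 'NFD', or None)."""
    probe = u'\u00e9'  # u'é' in NFC
    directory = tempfile.mkdtemp(dir=path)
    try:
        open(os.path.join(directory, probe), 'w').close()
        name = os.listdir(directory)[0]
        for form in ('NFC', 'NFD'):
            if name == unicodedata.normalize(form, probe):
                return form
        return None
    finally:
        for f in os.listdir(directory):
            os.remove(os.path.join(directory, f))
        os.rmdir(directory)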


Mac + HFS+ uses NFD normalization: https://apple.stackexchange.com/a/10484

Linux + Windows use NFC normalization: http://qerub.se/filenames-and-unicode-normalization-forms

Link to code: https://github.com/musically-ut/seqfile/blob/feat-unicode/seqfile/seqfile.py

Answer 1:

I'm assuming you want to match unicode equivalent filenames, e.g. you expect an input pattern of u'\xE9*' to match both filenames u'\xE9qui' and u'e\u0301qui' on any operating system, i.e. character-level pattern matching.

You have to understand that this is not the default on Linux, where bytes are taken as bytes, and where not every filename is a valid unicode string in the current system encoding (although Python 3 uses the 'surrogateescape' error handler to represent these as str anyway).
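
For instance, assuming a UTF-8 locale, the single byte 0xE9 (latin-1 'é') is not valid UTF-8 on its own, yet Python 3 still gives you a str for such a filename:

raw = b'\xe9qui'                                # a filename as raw bytes
name = raw.decode('utf-8', 'surrogateescape')   # u'\udce9qui' on Python 3
assert name.encode('utf-8', 'surrogateescape') == raw
# os.listdir('.') would give back this str form for such a file on Python 3.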

With that in mind, this is my solution:

import fnmatch
import os
import sys
import unicodedata

def myglob(pattern, directory=u'.'):
    # Compare everything in a single normalization form (NFC here) so that
    # the pattern matches regardless of the form the filesystem returns.
    pattern = unicodedata.normalize('NFC', pattern)
    results = []
    enc = sys.getfilesystemencoding()
    for name in os.listdir(directory):
        if isinstance(name, bytes):
            # Python 2 returns undecodable names as bytes even when the
            # directory argument is unicode; try to decode them.
            try:
                name = name.decode(enc)
            except UnicodeDecodeError:
                # Filenames that are not proper unicode won't match any pattern
                continue
        if fnmatch.filter([unicodedata.normalize('NFC', name)], pattern):
            # Return the name exactly as the filesystem reported it.
            results.append(name)
    return results
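
Used with the file from the question, a call would look roughly like this (illustrative output on HFS+, where the name comes back in NFD form):

>>> myglob(u'ü-*.é')
[u'u\u0308-0.e\u0301']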


Answer 2:

This is how I solved the problem:

import glob
import os
import re
import sys
import unicodedata as U

# ...

globPattern = os.path.join(folder, prefix + u'*' + suffix)
rawRegEx = prefix + u'([0-9]+)' + suffix + u'$'

# Mac (HFS+) uses NFD normalization for Unicode filenames, while
# Linux/Windows use NFC normalization.
if sys.platform.startswith('darwin'):
    normalizedGlobPattern = U.normalize('NFD', globPattern)
    normalizedRegEx = U.normalize('NFD', rawRegEx)
else:
    normalizedGlobPattern = U.normalize('NFC', globPattern)
    normalizedRegEx = U.normalize('NFC', rawRegEx)

allFiles = glob.glob(normalizedGlobPattern)

# ...

numFilesRegEx = re.compile(normalizedRegEx, re.UNICODE)
numberedFiles = (numFilesRegEx.search(f) for f in allFiles
                 if numFilesRegEx.search(f))
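
The numbers can then be pulled out of the matches, e.g. (a hypothetical continuation):

numbers = sorted(int(m.group(1)) for m in numberedFiles)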

This seems to pass all tests I could throw at it on AppVeyor (Windows), Travis (Linux) and my laptop (Mac + HFS+).

However, I am not sure whether this is safe or whether there is a better way of writing it. For example, I don't know whether it will work on a Mac with an NFS volume mounted on it.