I am working on a library which I want to keep platform-, filesystem- and Python 2.x/3.x-independent. However, I don't know how to glob for files and match the filenames against regular expressions in a platform/filesystem-independent way.
E.g. (on Mac, using IPython, Python 2.7):
In[7]: from glob import glob
In[8]: !touch 'ü-0.é' # Create the file in the current folder
In[9]: glob(u'ü-*.é')
Out[9]: []
In[10]: import unicodedata as U
In[11]: glob(U.normalize('NFD', u'ü-*.é'))
Out[11]: [u'u\u0308-0.e\u0301']
However, this doesn't work on Linux or Windows, where I would instead need unicodedata.normalize('NFC', u'ü-*.é'). The same problem arises when I try to match the filename against a regular expression: on a Mac only an NFD-normalized Unicode pattern matches the filename, whereas on Linux/Windows only an NFC-normalized pattern matches (I use the re.UNICODE flag in both cases).
Is there a standard way of handling this problem?
My hope is that, just as sys.getfilesystemencoding() returns the encoding of the file system, there would exist a function which returns the Unicode normalization used by the underlying filesystem. However, I could find neither such a function nor a safe/standard way to feature-test for it.
Mac + HFS+ uses NFD normalization: https://apple.stackexchange.com/a/10484
Linux and Windows use NFC normalization: http://qerub.se/filenames-and-unicode-normalization-forms
Link to code: https://github.com/musically-ut/seqfile/blob/feat-unicode/seqfile/seqfile.py
I'm assuming you want to match unicode-equivalent filenames, e.g. you expect an input pattern of u'\xE9*' to match both the filenames u'\xE9qui' and u'e\u0301qui' on any operating system, i.e. character-level pattern matching.
You have to understand that this is not the default on Linux, where bytes are taken as bytes, and where not every filename is a valid unicode string in the current system encoding (although Python 3 uses the 'surrogateescape' error handler to represent such names as str anyway).
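As a minimal illustration of that character-level equivalence (the filenames here are hard-coded strings, not read from disk):

```python
import fnmatch
import unicodedata

pattern = unicodedata.normalize('NFC', u'\xE9*')  # u'é*', precomposed
names = [u'\xE9qui', u'e\u0301qui']               # NFC and NFD spellings

# The raw strings are unequal, but normalizing both sides to the
# same form lets the pattern match either spelling.
matches = [n for n in names
           if fnmatch.filter([unicodedata.normalize('NFC', n)], pattern)]
```

Without the normalization step, only the NFC spelling would match.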
With that in mind, this is my solution:
import fnmatch
import os
import sys
import unicodedata

def myglob(pattern, directory=u'.'):
    # Normalize the pattern once; every directory entry is normalized
    # the same way before matching, so NFC/NFD differences disappear.
    pattern = unicodedata.normalize('NFC', pattern)
    results = []
    enc = sys.getfilesystemencoding()
    for name in os.listdir(directory):
        if isinstance(name, bytes):
            # On Python 2, os.listdir() can return undecodable names
            # as bytes even when given a unicode directory argument.
            try:
                name = name.decode(enc)
            except UnicodeDecodeError:
                # Filenames that are not proper unicode won't match
                # any pattern
                continue
        if fnmatch.filter([unicodedata.normalize('NFC', name)], pattern):
            results.append(name)
    return results
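The same idea can be exercised end to end against a real directory. This sketch creates a file under its decomposed (NFD) spelling in a throwaway temp directory, then finds it with a precomposed (NFC) pattern:

```python
import fnmatch
import os
import tempfile
import unicodedata

d = tempfile.mkdtemp()
# Create u'ü-0.é' under its decomposed (NFD) spelling.
nfd_name = unicodedata.normalize('NFD', u'\xfc-0.\xe9')
open(os.path.join(d, nfd_name), 'w').close()

# Match with a precomposed (NFC) pattern by normalizing both sides.
pattern = unicodedata.normalize('NFC', u'\xfc-*.\xe9')
hits = [n for n in os.listdir(d)
        if fnmatch.filter([unicodedata.normalize('NFC', n)], pattern)]
```

Whether the filesystem hands the name back composed (HFS+) or exactly as stored (most Linux filesystems), normalizing before matching finds it either way.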
This is how I solve the problem:
import glob
import os
import re
import sys
import unicodedata as U
# ...
globPattern = os.path.join(folder, prefix + u'*' + suffix)
rawRegEx = prefix + u'([0-9]+)' + suffix + u'$'
# Mac (HFS+) stores Unicode filenames in NFD normalization, while
# Linux and Windows conventionally use NFC.
if sys.platform.startswith('darwin'):
    normalizedGlobPattern = U.normalize('NFD', globPattern)
    normalizedRegEx = U.normalize('NFD', rawRegEx)
else:
    normalizedGlobPattern = U.normalize('NFC', globPattern)
    normalizedRegEx = U.normalize('NFC', rawRegEx)
allFiles = glob.glob(normalizedGlobPattern)
# ...
numFilesRegEx = re.compile(normalizedRegEx, re.UNICODE)
numberedFiles = (m for m in (numFilesRegEx.search(f) for f in allFiles)
                 if m)
This seems to pass all the tests I could throw at it on AppVeyor (Windows), Travis (Linux) and my laptop (Mac + HFS+).
However, I am not sure whether this is safe or whether there is a better way of writing it. For example, I don't know whether it will work on a Mac with an NFS volume mounted on it.
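One way to sidestep that uncertainty is to probe at runtime instead of switching on sys.platform: create a throwaway file with a precomposed name in the directory of interest and inspect the form in which the filesystem reports the name back. This is only a sketch; detect_normalization is a hypothetical helper, not part of any standard library:

```python
import os
import tempfile
import unicodedata

def detect_normalization(directory=None):
    """Guess the normalization a directory's filesystem applies to names."""
    directory = directory or tempfile.gettempdir()
    # Probe file whose name starts with a precomposed 'é' (NFC).
    fd, path = tempfile.mkstemp(prefix=u'\xe9', dir=directory)
    os.close(fd)
    try:
        # Read the name back through listdir: what matters is the form
        # the filesystem reports, not the string we passed in.
        target = unicodedata.normalize('NFC', os.path.basename(path))
        for name in os.listdir(directory):
            if unicodedata.normalize('NFC', name) == target:
                if name.startswith(u'\xe9'):
                    return 'NFC'
                if name.startswith(u'e\u0301'):
                    return 'NFD'
        return 'unknown'
    finally:
        os.remove(path)
```

Probing the actual target directory (rather than assuming per-OS behavior) should also give a sensible answer for mounts such as NFS whose normalization differs from the boot volume's.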