I have a string that I want to use as a filename, so I want to remove all characters that wouldn't be allowed in filenames, using Python.
I'd rather be strict than otherwise, so let's say I want to retain only letters, digits, and a small set of other characters like "_-.() ". What's the most elegant solution?
The filename needs to be valid on multiple operating systems (Windows, Linux and Mac OS) - it's an MP3 file in my library with the song title as the filename, and is shared and backed up between 3 machines.
Another issue that the other comments haven't addressed yet is the empty string, which is obviously not a valid filename. You can also end up with an empty string from stripping too many characters.
What with the Windows reserved filenames and issues with dots, the safest answer to the question “how do I normalise a valid filename from arbitrary user input?” is “don't even bother trying”: if you can find any other way to avoid it (e.g. using integer primary keys from a database as filenames), do that.
If you must, and you really need to allow spaces and ‘.’ for file extensions as part of the name, try something like:
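A minimal sketch of what such a function might look like, assuming a whitelist of ASCII letters, digits, "_", ".", "-" and space, with "_" as the replacement character. The name make_name and both patterns are illustrative assumptions, not a definitive implementation:

```python
import re

# Anything outside this conservative whitelist gets replaced.
# (The whitelist and the "_" replacement are assumptions.)
_BAD_CHARS = re.compile(r"[^A-Za-z0-9_. -]+")
# Device names that Windows reserves, with or without an extension.
_RESERVED = re.compile(r"^(aux|con|nul|prn|com[1-9]|lpt[1-9])(\.|$)", re.IGNORECASE)


def make_name(text):
    """Return a conservative filename derived from text (hypothetical helper)."""
    name = _BAD_CHARS.sub("_", text)
    # Trailing dots and spaces are a problem on Windows, and a leading dot
    # hides the file on Unix-like systems, so strip both ends.
    name = name.strip(". ")
    if not name:
        # Never return an empty filename.
        name = "_"
    if _RESERVED.match(name):
        # Prefix reserved device names so they become ordinary names.
        name = "_" + name
    return name
```

For example, make_name("con") returns "_con", and make_name("...") returns "_" rather than an empty string.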
Even this can't be guaranteed to be right, especially on unexpected OSs: for example, RISC OS hates spaces and uses ‘.’ as a directory separator.
It doesn't handle empty strings or special filenames ('nul', 'con', etc.).
You can look at the Django framework for how they create a "slug" from arbitrary text. A slug is URL- and filename-friendly.
Their template/defaultfilters.py (at around line 183) defines a function, slugify, that's probably the gold standard for this kind of thing. Essentially, their code is the following. There's more, but I left it out, since it doesn't address slugification, but escaping.
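A sketch in the spirit of that function, written for Python 3. This is an approximation from memory rather than a verbatim copy of Django's source, so treat the details as assumptions:

```python
import re
import unicodedata


def slugify(value):
    """Sketch of a slug function in the spirit of Django's slugify."""
    # Decompose accented characters, then drop anything that is not ASCII.
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    # Remove everything except word characters, whitespace and hyphens.
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", value)
```

Note that this removes dots as well, so a file extension would have to be split off beforehand (for instance with os.path.splitext) and re-attached afterwards.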
Most of these solutions don't work, because distinct inputs can map to the same filename:
'/hello/world' -> 'helloworld'
'/helloworld/' -> 'helloworld'
This generally isn't what you want. Say you are saving the HTML for each link: you're going to overwrite the HTML of a different webpage.
I pickle a dict such as:
2 represents the number that should be appended to the next filename.
I look up the filename each time from the dict. If it's not there, I create a new one, appending the max number if needed.
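The bookkeeping described above can be sketched roughly as follows. The registry layout, the sanitize helper and the function names are all assumptions made for illustration. The sketch also appends the number after an underscore, which is a deliberate deviation so that a suffixed name can never equal the sanitised form of some other input:

```python
import pickle
import re


def sanitize(text):
    """Hypothetical sanitiser: keep only ASCII letters and digits."""
    return re.sub(r"[^A-Za-z0-9]+", "", text)


def filename_for(original, registry):
    """Return a stable, unique filename for original, updating registry.

    Assumed layout: registry[base] = (assigned, next_number), where
    assigned maps each original string to its filename and next_number
    is the suffix the next colliding string will receive.
    """
    base = sanitize(original) or "unnamed"
    assigned, next_number = registry.get(base, ({}, 0))
    if original not in assigned:
        # The first string keeps the bare name; later ones get "_<n>".
        # "_" never survives sanitize(), so suffixed names cannot clash
        # with the sanitised form of another input.
        assigned[original] = base if next_number == 0 else f"{base}_{next_number}"
        registry[base] = (assigned, next_number + 1)
    return assigned[original]


def save_registry(registry, path):
    """Persist the registry between runs."""
    with open(path, "wb") as fh:
        pickle.dump(registry, fh)


def load_registry(path):
    """Load a registry previously written by save_registry."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Starting from an empty dict, '/hello/world' is assigned 'helloworld' and '/helloworld/' is assigned 'helloworld_1'. Asking again for either string returns the same name as before.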
This whitelist approach (i.e., allowing only the chars present in valid_chars) will work as long as there are no limits on the format of the filenames and no combinations of valid chars that are themselves illegal (like ".."). For example, what you describe would allow a file named " . txt", which I think is not valid on Windows. As this is the simplest approach, I'd try to remove whitespace from valid_chars and prepend a known valid string in case of error. Any other approach will have to know what is allowed where in order to cope with Windows file naming limitations, and will thus be a lot more complex.
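A minimal sketch of that suggestion, assuming "error" means the cleaned name is empty or starts with a dot. The exact whitelist, the "file_" prefix and the function name are hypothetical:

```python
import string

# Whitelist without whitespace; the exact set is an assumption.
VALID_CHARS = frozenset("-_.()" + string.ascii_letters + string.digits)
# Hypothetical known-valid prefix used when the cleaned name is unusable.
FALLBACK_PREFIX = "file_"


def whitelist_filename(text):
    """Keep only whitelisted characters, repairing obviously bad results."""
    cleaned = "".join(ch for ch in text if ch in VALID_CHARS)
    # Drop trailing dots, which also removes names such as "." and "..".
    cleaned = cleaned.rstrip(".")
    if not cleaned or cleaned.startswith("."):
        cleaned = FALLBACK_PREFIX + cleaned
    return cleaned
```

With this sketch, " . txt" becomes "file_.txt" and ".." becomes "file_". Windows reserved device names are still not handled here.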
What is the reason to use the strings as file names? If human readability is not a factor, I would go with the base64 module, which can produce filesystem-safe strings. The result won't be readable, but you won't have to deal with collisions and it is reversible.
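A small sketch of this idea, assuming Python 3 and UTF-8 text. It uses the URL-safe variant because the standard base64 alphabet contains "/", which cannot appear in a filename. The helper names are illustrative:

```python
import base64


def encode_name(text):
    """Encode arbitrary text as a filename using URL-safe base64."""
    # The URL-safe alphabet uses "-" and "_" instead of "+" and "/".
    return base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")


def decode_name(name):
    """Recover the original text from a name made by encode_name."""
    return base64.urlsafe_b64decode(name.encode("ascii")).decode("utf-8")
```

Two caveats: the encoded name is about a third longer than the input, so long titles can exceed filename length limits, and base64 is case-sensitive, so on a case-insensitive filesystem two different encoded names could in principle refer to the same file.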
Update: Changed based on Matthew's comment.