I know that / is illegal in Linux, and the following are illegal in Windows (I think):
*
.
"
/
\
[
]
:
;
|
=
,
What else am I missing?
I need a comprehensive guide, however, and one that takes into account double-byte characters. Linking to outside resources is fine with me.
I need to first create a directory on the filesystem using a name that may contain forbidden characters, so I plan to replace those characters with underscores. I then need to write this directory and its contents to a zip file (using Java), so any additional advice concerning the names of zip directories would be appreciated.
Well, if only for research purposes, then your best bet is to look at this Wikipedia entry on Filenames.
If you want to write a portable function to validate user input and create filenames based on that, the short answer is: don't. Take a look at a portable module like Perl's File::Spec to get a glimpse of all the hoops you have to jump through to accomplish such a "simple" task.
A “comprehensive guide” to forbidden filename characters is not going to work on Windows, because Windows reserves filenames as well as characters. Yes, characters like *, " and ? (and others) are forbidden, but there are an infinite number of names composed only of valid characters that are also forbidden. For example, spaces and dots are valid filename characters, but names composed only of those characters are forbidden.
Windows does not distinguish between upper-case and lower-case characters, so you cannot create a folder named A if one named a already exists. Worse, seemingly allowed names like PRN and CON, and many others, are reserved and not allowed. Windows also has several length restrictions; a filename valid in one folder may become invalid if moved to another folder. The rules for naming files and folders are on MSDN.
You cannot, in general, use user-generated text to create Windows directory names. If you want to allow users to name anything they want, you have to create safe names like A, AB, A2 et al., store user-generated names and their path equivalents in an application data file, and perform path mapping in your application.
If you absolutely must allow user-generated folder names, the only way to tell if they are invalid is to catch exceptions and assume the name is invalid. Even that is fraught with peril, as the exceptions thrown for denied access, offline drives, and out of drive space overlap with those that can be thrown for invalid names. You are opening up one huge can of hurt.
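The path-mapping idea above can be sketched in Java roughly as follows. This is only an illustration: the class name `SafeNameMapper`, the `D1`, `D2`, ... naming scheme, and the in-memory map are assumptions made for the example, and persisting the mapping to an application data file is left out.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps arbitrary user-supplied names to generated,
// filesystem-safe directory names and remembers the association.
class SafeNameMapper {
    private final Map<String, String> userToSafe = new LinkedHashMap<>();
    private int counter = 0;

    // Returns the safe name for a user name, generating one on first use.
    String safeNameFor(String userName) {
        String safe = userToSafe.get(userName);
        if (safe == null) {
            counter++;
            safe = "D" + counter; // only ASCII letters and digits, never a reserved name
            userToSafe.put(userName, safe);
        }
        return safe;
    }

    // Read-only view of the mapping, e.g. for writing it to a data file.
    Map<String, String> mappings() {
        return Collections.unmodifiableMap(userToSafe);
    }
}
```

The application would display the user's original name everywhere and use the generated name only when touching the filesystem, so no user text ever reaches a path.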
Let's keep it simple and answer the question, first.
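The forbidden printable ASCII characters are:
Linux/Unix:
/ (forward slash)
Windows:
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)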
Non-printable characters
If your data comes from a source that would permit non-printable characters then there is more to check for.
Linux/Unix:
0 (NULL byte)
Windows:
0-31 (ASCII control characters)
Note: While it is legal under Linux/Unix file systems to create files with control characters in the filename, it might be a nightmare for the users to deal with such files.
Reserved file names
The following filenames are reserved:
Windows:
CON, PRN, AUX, NUL, COM1 through COM9, and LPT1 through LPT9 (both on their own and with arbitrary file extensions, e.g. LPT1.txt).
Other rules
Windows:
Filenames cannot end in a space or dot.
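As a rough illustration of how these rules could be combined with the asker's plan to substitute underscores, here is a minimal Java sketch. It assumes the commonly documented Windows set (`< > : " / \ | ? *`, control characters 0-31, and the device names CON, PRN, AUX, NUL, COM1-COM9, LPT1-LPT9); the class name `FilenameSanitizer` and the choice to prefix reserved names with an underscore are inventions for this example. It does not handle length limits or case-insensitive collisions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sanitizer: a sketch under the assumptions stated above.
class FilenameSanitizer {
    // Printable ASCII characters forbidden on Windows ('/' also covers Linux/Unix).
    private static final String FORBIDDEN = "<>:\"/\\|?*";

    // Reserved Windows device names, compared case-insensitively.
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
        "CON", "PRN", "AUX", "NUL",
        "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
        "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9"));

    static String sanitize(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        // Iterate by code point so characters outside the BMP stay intact.
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 32 || FORBIDDEN.indexOf(cp) >= 0) {
                sb.append('_');          // control or forbidden character
            } else {
                sb.appendCodePoint(cp);  // everything else, including non-ASCII
            }
        }
        // Windows filenames cannot end in a space or dot: trim them.
        int end = sb.length();
        while (end > 0 && (sb.charAt(end - 1) == ' ' || sb.charAt(end - 1) == '.')) {
            end--;
        }
        sb.setLength(end);
        if (sb.length() == 0) {
            return "_";
        }
        // Reserved names apply with or without an extension, so test the
        // part before the first dot.
        String result = sb.toString();
        int dot = result.indexOf('.');
        String base = (dot < 0 ? result : result.substring(0, dot)).toUpperCase(Locale.ROOT);
        return RESERVED.contains(base) ? "_" + result : result;
    }
}
```

Note that two different inputs can sanitize to the same output (for example `a/b` and `a:b`), so a caller would still need to detect collisions before creating the directory.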
The easy way to get Windows to tell you the answer is to attempt to rename a file via Explorer and type in / for the new name. Windows will pop up a message box listing the illegal characters.
https://support.microsoft.com/en-us/kb/177506
Instead of creating a blacklist of characters, you could use a whitelist. All things considered, the range of characters that make sense in a file or directory name context is quite short, and unless you have some very specific naming requirements your users will not hold it against your application if they cannot use the whole ASCII table.
It does not solve the problem of reserved names in the target file system, but with a whitelist it is easier to mitigate the risks at the source.
In that spirit, this is a range of characters that can be considered safe:
And any additional safe characters you wish to allow. Beyond this, you just have to enforce some additional rules regarding spaces and dots. This is usually sufficient:
This already allows quite complex and nonsensical names. For example, these names would be possible with these rules, and be valid file names in Windows/Linux:
A...........ext
B -.- .ext
In essence, even with so few whitelisted characters you should still decide what actually makes sense, and validate/adjust the name accordingly. In one of my applications, I used the same rules as above but stripped any duplicate dots and spaces.
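A whitelist check along these lines might look like the following Java sketch. The exact character set and rules here are assumptions chosen for the example (Unicode letters, digits, underscore, hyphen, space and dot; the name must start with a letter or digit and must not end with a space or dot), and the class name `WhitelistNames` is hypothetical. The `collapse` method mirrors the "strip duplicate dots and spaces" step mentioned above.

```java
import java.util.regex.Pattern;

// Hypothetical whitelist validator: a sketch, not a definitive rule set.
class WhitelistNames {
    // First character: letter or digit.
    // Middle characters: letter, digit, underscore, hyphen, space or dot.
    // Last character (if more than one): anything allowed except space or dot.
    private static final Pattern VALID = Pattern.compile(
        "[\\p{L}\\p{N}](?:[\\p{L}\\p{N}_\\- .]*[\\p{L}\\p{N}_\\-])?");

    static boolean isValid(String name) {
        return VALID.matcher(name).matches();
    }

    // Collapses runs of dots and runs of spaces into a single character.
    static String collapse(String name) {
        return name.replaceAll("\\.{2,}", ".").replaceAll(" {2,}", " ");
    }
}
```

With these assumed rules, `A...........ext` passes validation and collapses to `A.ext`, while `B -.- .ext` passes and is left unchanged because none of its dots or spaces are adjacent duplicates. Reserved names such as CON still need a separate check.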
The only illegal Unix characters may be / and NULL, but some consideration should also be given to how names are interpreted on the command line.
For example, while it might be legal to name a file 1>&2 or 2>&1 in Unix, file names such as these might be misinterpreted when used on a command line.
Similarly, it might be possible to name a file $PATH, but when you try to access it from the command line, the shell will expand $PATH to the variable's value.