I'm creating a library which will be used for file manipulation, both on Linux and Windows, so I need to handle paths. The main requirement is that my functions will receive strings in UTF-8 format. But this causes some problems, one of them being that I use MAX_PATH on Windows and PATH_MAX on Linux to declare static path buffers. With ASCII characters there is no problem, but when a path contains Unicode characters, the number of characters that fit is half as large when a character takes 2 bytes, a third as large when it takes 3 bytes, and so on. Is there a good solution for this problem?
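For example, here is a minimal sketch of the mismatch I mean; the euro sign is just an arbitrary 3-byte UTF-8 character:

    #include <stdio.h>
    #include <string.h>

    /* "€" is one character, but its UTF-8 encoding occupies three bytes,
     * so a buffer of MAX_PATH/PATH_MAX *bytes* holds fewer such characters. */
    int main(void)
    {
        const char *one_char = "\xE2\x82\xAC";   /* U+20AC EURO SIGN in UTF-8 */
        printf("characters: 1, bytes: %zu\n", strlen(one_char));   /* prints 3 */
        return 0;
    }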
Thanks in advance!
P.S. Sorry for my English.
UTF-8 is a multibyte encoding that uses 1 to 4 bytes per character. Since you want to define the maximum path size statically, you may need to define it as n*4 (where n is the path length in characters you want to allow) to accommodate UTF-8 encoded characters.
That totally depends on what you need. If you want MAX_PATH bytes, you simply define a buffer as char name[MAX_PATH]. If you want MAX_PATH characters, you define a buffer as char name[MAX_PATH * 4], since UTF-8 encodes each Unicode character as a variable number of 1 to 4 octets. In a word, as janneb points out, MAX_PATH (or PATH_MAX) specifies the number of underlying bytes, not characters.
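If you ever need that distinction to be explicit, here is a sketch (not a library routine, and it assumes the input is already valid UTF-8) that counts characters, as opposed to the bytes that strlen() reports:

    #include <stddef.h>

    /* Count Unicode characters in a UTF-8 string by skipping
     * continuation bytes (bytes of the form 10xxxxxx). */
    static size_t utf8_char_count(const char *s)
    {
        size_t count = 0;
        for (; *s != '\0'; ++s)
            if (((unsigned char)*s & 0xC0) != 0x80)
                ++count;
        return count;
    }

A path then fits in char name[MAX_PATH] whenever strlen(path) + 1 <= MAX_PATH, no matter what utf8_char_count() returns.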
At least on Linux, your concern seems misplaced. Linux (and POSIX in general) treats a path as an opaque blob of bytes terminated by '\0'; it does not concern itself with how those bytes map to characters. That is, PATH_MAX specifies the maximum length of a path name in bytes, not in characters. So if a path name contains multibyte UTF-8 characters, that just means its maximum length in characters is smaller than PATH_MAX.
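For instance, the usual POSIX pattern works on the raw bytes; this sketch assumes Linux, where PATH_MAX is available from <limits.h>, and the path literal is only a placeholder:

    #include <limits.h>   /* PATH_MAX (defined on Linux; optional in POSIX) */
    #include <stdio.h>
    #include <stdlib.h>   /* realpath */
    #include <string.h>

    int main(void)
    {
        char resolved[PATH_MAX];   /* PATH_MAX bytes, however they decode */

        if (realpath("/tmp/каталог", resolved) != NULL)
            printf("%s is %zu bytes long\n", resolved, strlen(resolved));
        else
            perror("realpath");
        return 0;
    }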
Doesn’t Microsoft use either UCS-2 or UTF-16 for its pathnames, so that MAX_PATH is really a count of 16-bit code units, not even proper characters?
I know that Apple uses UTF-16, that each component in a pathname can be up to 256 UTF-16 code units (not characters), and that it has normalized to something approximating NFD for a long time now.
I suspect you will have to first normalize if necessary, such as to NFD for Apple, then encode to your native filesystem’s internal format, and then check the length.
When you do that comparison, it is critical to remember that Unix uses 8-bit code units, Microsoft and Apple use 16-bit code units, and that no one seems to bother to actually use abstract characters. They could do that if they used UTF-32, but nobody wastes that much space in the filesystem. Pity, that.
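On the Windows side, the check after conversion might look like this sketch; fits_in_max_path is a made-up helper, and it deliberately ignores the \\?\ long-path prefix:

    #include <windows.h>

    /* Convert the UTF-8 input to UTF-16 and compare the resulting count of
     * 16-bit code units (including the terminating NUL) against MAX_PATH. */
    static int fits_in_max_path(const char *utf8_path)
    {
        int wide_len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                           utf8_path, -1, NULL, 0);
        if (wide_len == 0)
            return 0;                /* invalid UTF-8 or conversion failure */
        return wide_len <= MAX_PATH; /* UTF-16 code units, NUL included */
    }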