I'm creating a library which will be used for file manipulation, both on Linux and Windows, so I need to handle paths. The main requirement is that my functions will receive strings in UTF-8 format. But this causes some problems, one of them being that I use MAX_PATH on Windows and PATH_MAX on Linux to declare static path buffers. With ASCII characters there is no problem, but when a path contains Unicode characters, the number of characters that fit is half as large when a character takes 2 bytes, a third as large when it takes 3 bytes, and so on. Is there a good solution for this problem?
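For example, here is a minimal sketch of the mismatch I mean; the euro sign is just an arbitrary 3-byte UTF-8 character:

    #include <stdio.h>
    #include <string.h>

    /* "€" is one character, but its UTF-8 encoding occupies three bytes,
     * so a buffer of MAX_PATH/PATH_MAX *bytes* holds fewer such characters. */
    int main(void)
    {
        const char *one_char = "\xE2\x82\xAC";   /* U+20AC EURO SIGN in UTF-8 */
        printf("characters: 1, bytes: %zu\n", strlen(one_char));   /* prints 3 */
        return 0;
    }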
Thanks in advance!
P.S. Sorry for my English.
UTF-8 is a multibyte encoding that uses 1 to 4 bytes per character. Since you want to define the maximum path size statically, you may need to define it as n*4 (where n is the path length in characters you want to allow) to accommodate UTF-8 encoded characters.
That totally depends on what you need. If you want MAX_PATH bytes, you simply define a buffer as char name[MAX_PATH]. If you want MAX_PATH characters, you define a buffer as char name[MAX_PATH * 4], since UTF-8 encodes each Unicode character as a variable number of 1 to 4 octets. In a word, as janneb points out, MAX_PATH (or PATH_MAX) specifies the number of underlying bytes, not characters.
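If you ever need that distinction to be explicit, here is a sketch (not a library routine, and it assumes the input is already valid UTF-8) that counts characters, as opposed to the bytes that strlen() reports:

    #include <stddef.h>

    /* Count Unicode characters in a UTF-8 string by skipping
     * continuation bytes (bytes of the form 10xxxxxx). */
    static size_t utf8_char_count(const char *s)
    {
        size_t count = 0;
        for (; *s != '\0'; ++s)
            if (((unsigned char)*s & 0xC0) != 0x80)
                ++count;
        return count;
    }

A path then fits in char name[MAX_PATH] whenever strlen(path) + 1 <= MAX_PATH, no matter what utf8_char_count() returns.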
At least on Linux, your concern seems misplaced. Linux (and POSIX in general) treats a path as an opaque blob of bytes terminated by '\0'; it does not concern itself with how those bytes map to characters. That is, PATH_MAX specifies the maximum length of a path name in bytes, not in characters. So if a path name contains multibyte UTF-8 characters, that just means its maximum length in characters is smaller than PATH_MAX.
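For instance, the usual POSIX pattern works on the raw bytes; this sketch assumes Linux, where PATH_MAX is available from <limits.h>, and the path literal is only a placeholder:

    #include <limits.h>   /* PATH_MAX (defined on Linux; optional in POSIX) */
    #include <stdio.h>
    #include <stdlib.h>   /* realpath */
    #include <string.h>

    int main(void)
    {
        char resolved[PATH_MAX];   /* PATH_MAX bytes, however they decode */

        if (realpath("/tmp/каталог", resolved) != NULL)
            printf("%s is %zu bytes long\n", resolved, strlen(resolved));
        else
            perror("realpath");
        return 0;
    }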
Doesn’t Microsoft use either UCS-2 or UTF-16 for its pathnames, so that MAX_PATH is really a count of 16-bit code units, not even proper characters?
I know that Apple uses UTF-16, that each component in a pathname can be up to 256 UTF-16 code units (not characters), and that it has normalized to something approximating NFD for a long time now.
I suspect you will have to first normalize if necessary, such as to NFD for Apple, then encode to your native filesystem’s internal format, and then check the length.
When you do that comparison, it is critical to remember that Unix uses 8-bit code units, Microsoft and Apple use 16-bit code units, and that no one seems to bother to actually use abstract characters. They could do that if they used UTF-32, but nobody wastes that much space in the filesystem. Pity, that.
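On the Windows side, the check after conversion might look like this sketch; fits_in_max_path is a made-up helper, and it deliberately ignores the \\?\ long-path prefix:

    #include <windows.h>

    /* Convert the UTF-8 input to UTF-16 and compare the resulting count of
     * 16-bit code units (including the terminating NUL) against MAX_PATH. */
    static int fits_in_max_path(const char *utf8_path)
    {
        int wide_len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                           utf8_path, -1, NULL, 0);
        if (wide_len == 0)
            return 0;                /* invalid UTF-8 or conversion failure */
        return wide_len <= MAX_PATH; /* UTF-16 code units, NUL included */
    }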