Windows Codepage Interactions with Standard C/C++

Posted 2019-02-22 23:48

A customer is complaining that our code used to write files with Japanese characters in the filename but no longer works in all cases. We have always just used good old char * strings to represent filenames, so it came as a bit of a shock to me that it ever worked, and we haven't done anything I am aware of that should have made it stop working. I had them send me a file with an embedded filename in it exported from our software, and it looks like the strings use hex characters 82 and 83 as the first character of a double-byte sequence to represent the Japanese characters. Poking around online leads me to believe this is probably SHIFT_JIS and/or Windows codepage 932.
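To make the "first byte of a double-byte sequence" observation concrete, here is a minimal sketch of a check for code page 932 byte patterns. The function names are hypothetical, and the check is only a heuristic: it uses the commonly documented cp932 lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) and trail-byte ranges (0x40 to 0x7E and 0x80 to 0xFC), and other encodings can produce byte sequences that pass it by coincidence.

```cpp
#include <cstddef>
#include <string>

// True if `b` can start a double-byte character in code page 932.
bool is_cp932_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical heuristic: true if the string contains at least one
// double-byte sequence and every non-ASCII byte fits the cp932 layout.
bool looks_like_cp932(const std::string& s) {
    bool saw_double_byte = false;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (b < 0x80) continue;                // ASCII range
        if (b >= 0xA1 && b <= 0xDF) continue;  // single-byte half-width katakana
        if (!is_cp932_lead_byte(b) || i + 1 >= s.size()) return false;
        const unsigned char t = static_cast<unsigned char>(s[i + 1]);
        if (t < 0x40 || t > 0xFC || t == 0x7F) return false;
        saw_double_byte = true;
        ++i;  // skip the trail byte that was just validated
    }
    return saw_double_byte;
}
```

Bytes 0x82 and 0x83 both fall in the first lead-byte range, which is consistent with the hiragana and katakana rows of Shift-JIS.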

It looks to me like what is happening is previously both fopen and ofstream::open accepted filenames using this codepage; now only fopen does. I've checked the Visual Studio fopen docs, and I see no hint of what makes an acceptable string to pass to fopen.

In the short run, I'm hoping someone can shed some light on the specific Windows fopen versus ofstream::open issue for me. In the long run, I'd really like to know the accepted way of opening Unicode (and other?) filenames in C++, on Windows, Linux, and OS X.

Edited to add: I believe that the opens that work are done in the "C" locale, whereas the ones that do not work are done in whatever the customer's default locale is. However, that has been the case for years now, and the old version of the program still works today on their system, so this seems a longshot for explaining the issue we are seeing.

Update: I sent off a small test program to the customer. It has verified that fopen works fine with the SHIFT_JIS filename, and std::ofstream does not. This is in Visual Studio 2005, and happened regardless of whether I used the default locale or the "C" locale.

I'm still interested if anyone has an explanation for this behavior (and why it mysteriously changed -- perhaps a service pack of VS2005?) and hoping to put together a comprehensive "best practices" for handling Unicode filenames in portable C++ code.

6 Answers
狗以群分
#2 · 2019-02-23 00:11

You may have to set the thread locale to the system default locale. See here for a possible reason for your problems: http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=100887

别忘想泡老子
#3 · 2019-02-23 00:12

I'm not aware of any portable way of handling Unicode filenames using only the default system libraries. But there are some frameworks that provide portable functions, for example:

  • for C: glib uses filenames in UTF-8;
  • for C++: glibmm also uses filenames in UTF-8, requires glib;
  • for C++: boost can use wstring for filenames.

I'm pretty sure .NET/mono frameworks also do contain portable filesystem functions, but I don't know them.

\"骚年 ilove
#4 · 2019-02-23 00:13

Mac OS X uses Unicode as its native character encoding. The basic string objects are CFString and NSString. They store arrays of characters as Unicode.

闹够了就滚
#5 · 2019-02-23 00:14

Functions like fopen or ofstream::open take the file name as char *, but that is interpreted as being in the system code page.

It means that it can be a Japanese character represented as Shift-JIS (cp932), or Simplified Chinese (GBK/cp936), Korean, Arabic, Russian, you name it (as long as it matches the OS system code page).

It also means that it can use Japanese file names on a Japanese system only. Change the system code page and the application "stops working". I suspect this is what happens here (there have been no big changes in Windows in this area since Win 2000).

This is how you change the system code page: http://www.mihai-nita.net/article.php?artID=20050611a

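In the long run you might consider moving to Unicode (and using _wfopen and the wide-character filename overloads that Visual C++ adds to ofstream; note that wofstream changes the character type of the stream contents, not how the filename is interpreted).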
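A minimal sketch of what "moving to Unicode" could look like for portable code, assuming the application keeps all filenames as UTF-8 internally. The helper name fopen_utf8 is hypothetical. On Windows it converts to UTF-16 with MultiByteToWideChar and calls _wfopen; elsewhere it passes the bytes straight to fopen, on the assumption that the filesystem uses UTF-8 names.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#include <wchar.h>
#include <vector>

// Converts UTF-8 to UTF-16; returns an empty string on invalid input.
static std::wstring widen_utf8(const std::string& s) {
    if (s.empty()) return std::wstring();
    const int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                        s.data(), static_cast<int>(s.size()),
                                        NULL, 0);
    if (len <= 0) return std::wstring();
    std::vector<wchar_t> buf(len);
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        s.data(), static_cast<int>(s.size()), &buf[0], len);
    return std::wstring(buf.begin(), buf.end());
}
#endif

// Hypothetical helper: opens a file whose name is given as UTF-8.
// Returns NULL on failure, like fopen.
std::FILE* fopen_utf8(const std::string& path, const std::string& mode) {
#ifdef _WIN32
    const std::wstring wpath = widen_utf8(path);
    const std::wstring wmode = widen_utf8(mode);
    if (wpath.empty() || wmode.empty()) return NULL;
    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    // Assumption: the platform stores filenames as UTF-8 bytes.
    return std::fopen(path.c_str(), mode.c_str());
#endif
}
```

With this approach the system code page no longer matters for file access, but every string that arrives in another encoding (such as cp932 data in old files) has to be converted to UTF-8 at the boundary.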

【Aperson】
#6 · 2019-02-23 00:23

Is somebody still watching this? I've just researched this question and found no answers anywhere, so I can try to explain my findings here.

In VS2005 the fstream filename handling is the odd man out: it doesn't use the system default encoding (the one you get with GetACP and set in Control Panel/Region and Language/Administrative) but, I believe, always CP 1252.

This can cause big confusion, and Microsoft has removed this quirk in later VS versions.

All workarounds for VS2005 have their drawbacks:

  1. Convert your code to use Unicode everywhere

  2. Never open fstreams using narrow-character filenames; always convert them to Unicode yourself using the system default encoding, then use the wide-character filename open/ctor

  3. Retrieve the codepage using GetACP(), then do a matching setlocale:

     setlocale (LC_ALL, ("." + lexical_cast<string> (GetACP())).c_str())
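The third workaround can be sketched without Boost. This is only an illustration: the helper names are hypothetical, the GetACP call is guarded so the snippet also compiles off Windows, and the non-Windows branch (falling back to the environment locale) is my assumption, not part of the workaround above.

```cpp
#include <clocale>
#include <locale>
#include <sstream>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Builds a CRT locale name of the form ".<codepage>", for example ".932".
std::string locale_name_for_codepage(unsigned int codepage) {
    std::ostringstream name;
    name.imbue(std::locale::classic());  // avoid digit grouping from a global locale
    name << '.' << codepage;
    return name.str();
}

// Hypothetical helper: makes the CRT locale match the system ANSI code page.
// Returns false if setlocale rejects the locale name.
bool use_system_codepage_locale() {
#ifdef _WIN32
    const std::string name = locale_name_for_codepage(GetACP());
    return std::setlocale(LC_ALL, name.c_str()) != 0;
#else
    // Assumption: no ANSI code page exists here, so use the environment locale.
    return std::setlocale(LC_ALL, "") != 0;
#endif
}
```

Calling use_system_codepage_locale() once at startup, before any fstream is opened, mirrors the one-line setlocale call. Note that LC_ALL also changes number formatting and other locale-dependent behavior, which is one of the drawbacks of this workaround.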
▲ chillily
#7 · 2019-02-23 00:30

I'm nearly certain that on Linux, the filename string is a UTF-8 string (on the EXT3 filesystem, for example, the only disallowed bytes are slash and NUL), stored in a normal char *. The man page doesn't seem to mention character encoding, which is what leads me to believe it is the system standard of UTF-8. OS X likely uses the same, since it comes from similar roots, but I am less sure about this.
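As a small illustration of that point, the sketch below passes UTF-8 bytes in an ordinary narrow string to the standard streams. The function names are hypothetical, and it assumes a Linux-like system where the filesystem accepts any bytes other than slash and NUL; on Windows the same narrow string would be interpreted in a code page instead.

```cpp
#include <fstream>
#include <string>

// Writes `text` to a file whose name is given as UTF-8 bytes.
bool write_text_file(const std::string& utf8_path, const std::string& text) {
    std::ofstream out(utf8_path.c_str(), std::ios::out | std::ios::trunc);
    if (!out) return false;
    out << text;
    out.flush();
    return out.good();
}

// Reads the first line back; returns an empty string if the file cannot be opened.
std::string read_first_line(const std::string& utf8_path) {
    std::ifstream in(utf8_path.c_str());
    std::string line;
    std::getline(in, line);
    return line;
}
```

For example, write_text_file("\xE3\x83\x86\xE3\x82\xB9\xE3\x83\x88.txt", "hello\n") uses the UTF-8 encoding of three katakana characters as the base name; no wide-character API is involved.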
