What is the correct way of processing different st

Posted 2019-08-05 09:08

Question:

I need some clarifications.

The problem is that I have a Windows program written in C++ that uses the Windows-specific 'wmain' function, which accepts wchar_t** as its arguments. This makes it possible to pass arbitrary text as command-line parameters to the program: for example, Chinese characters, Japanese characters, and so on.

To be honest, I don't know which encoding this function is normally used with. Probably UTF-32, or perhaps UTF-16. So, the questions:

  • What is the portable, Unix/Linux way to achieve this with the standard main function? My first thought was to use UTF-8 encoded input strings together with some kind of locale setting.

  • Can somebody give a simple example of such a main function? How can a std::string hold Chinese characters?

  • Can we work with Chinese characters encoded in UTF-8 and stored in a std::string as usual, accessing each char (byte) like this: string_object[i]?

Answer 1:

Disclaimer: All Chinese words were provided by the Google Translate service.

1) Just proceed as normal using a normal std::string. A std::string is simply a sequence of bytes, so it can hold text in any character encoding, and argument processing is simple pattern matching on those bytes. So on a Chinese computer with the Chinese version of the program installed, all the program needs to do is compare the Chinese versions of the flags with what the user inputs.

2) For example:

#include <string>
#include <vector>
#include <iostream>

std::string arg_switch = "开关";
std::string arg_option = "选项";
std::string arg_option_error = "缺少参数选项";

int main(int argc, char* argv[])
{
    const std::vector<std::string> args(argv + 1, argv + argc);

    bool do_switch = false;
    std::string option;

    for(auto arg = args.begin(); arg != args.end(); ++arg)
    {
        if(*arg == "--" + arg_switch)
            do_switch = true;
        else if(*arg == "--" + arg_option)
        {
            if(++arg == args.end())
            {
                // option needs a value - not found
                std::cout << arg_option_error << '\n';
                return 1;
            }
            option = *arg;
        }
    }

    std::cout << arg_switch << ": " << (do_switch ? "on":"off") << '\n';
    std::cout << arg_option << ": " << option << '\n';

    return 0;
}

Usage:

./program --开关 --选项 wibble

Output:

开关: on
选项: wibble

3) No.

For general processing of UTF-8/UTF-16 data you need a dedicated library such as ICU.

For character-by-character processing you need to use, or convert to, UTF-32.
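
As an illustration of the conversion mentioned above, here is a minimal sketch of decoding UTF-8 bytes into UTF-32 code points. The function name utf8_to_utf32 is hypothetical and not part of the answer. It assumes the input is meant to be UTF-8, rejects truncated or malformed sequences, and does not check for overlong forms or surrogates, so a library such as ICU remains the safer choice for real input.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
// Sketch only: no checks for overlong encodings or surrogate code points.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len;
        char32_t cp;
        if (lead < 0x80)                { len = 1; cp = lead; }        // ASCII
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; } // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; } // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; } // 11110xxx
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this, utf8_to_utf32(option)[i] yields the i-th code point of the option value, whereas option[i] yields only the i-th byte.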



Answer 2:

In short:

#include <clocale>

int main(int argc, char **argv)
{
    std::setlocale(LC_CTYPE, "");
    // ...
}

http://unixhelp.ed.ac.uk/CGI/man-cgi?setlocale+3

And then you use multibyte string functions. You can still use a normal std::string to store multibyte strings, but beware that a single character in them may span multiple array cells. After successfully setting the locale, you can also use the wide streams (wcin, wcout, wcerr) to read and write wide strings on the standard streams.
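
As a rough sketch of what "multibyte string functions" can look like, the helper below counts the characters of a std::string with std::mbrlen from <cwchar>. The name count_multibyte_chars is hypothetical. The result depends on the current LC_CTYPE locale, so it assumes that setlocale(LC_CTYPE, "") has already been called as shown above.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: counts the characters in a multibyte string as
// interpreted by the current LC_CTYPE locale.
// Returns -1 if the bytes are not a valid sequence in that locale.
long count_multibyte_chars(const std::string& s)
{
    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    const char* p = s.data();
    std::size_t remaining = s.size();
    long count = 0;

    while (remaining > 0) {
        std::size_t n = std::mbrlen(p, remaining, &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2))
            return -1;   // invalid or incomplete multibyte sequence
        if (n == 0)
            n = 1;       // an embedded '\0' occupies one byte
        p += n;
        remaining -= n;
        ++count;
    }
    return count;
}
```

In a UTF-8 locale, a string holding two Chinese characters should report 2 here while its size() is 6. In the default "C" locale the same bytes may be rejected or counted one byte at a time, which is why the setlocale call matters.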



Answer 3:

1) With Linux, you'd get the standard main() and standard char. It would use UTF-8 encoding, so Chinese-specific characters would be included in the string with a multibyte encoding.
Edit: sorry, yes: you have to set the default "" locale like here, as well as cout.imbue().

2) All the classic main() examples would be good examples. As said, Chinese-specific characters would be included in the string with a multibyte encoding. So if you write such a string to cout in a UTF-8 environment, the bytes are passed through unchanged and the terminal interprets the UTF-8 sequences, combining the bytes of each one (between 2 and 4 for a non-ASCII character under the current UTF-8 definition; the original design allowed up to 6) to display the Chinese output.

3) You can operate as usual on strings. There are some issues, however, if you count or print the string length, for example: there is a difference between the size in memory (e.g. 3 bytes) and the characters that the user sees (e.g. only 1). The same applies if you move forward or backward with a pointer. You have to make sure you interpret the multibyte encoding correctly, so that you do not output an invalid encoding.
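
To make the last point concrete, here is a small sketch of shortening a string without cutting a multibyte sequence in half. The function name utf8_safe_prefix is hypothetical, and the sketch assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the longest prefix of s that is at most
// max_bytes long and does not end in the middle of a UTF-8 sequence.
// Assumes s is valid UTF-8.
std::string utf8_safe_prefix(const std::string& s, std::size_t max_bytes)
{
    if (s.size() <= max_bytes)
        return s;

    std::size_t cut = max_bytes;
    // If the first dropped byte is a continuation byte (10xxxxxx), the cut
    // would split a character, so back up to the start of that character.
    while (cut > 0 && (static_cast<unsigned char>(s[cut]) & 0xC0) == 0x80)
        --cut;
    return s.substr(0, cut);
}
```

A plain s.substr(0, 2) on "a" followed by one 3-byte Chinese character would keep the lead byte without its continuation bytes, which is exactly the invalid output the answer warns about; the helper returns just "a" in that case.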

You could be interested in this other SO question.

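Wikipedia explains the logic of the UTF-8 multibyte encoding. From that article you'll understand that a char u is the first byte of a multibyte sequence if the following holds (the last two patterns correspond to the 5- and 6-byte forms of the original UTF-8 design, which RFC 3629 no longer permits):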

( ((u & 0xE0) == 0xC0)
       || ((u & 0xF0) == 0xE0)
       || ((u & 0xF8) == 0xF0)
       || ((u & 0xFC) == 0xF8)
       || ((u & 0xFE) == 0xFC) ) 

It is followed by one or more continuation chars, each of which satisfies:

((u & 0xC0) == 0x80)

All other chars are ASCII chars (i.e. not multibyte).
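
The bit tests above can be wrapped into a small classifier, sketched below with hypothetical names (Utf8Byte, classify_utf8_byte, utf8_code_points). It limits lead bytes to the 2- to 4-byte forms allowed by RFC 3629, so bytes matching the last two tests above are reported as invalid. It assumes valid UTF-8 input when counting.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names, for illustration only.
enum class Utf8Byte { Ascii, Lead, Continuation, Invalid };

// Classifies a single byte by its high bits.
// Sketch only: does not reject the overlong lead bytes 0xC0/0xC1
// or the out-of-range lead bytes 0xF5..0xF7.
Utf8Byte classify_utf8_byte(unsigned char u)
{
    if (u < 0x80)
        return Utf8Byte::Ascii;          // 0xxxxxxx
    if ((u & 0xC0) == 0x80)
        return Utf8Byte::Continuation;   // 10xxxxxx
    if ((u & 0xE0) == 0xC0 || (u & 0xF0) == 0xE0 || (u & 0xF8) == 0xF0)
        return Utf8Byte::Lead;           // 110xxxxx, 1110xxxx, 11110xxx
    return Utf8Byte::Invalid;            // 0xF8..0xFF
}

// Counts code points by counting every byte that is not a continuation byte.
// Assumes the input is valid UTF-8.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char u : s)
        if (classify_utf8_byte(u) != Utf8Byte::Continuation)
            ++n;
    return n;
}
```

For a string holding two 3-byte Chinese characters, size() returns 6 while utf8_code_points returns 2, which is the memory-versus-visible-characters difference described in point 3.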