How to detect the character encoding of command li

Posted 2019-07-02 02:50

Is it safe to assume they are ISO-8859-15 (Windows-1252?), or is there some function I can call to query this? The end goal is conversion to UTF-8.


Background:

The problem described by this question arises because XMLStarlet assumes its command line arguments are UTF-8. Under Windows it seems they are actually ISO-8859-15 (Windows-1252?), or at least adding the following to the beginning of main makes things work:

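char **utf8argv = malloc(sizeof(char*) * (argc+1));
utf8argv[argc] = NULL;

{
    iconv_t windows2utf8 = iconv_open("UTF-8", "ISO-8859-15");
    int i;
    for (i = 0; i < argc; i++) {
        const char *arg = argv[i];
        size_t len = strlen(arg);
        /* One ISO-8859-15 byte can expand to up to 3 UTF-8 bytes
           (e.g. the euro sign); the extra byte holds the terminator. */
        size_t outlen = len*3;
        char *utfarg = malloc(outlen + 1);

        char *out = utfarg;
        size_t ret = iconv(windows2utf8,
            &arg, &len,
            &out, &outlen);

        /* iconv() signals failure with (size_t)-1; size_t is unsigned,
           so a "ret < 0" test can never be true. */
        if (ret == (size_t)-1) {
            perror("iconv");
            free(utfarg);
            utf8argv[i] = NULL;
            continue;
        }

        out[0] = '\0';
        utf8argv[i] = utfarg;
    }

    iconv_close(windows2utf8);
    argv = utf8argv;
}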

Testing Encoding

The following program prints out the bytes of its first argument in decimal:

#include <string.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <text>\n", argv[0]);
        return 1;
    }
    for (size_t i = 0; i < strlen(argv[1]); i++) {
        printf("%d ", (unsigned char) argv[1][i]);
    }
    printf("\n");
    return 0;
}

chcp reports code page 850, so the characters æ and Æ should be 145 and 146, respectively.

C:\Users\npostavs\tmp>chcp
Active code page: 850

But we see 230 and 198 reported, which matches code page 1252:

C:\Users\npostavs\tmp>cmd-chars æÆ
230 198
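
The expected byte values can be cross-checked with iconv. The following is only a sketch: `encode_from_utf8` is a hypothetical helper name, and it assumes the local iconv recognizes the names "CP850" and "CP1252" (glibc and GNU libiconv do; other implementations may not).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: encode a UTF-8 string into the named code page.
   Returns the number of bytes written to out, or (size_t)-1 on failure. */
size_t encode_from_utf8(const char *codepage, const char *utf8,
                        unsigned char *out, size_t outcap)
{
    assert(codepage != NULL && utf8 != NULL && out != NULL);

    iconv_t cd = iconv_open(codepage, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;          /* name not known to this iconv */

    char *in = (char *)utf8;        /* iconv() does not write through this */
    size_t inleft = strlen(utf8);
    char *dst = (char *)out;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &in, &inleft, &dst, &outleft);
    iconv_close(cd);

    return rc == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}
```

Encoding æÆ (UTF-8 bytes C3 A6 C3 86) this way gives 145 146 for "CP850" and 230 198 for "CP1252", which is the comparison made above.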

Passing characters outside of the code page causes lossy transformation

Making a shortcut to cmd-chars.exe with arguments αβγ (these are not present in codepage 1252) gives

C:\Users\npostavs\tmp>shortcut-cmd-chars.lnk
97 223 63

Which is aß?.

3 Answers
叛逆
#2 · 2019-07-02 03:20

You can call CommandLineToArgvW with the result of GetCommandLineW as its first argument to get the command-line arguments as an argv-style array of wide strings. This is the only approach that works across Windows configurations, especially given the code page mess; Japanese characters can be passed via a Windows shortcut, for example. After that, you can use WideCharToMultiByte with a code page argument of CP_UTF8 to convert each wide-character argv element to UTF-8.

Note that calling WideCharToMultiByte with an output buffer size (byte count) of 0 returns the number of UTF-8 bytes required for the number of characters specified (or for the entire wide string including the null terminator, if you pass -1 as the number of wide characters to simplify your code). You can then allocate that many bytes using malloc et al. and call WideCharToMultiByte again with the correct byte count instead of 0. If this were performance-critical, a different solution would probably be best, but since this runs once to fetch the command-line arguments, any decrease in performance should be negligible.

Of course, don't forget to free all of your memory, including calling LocalFree with the pointer returned by CommandLineToArgvW as the argument.
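
The steps above could be put together roughly as follows. This is a minimal sketch, not XMLStarlet's code: `utf8_argv_from_command_line` is a hypothetical helper name, the program must be linked against shell32, and the `_WIN32` guard is only there so the file still compiles on other platforms.

```c
#include <assert.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>   /* CommandLineToArgvW */

/* Hypothetical helper: returns a malloc'd, NULL-terminated argv whose
   strings are UTF-8, or NULL on failure. The caller frees each element
   and then the array itself. */
char **utf8_argv_from_command_line(int *argc_out)
{
    assert(argc_out != NULL);

    int wargc = 0;
    LPWSTR *wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
    if (wargv == NULL)
        return NULL;

    char **out = calloc((size_t)wargc + 1, sizeof *out);
    int i = 0;
    for (; out != NULL && i < wargc; i++) {
        /* First call: buffer size 0 asks for the required byte count,
           terminator included because the length argument is -1. */
        int n = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                    NULL, 0, NULL, NULL);
        if (n <= 0)
            break;
        out[i] = malloc((size_t)n);
        if (out[i] == NULL)
            break;
        /* Second call: do the actual conversion. */
        if (WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                out[i], n, NULL, NULL) == 0)
            break;
    }

    if (out != NULL && i < wargc) {     /* stopped early: undo everything */
        for (int j = 0; j <= i; j++)
            free(out[j]);
        free(out);
        out = NULL;
    }

    LocalFree(wargv);                   /* the array is one LocalAlloc block */
    if (out != NULL)
        *argc_out = wargc;
    return out;
}
#endif
```

Calling this at the top of main and assigning the result to argv would replace the iconv-based conversion entirely, since the wide command line never passes through the ANSI code page.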

For more information on these functions and how to use them, see the MSDN documentation.

Rolldiameter
#3 · 2019-07-02 03:28

It seems that you are on Windows.

In this case, you can make a system() call to run the CHCP command.

   #include <stdlib.h>     // Uses: system()
   #include <stdio.h>      // Uses: fopen(), fgets(), fclose()
   // ..... 

   // 1st: Store the present Windows codepage in a text file:
   system("CMD /C \"CHCP > myenc.txt\"");

   // 2nd: Read the first line in the file:
   FILE *F = fopen("myenc.txt", "r");
   char buffer[100] = "";
   if (F != NULL) {
       fgets(buffer, sizeof buffer, F);
       fclose(F);
   }

   // 3rd: Analyze the loaded string to find the Windows codepage:
   int codepage = my_CHCP_analizer_func(buffer);   

   // The function my_CHCP_analizer_func() must be written by you,
   // and it has to take into account the way in which CHCP prints the information.
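
One possible shape for that analyzer is sketched below. It assumes the output looks like the "Active code page: 850" line shown in the question; because the wording differs on localized Windows installations, it simply takes the first decimal number in the line. `parse_chcp_output` is a hypothetical name standing in for my_CHCP_analizer_func().

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical stand-in for my_CHCP_analizer_func(): returns the first
   decimal number found in the line, or -1 if the line has no digits. */
int parse_chcp_output(const char *line)
{
    assert(line != NULL);

    while (*line != '\0' && !isdigit((unsigned char)*line))
        line++;                      /* skip the localized label text */
    if (*line == '\0')
        return -1;
    return (int)strtol(line, NULL, 10);
}
```

Keep in mind that CHCP reports the console code page. In the question's test the console reported 850 while the argument bytes matched 1252, so this number may not describe argv. On Windows, GetACP() returns the ANSI code page directly, without spawning a process.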

Finally, the code pages reported by CHCP can be looked up, for example, here:

Windows Codepages

仙女界的扛把子
#4 · 2019-07-02 03:31

The command-line parameters are in the system default code page, which varies depending on system settings. Rather than naming a specific source charset, you can specify "char" or "" instead and let iconv_open() figure out what the system charset actually is:

iconv_t windows2utf8 = iconv_open("UTF-8", "char");

Otherwise, you are better off retrieving the command line as UTF-16 instead of as ANSI, and then you can convert it directly to UTF-8 using iconv_open("UTF-8", "UTF-16LE"), or WideCharToMultiByte(CP_UTF8) as Chrono suggested.
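
A minimal sketch of the iconv route for UTF-16 input follows. `utf16le_to_utf8` is a hypothetical helper name, and the input is taken as a raw byte buffer so the example does not depend on the platform's wchar_t size.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical helper: convert inlen bytes of UTF-16LE into a malloc'd,
   NUL-terminated UTF-8 string. Returns NULL on failure. */
char *utf16le_to_utf8(const void *utf16le, size_t inlen)
{
    assert(utf16le != NULL && inlen % 2 == 0);

    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return NULL;

    /* A single 16-bit unit needs at most 3 UTF-8 bytes; a surrogate pair
       (two units) needs 4, so 3 bytes per unit is always enough. */
    size_t outcap = inlen / 2 * 3;
    char *result = malloc(outcap + 1);
    if (result != NULL) {
        char *in = (char *)utf16le;  /* iconv() does not write through this */
        size_t inleft = inlen;
        char *out = result;
        size_t outleft = outcap;

        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
            free(result);
            result = NULL;
        } else {
            *out = '\0';
        }
    }

    iconv_close(cd);
    return result;
}
```

Unlike the ANSI route, characters such as αβγ survive this conversion, because they are never squeezed through a code page that lacks them.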
