Is it safe to assume they are ISO-8859-15 (Windows-1252?), or is there some function I can call to query this? The end goal is to convert them to UTF-8.
Background:
The problem described by this question arises because XMLStarlet assumes its command line arguments are UTF-8. Under Windows it seems they are actually ISO-8859-15 (Windows-1252?), or at least adding the following to the beginning of main makes things work:
    char **utf8argv = malloc(sizeof(char*) * (argc+1));
    utf8argv[argc] = NULL;
    {
        iconv_t windows2utf8 = iconv_open("UTF-8", "ISO-8859-15");
        int i;
        for (i = 0; i < argc; i++) {
            const char *arg = argv[i];
            size_t len = strlen(arg);
            /* one input byte can need up to 3 UTF-8 bytes (e.g. the euro sign) */
            size_t outlen = len*3;
            char *utfarg = malloc(outlen + 1);
            char *out = utfarg;
            size_t ret = iconv(windows2utf8,
                               &arg, &len,
                               &out, &outlen);
            /* iconv returns (size_t)-1 on error; size_t is unsigned, so "ret < 0" is never true */
            if (ret == (size_t)-1) {
                perror("iconv");
                free(utfarg);
                utf8argv[i] = NULL;
                continue;
            }
            out[0] = '\0';
            utf8argv[i] = utfarg;
        }
        iconv_close(windows2utf8);
        argv = utf8argv;
    }
Testing Encoding
The following program prints out the bytes of its first argument in decimal:
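    #include <string.h>
    #include <stdio.h>
    int main(int argc, char *argv[])
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s ARG\n", argv[0]);
            return 1;
        }
        /* print each byte of the first argument as an unsigned decimal value */
        for (size_t i = 0; i < strlen(argv[1]); i++) {
            printf("%d ", (unsigned char) argv[1][i]);
        }
        printf("\n");
        return 0;
    }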
chcp reports code page 850, so the characters æ and Æ should be 145 and 146, respectively.
C:\Users\npostavs\tmp>chcp
Active code page: 850
But we see 230 and 198 reported, which matches code page 1252:
C:\Users\npostavs\tmp>cmd-chars æÆ
230 198
Passing characters outside of codepage causes lossy transformation
Making a shortcut to cmd-chars.exe with arguments αβγ (these are not present in codepage 1252) gives
C:\Users\npostavs\tmp>shortcut-cmd-chars.lnk
97 223 63
Which is aß? (α and β were mapped to look-alike characters, and γ was replaced by a question mark).
You can call CommandLineToArgvW, passing the result of GetCommandLineW as the first argument, to get the command-line arguments in an argv-style array of wide strings. This is the only portable Windows way, especially with the code page mess; Japanese characters can be passed via a Windows shortcut, for example. After that, you can use WideCharToMultiByte with a code page argument of CP_UTF8 to convert each wide-character argv element to UTF-8.
Note that calling WideCharToMultiByte with an output buffer size (byte count) of 0 will allow you to determine the number of UTF-8 bytes required for the number of characters specified (or for the entire wide string including the null terminator, if you pass -1 as the number of wide characters to simplify your code). Then you can allocate the required number of bytes using malloc et al. and call WideCharToMultiByte again with the correct number of bytes instead of 0. If this were performance-critical, a different solution would probably be best, but since this is a one-time function to get command-line arguments, I'd say any decrease in performance would be negligible.
Of course, don't forget to free all of your memory, including calling LocalFree with the pointer returned by CommandLineToArgvW as the argument.
For more info on the functions and how you can use them, see the MSDN documentation.
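The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: utf8_argv and free_utf8_argv are hypothetical helper names, error handling is kept short, and the non-Windows branch simply copies argv on the assumption that it is already UTF-8 there. On Windows, CommandLineToArgvW needs shell32 at link time.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>   /* CommandLineToArgvW */
#endif

/* Hypothetical helper: free an array returned by utf8_argv(). */
void free_utf8_argv(char **args)
{
    if (args == NULL) return;
    for (char **p = args; *p != NULL; p++) free(*p);
    free(args);
}

/* Hypothetical helper: return a malloc'd, NULL-terminated array of UTF-8
   argument strings and store the number of arguments in *count.
   Returns NULL on failure. */
char **utf8_argv(int argc, char **argv, int *count)
{
    assert(count != NULL);
#ifdef _WIN32
    (void)argc; (void)argv;          /* the wide command line is used instead */
    int n = 0;
    wchar_t **wargv = CommandLineToArgvW(GetCommandLineW(), &n);
    if (wargv == NULL) return NULL;
    char **out = calloc((size_t)n + 1, sizeof *out);
    for (int i = 0; out != NULL && i < n; i++) {
        /* First call: buffer size 0 asks for the required byte count
           (including the terminator, because the length is -1). */
        int bytes = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                        NULL, 0, NULL, NULL);
        if (bytes > 0) out[i] = malloc((size_t)bytes);
        /* Second call: the actual conversion into the new buffer. */
        if (bytes <= 0 || out[i] == NULL ||
            WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                out[i], bytes, NULL, NULL) == 0) {
            free_utf8_argv(out);
            out = NULL;
        }
    }
    LocalFree(wargv);                /* releases the CommandLineToArgvW result */
    if (out != NULL) *count = n;
    return out;
#else
    /* Assumption: outside Windows, argv is already UTF-8, so just copy it. */
    char **out = calloc((size_t)argc + 1, sizeof *out);
    for (int i = 0; out != NULL && i < argc; i++) {
        out[i] = malloc(strlen(argv[i]) + 1);
        if (out[i] == NULL) {
            free_utf8_argv(out);
            out = NULL;
        } else {
            strcpy(out[i], argv[i]);
        }
    }
    if (out != NULL) *count = argc;
    return out;
#endif
}
```

At the top of main this could be used as char **args = utf8_argv(argc, argv, &n); followed by free_utf8_argv(args) once the arguments are no longer needed.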
It seems that you are under Windows.
In this case, you can make a system() call to run the CHCP command.
Finally, the code pages reported by CHCP can be looked up, for example, here: Windows Codepages
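Running CHCP through system() only prints a number to the terminal, and the test output in the question (chcp says 850, argv bytes match 1252) suggests that the console code page is not the one argv uses. A rough sketch of querying the code pages from C instead: GetACP, GetConsoleCP and GetConsoleOutputCP are Win32 calls, while code_page_to_charset_name and print_code_pages are hypothetical helper names. The "CP" plus number naming is an assumption that holds for common code pages in GNU libiconv but should be checked against the iconv build in use.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: format a code page number as an iconv-style
   charset name such as "CP1252".
   Returns 0 on success, -1 if the buffer is too small. */
int code_page_to_charset_name(unsigned int code_page, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    int n = snprintf(buf, size, "CP%u", code_page);
    return (n > 0 && (size_t)n < size) ? 0 : -1;
}

/* Hypothetical helper: print the ANSI code page next to the console ones. */
void print_code_pages(void)
{
#ifdef _WIN32
    char name[16];
    unsigned int ansi = GetACP();    /* ANSI code page of the system */
    if (code_page_to_charset_name(ansi, name, sizeof name) == 0)
        printf("ANSI code page: %u (%s)\n", ansi, name);
    printf("console input code page: %u\n", GetConsoleCP());
    printf("console output code page: %u\n", GetConsoleOutputCP());
#else
    puts("code pages are a Windows concept; nothing to query here");
#endif
}
```

The resulting name could then be passed as the source charset to iconv_open in place of a hard-coded "ISO-8859-15".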
The command-line parameters are in the system default codepage, which varies depending on system settings. Rather than specify a specific source charset at all, you can specify "char" or "" instead and let iconv_open() figure out what the system charset actually is.
Otherwise, you are better off retrieving the command-line as UTF-16 instead of as Ansi, and then you can convert it directly to UTF-8 using iconv_open("UTF-8", "UTF-16LE"), or WideCharToMultiByte(CP_UTF8) like Chrono suggested.
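A minimal sketch of the iconv route, assuming a POSIX-style iconv whose input parameter is declared char ** (some older libiconv headers use const char ** instead). convert_to_utf8 is a hypothetical helper name. Passing "" or "char" as from_charset is what this answer suggests for picking up the system charset; whether those names are accepted depends on the iconv implementation, so treat that as something to verify.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string s from
   from_charset to UTF-8. Returns a malloc'd string, or NULL on failure. */
char *convert_to_utf8(const char *from_charset, const char *s)
{
    assert(from_charset != NULL && s != NULL);

    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1) return NULL;      /* charset name not supported */

    size_t inleft = strlen(s);
    size_t outleft = inleft * 4;             /* generous upper bound for UTF-8 */
    char *result = malloc(outleft + 1);      /* +1 for the terminator */
    char *in = (char *)s;                    /* iconv does not modify the input */
    char *out = result;

    /* iconv reports errors as (size_t)-1, not as a negative value. */
    if (result != NULL &&
        iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
        free(result);
        result = NULL;
    }
    if (result != NULL) *out = '\0';         /* out points past the last byte written */

    iconv_close(cd);
    return result;
}
```

For example, convert_to_utf8("ISO-8859-15", argv[1]) would cover the case from the question, and the caller frees the result with free().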