Usually when I want my program to use UTF-8 encoding, I write `setlocale(LC_ALL, "");`. But today I found that this just sets the locale to the environment's default locale, and I can't tell whether the environment's default actually uses UTF-8.
I wonder, is there any way to force the character encoding to be UTF-8? Also, is there any way to check whether my program is using UTF-8?
Try:
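A minimal sketch (the `en_US.UTF-8` name here is an assumption; it must be a locale your system actually provides, see below):

```c
#include <locale.h>

int main(void)
{
    /* Request a specific UTF-8 locale instead of the environment default. */
    setlocale(LC_ALL, "en_US.UTF-8");

    /* ... rest of the program ... */
    return 0;
}
```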
You can run `locale -a` in the terminal to get a full list of locales supported by your system (`"en_US.UTF-8"` should be supported by most/all UTF-8-supporting systems).

EDIT 1 (alternate spelling)

In the comments, Lee points out that some systems have an alternate spelling, `"en_US.utf8"` (which surprised me, but we learn new stuff every day). Since `setlocale()` returns `NULL` when it fails, you can chain these calls:
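A sketch of such a chain, trying one spelling and then falling back to the other (the order is just an illustration):

```c
#include <locale.h>
#include <stdio.h>

int main(void)
{
    /* setlocale() returns NULL on failure, so try the spellings in turn. */
    if (setlocale(LC_ALL, "en_US.UTF-8") == NULL &&
        setlocale(LC_ALL, "en_US.utf8") == NULL)
        fprintf(stderr, "Warning: no suitable UTF-8 locale found.\n");

    return 0;
}
```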
EDIT 2 (finding out if we're using UTF-8)

To find out if the locale is set to UTF-8 (after attempting to set it), you can either check the returned value (`NULL` means the call failed) or check which locale is actually in use.

Option 1:
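Checking the returned value might look like this (a sketch, reusing the chain from above):

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* If every attempt returns NULL, no UTF-8 locale could be selected. */
    if (setlocale(LC_ALL, "en_US.UTF-8") == NULL &&
        setlocale(LC_ALL, "en_US.utf8") == NULL) {
        fprintf(stderr, "Could not switch to a UTF-8 locale.\n");
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}
```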
Option 2:
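Querying the locale actually in effect might look like this (a sketch; passing `NULL` as the second argument to `setlocale()` only queries the current setting, and the substring test is a heuristic):

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    setlocale(LC_ALL, "en_US.UTF-8");

    /* Query (without changing) the locale currently in use. */
    const char *current = setlocale(LC_ALL, NULL);
    if (current && (strstr(current, "UTF-8") || strstr(current, "utf8")))
        printf("Using a UTF-8 locale: %s\n", current);
    else
        printf("Not using a UTF-8 locale (current: %s)\n",
               current ? current : "unknown");

    return 0;
}
```

On POSIX systems you can also inspect `nl_langinfo(CODESET)` from `<langinfo.h>`, which reports only the character encoding of the current locale.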
This is not an answer, but a third, quite complex example of how to use wide character I/O. It was too long to add to my actual answer to this question.
This example shows how to read and process CSV files (RFC-4180 format, optionally with limited backslash escape support) using wide strings.
The following code is CC0/public domain, so you are free to use it any way you like, even include it in your own proprietary projects, but if it breaks anything, you get to keep all the bits and not complain to me. (I'll be happy to include any bug fixes if you find and report them in a comment below, though.)
The logic of the code is robust, however. In particular, it supports universal newlines, i.e. all four common newline types: Unix-like LF (`\n`), old CR LF (`\r\n`), old Mac CR (`\r`), and the occasionally encountered weird LF CR (`\n\r`). There are no built-in limitations wrt. the length of a field, the number of fields in a record, or the number of records in a file. It works very nicely if you need to convert CSV, or process a CSV input stream-like (field by field or record by record), without having to hold more than one in memory at any one point. If you want to construct structures describing the records and fields in memory, you'll need to add some scaffolding code for that.

Because of the universal newline support, when reading input interactively, this program might require two consecutive end-of-inputs (Ctrl+Z in Windows and MS-DOS, Ctrl+D everywhere else), as the first one is usually "consumed" by the `csv_next_field()` or `csv_skip_field()` function, and the `csv_next_record()` function needs to re-read it to actually detect it. However, you do not normally ask the user to input CSV data interactively, so this should be an acceptable quirk.

The use of the above `csv_next_field()`, `csv_skip_field()`, and `csv_next_record()` is quite straightforward.

Open the CSV file normally, then call `fwide(stream, 1)` on it to tell the C library you intend to use the wide string variants instead of the standard narrow string I/O functions.

Create four variables, and initialize the first two:
`field` is a pointer to the dynamically allocated contents of each field you read. It is allocated automatically; essentially, you don't need to worry about it at all. `allocated` holds the currently allocated size (in wide characters, including the terminating `L'\0'`), and we'll use `length` and `status` later.

At this point, you are ready to read or skip the first field in the first record.
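For example, the setup might look like this (a sketch; the file name and error handling are only illustrative, and the `csv_*()` helpers described in this answer are assumed to be declared elsewhere):

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    /* Open the CSV file normally... */
    FILE *stream = fopen("input.csv", "r");
    if (!stream) {
        fprintf(stderr, "input.csv: cannot open file.\n");
        return EXIT_FAILURE;
    }

    /* ...and tell the C library we intend to use wide character I/O on it. */
    fwide(stream, 1);

    /* The four variables used with the csv_*() helpers described above. */
    wchar_t *field = NULL;   /* contents of the current field (allocated as needed) */
    size_t allocated = 0;    /* allocated size of field, in wide characters */
    size_t length;           /* length of the current field, in wide characters */
    int status;              /* status returned by the csv_*() calls */

    /* ... read or skip fields and records here (see below) ... */

    fclose(stream);
    return EXIT_SUCCESS;
}
```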
You do not wish to call `csv_next_record()` at this point, unless you wish to skip the very first record in the file entirely.

Call `status = csv_skip_field(stream);` to skip the next field, or `status = csv_next_field(stream, &field, &allocated, &length);` to read it.

If `status == CSV_OK`, you have the field contents in the wide string `field`. It has `length` wide characters in it.

If `status == CSV_END`, there were no more fields in the current record. (The `field` is unchanged, and you should not examine it.)

Otherwise, `status < 0`, and it is an error code. You can use `csv_error(status)` to obtain a (narrow) string describing it.

At any point, you can move (skip) to the start of the next record by calling `status = csv_next_record(stream);`.
If it returns `CSV_OK`, there might be a new record available. (We only know when you try to read or skip the first field. This is similar to how the standard C library function `feof()` only tells you whether you have tried to read past the end of input; it does not tell you whether there is more data available or not.)

If it returns `CSV_END`, you already have processed the last record, and there are no more records.

Otherwise, it returns a negative error code, `status < 0`. You can use `csv_error(status)` to obtain a (narrow) string describing it.

After you are done, discard the field buffer:
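In other words, something along these lines:

```c
free(field);
field = NULL;       /* optional, but recommended (see below) */
allocated = 0;      /* optional, but recommended (see below) */
```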
You do not actually need to reset the variables to `NULL` and zero, but I recommend it. In fact, you can do the above at any point (when you are no longer interested in the contents of the current field), as `csv_next_field()` will then automatically allocate a new buffer as necessary.

Note that `free(NULL);` is always safe and does nothing, so you do not need to check whether `field` is `NULL` before freeing it. This is also the reason why I recommend initializing the variables immediately when you declare them: it just makes everything so much easier to handle.

The compiled example program takes one or more CSV file names as command-line parameters, then reads the files and reports the contents of each field. If you have a particularly fiendishly complex CSV file, this is useful for checking whether this approach reads all the fields correctly.
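Putting the steps above together, the core of such a program might look like this (a sketch; the `csv_*()` functions and the `CSV_OK`/`CSV_END` constants are the ones described above and are assumed to be declared elsewhere, e.g. in a hypothetical csv.h header, and the error handling is deliberately simple):

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Report every field of every record in an already wide-oriented stream. */
static void process_csv(FILE *stream)
{
    wchar_t *field = NULL;
    size_t allocated = 0;
    size_t length;
    int status;

    do {
        /* Read every field in the current record. */
        while ((status = csv_next_field(stream, &field, &allocated, &length)) == CSV_OK)
            wprintf(L"Field: \"%ls\" (%zu wide characters)\n", field, length);

        if (status < 0)
            break;                        /* an error occurred */

        /* status == CSV_END: no more fields here; move to the next record. */
        status = csv_next_record(stream);
    } while (status == CSV_OK);

    if (status < 0)
        fprintf(stderr, "Error reading CSV: %s.\n", csv_error(status));

    /* Done: discard the field buffer. */
    free(field);
    field = NULL;
    allocated = 0;
}
```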
It is possible, but it is the completely wrong thing to do.
First of all, the current locale is for the user to decide. It is not just the character set, but also the language, date and time formats, and so on. Your program has absolutely no "right" to mess with it.
If you cannot localize your program, just tell the user the environmental requirements your program has, and let them worry about it.
Really, you should not rely on UTF-8 being the current encoding, but use the wide character support instead, including functions like `wctype()`, `mbstowcs()`, and so on. POSIXy systems also provide the `iconv_open()` and `iconv()` function family in their C libraries to convert between encodings (which should always include conversion to and from `wchar_t`); on Windows, you need the separate `libiconv` library. This is how, for example, the GCC compiler handles different character sets. (Internally, it uses Unicode/UTF-8, but if you ask it to, it can do the necessary conversions to work with other character sets.)

I am personally a strong proponent of using UTF-8 everywhere, but overriding the user locale in a program is horrific. Abominable. Distasteful; like a desktop applet changing the display resolution because the programmer is particularly fond of a certain one.
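As a rough illustration of the iconv route (a sketch; the `WCHAR_T` target encoding name is supported by glibc, but encoding names can vary between platforms, and the sample text is arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <iconv.h>

int main(void)
{
    setlocale(LC_ALL, "");                        /* honour the user's locale */

    const char utf8[] = "Hyv\xc3\xa4\xc3\xa4 uutta vuotta!";  /* UTF-8 input */
    wchar_t    wide[128];

    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");  /* from UTF-8 to wchar_t */
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return EXIT_FAILURE;
    }

    char  *in      = (char *)utf8;
    size_t inleft  = strlen(utf8);
    char  *out     = (char *)wide;
    size_t outleft = sizeof wide - sizeof (wchar_t);

    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
        perror("iconv");
        iconv_close(cd);
        return EXIT_FAILURE;
    }
    *(wchar_t *)out = L'\0';                      /* iconv does not add this */
    iconv_close(cd);

    wprintf(L"%ls\n", wide);
    return EXIT_SUCCESS;
}
```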
I would be happy to write some example code to show how to correctly solve any character-set-sensitive situation, but there are so many that I don't know where to start.
If the OP amends their question to state exactly what problem overriding the character set is supposed to solve, I'm willing to show how to use the aforementioned utilities and POSIX facilities (or equivalent freely available libraries on Windows) to solve it correctly.
If this seems harsh to someone, it is, but only because taking the easy and simple route here (overriding the user's locale setting) is so ... wrong, purely on technical grounds. Even no action is better, and actually quite acceptable, as long as you just document your application only handles UTF-8 input/output.
Example 1. Localized Happy New Year!
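A sketch of the idea (the greetings and languages chosen here are arbitrary):

```c
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    /* Use the locale the user has chosen (language, character set, etc.). */
    setlocale(LC_ALL, "");

    /* Wide string literals use the L"" prefix; %ls prints a wide string. */
    wprintf(L"%ls\n", L"Happy New Year!");
    wprintf(L"%ls\n", L"Hyvää uutta vuotta!");   /* Finnish */
    wprintf(L"%ls\n", L"¡Feliz Año Nuevo!");     /* Spanish */

    return EXIT_SUCCESS;
}
```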
Note that `wprintf()` takes a wide string (wide string constants are of the form `L""`, and wide character constants `L''`, as opposed to the normal/narrow counterparts `""` and `''`). Formats are still the same: `%s` prints a normal/narrow string, and `%ls` a wide string.

Example 2. Reading input lines from standard input, and optionally saving them to a file. The file name is supplied on the command line.
The `getwline()` function above is pretty much at the most complicated end of functions you might need when dealing with localized wide character support. It allows you to read localized input lines without length restrictions, and optionally trims and cleans up (removing control codes and embedded binary zeros) the returned string. It also works fine with both LF and CR LF (`\n` and `\r\n`) newline encodings.
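As a rough idea of what such a helper involves, here is a simplified `getwline()`-style reader (a sketch only: the interface and name are assumptions, and it omits the trimming/cleanup options described above):

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <errno.h>

/* Read one line of wide characters from a wide-oriented stream into a
   dynamically grown buffer.  Accepts both LF and CR LF newlines.  Returns
   the number of wide characters stored (excluding the terminating L'\0'),
   or (size_t)-1 if nothing was read at end of input or an error occurred. */
size_t getwline_sketch(wchar_t **lineptr, size_t *sizeptr, FILE *in)
{
    size_t used = 0;
    wint_t wc = WEOF;

    if (!lineptr || !sizeptr || !in) {
        errno = EINVAL;
        return (size_t)-1;
    }

    while ((wc = fgetwc(in)) != WEOF) {
        /* Grow the buffer when needed, always leaving room for the L'\0'. */
        if (used + 2 > *sizeptr) {
            size_t   newsize = (used | 127) + 129;
            wchar_t *newline = realloc(*lineptr, newsize * sizeof *newline);
            if (!newline) {
                errno = ENOMEM;
                return (size_t)-1;
            }
            *lineptr = newline;
            *sizeptr = newsize;
        }
        if (wc == L'\r')
            continue;          /* drop the CR of a CR LF pair */
        if (wc == L'\n')
            break;             /* end of line */
        (*lineptr)[used++] = (wchar_t)wc;
    }

    if (used == 0 && wc == WEOF)
        return (size_t)-1;     /* end of input (or a read error) */

    (*lineptr)[used] = L'\0';
    return used;
}
```

A caller would typically initialize `wchar_t *line = NULL; size_t size = 0;`, loop `while ((len = getwline_sketch(&line, &size, stdin)) != (size_t)-1)`, and `free(line)` when done.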