Does the underlying character set depend only on t

Posted 2019-02-22 10:23

Question:

Many texts warn that processing char values as integers isn't portable, e.g. assuming that the value of 'A' is 65 (as in ASCII).

But what determines whether this character set is ASCII (or an extended form), or some other character set? Is it determined by the operating system, or the compiler? I'm presuming that this isn't dependent on the hardware.

For example, could an Intel PC have a character set such as EBCDIC (in theory)? And could changing the LANG environment variable in Linux/Unix change the values of the basic character set for C programs (if then recompiled)?

(edit: I see now that the various non-Latin character sets in Linux all have the same basic ASCII codes, e.g. KOI8-U - I assumed that there were variations that had character sets not compatible with ASCII)

Answer 1:

The standard doesn't care about any of those details; as far as it's concerned, there's only "the implementation".

In practice, hardware and OSes can both specify implementation details that C implementations on that platform are expected to use, or that they're required to use if they want to interoperate with system functions (that is to say, code that is supplied with the OS or with the hardware). So we often say things like, "on Win32, sizeof(void*) == 4". This is shorthand, though, since someone could, if they chose, write a C implementation that runs on 32-bit Windows and has a different pointer size. What we really mean is, "in the Win32 ABI, sizeof(void*) == 4, and C implementations running on Win32 that don't follow the Win32 ABI are excluded from consideration".

Implementations can therefore do whatever they like, provided they don't mind losing the ability to (for example) use DLLs that follow the system's conventions. The character set can be defined however the writer of the compiler and standard libraries likes, subject only to what's in the standard.

That said, the values of character literals are compile-time constants. This tells you that the basic execution character set cannot change during runtime.

Furthermore, if it were to depend on an environment variable then it would be somebody's responsibility to ensure that the program was run with the same value that it was compiled with. This would be pretty user-unfriendly, but the standard doesn't actually forbid someone from writing a C implementation with peculiar restrictions on how programs are run.



Answer 2:

The C standard says this:

§5.2.1/1 in C99

Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.

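At program startup the "C" locale is in effect; the program only picks up the OS's locale when it calls setlocale(LC_ALL, "");. The locale affects only run-time library behaviour; it does not change the values of character literals, which were fixed when the program was compiled.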



Answer 3:

The compiler clearly determines which source and execution character sets are used, since cross-compilation can occur (e.g. compiling code for an IBM mainframe that uses EBCDIC on your Linux box that uses ASCII).



Tags: c ascii