The signedness of char is not standardized. Hence there are signed char and unsigned char types. Therefore functions which work with a single character must use an argument type which can hold both signed char and unsigned char (this type was chosen to be int), because if the argument type were char, we would get type conversion warnings from the compiler (if -Wconversion is used) in code like this:
    char c = 'ÿ';
    if (islower((unsigned char) c)) ...

    warning: conversion to ‘char’ from ‘unsigned char’ may change the sign of the result
(here we consider what would happen if the argument type of islower() were char)
And the thing which makes it work without explicit typecasting is the automatic promotion from char to int.
Further, the ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation of wchar_t.
Some quotations from the glibc reference:

> it would be legitimate to define wchar_t as char

> if wchar_t is defined as char the type wint_t must be defined as int due to the parameter promotion.

So, wchar_t can perfectly well be defined as char, which means that similar rules for wide character types must apply, i.e., there may be implementations where wchar_t is unsigned, and there may be implementations where wchar_t is signed.
From this it follows that there must exist unsigned wchar_t and signed wchar_t types (for the same reason as there are unsigned char and signed char types).
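Whether wchar_t and wint_t are signed on a particular implementation can at least be checked from their limit macros. A minimal sketch, assuming a hosted C99 (or later) implementation that provides WCHAR_MIN and WINT_MIN in <stdint.h>; the function names are made up for illustration:

```c
#include <stdint.h>  /* WCHAR_MIN, WINT_MIN */
#include <wchar.h>   /* wchar_t, wint_t */

/* Returns 1 if wchar_t is a signed type on this implementation, else 0.
   For an unsigned wchar_t, WCHAR_MIN is 0, so the comparison is false. */
int wchar_t_is_signed(void)
{
    return WCHAR_MIN < 0;
}

/* Same check for wint_t, using its own minimum. */
int wint_t_is_signed(void)
{
    return WINT_MIN < 0;
}
```

The result differs between implementations, which is why portable code should not rely on either answer.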
Private communication reveals that an implementation is allowed to support wide characters with >=0 value only (independently of the signedness of wchar_t). Does anybody know what this means? Does this mean that when wchar_t is a 16-bit type (for example), we can only use 15 bits to store the value of a wide character?
In other words, is it true that a sign-extended wchar_t is a valid value?
See also this question.
Also, private communication reveals that the standard requires that any valid value of wchar_t must be representable by wint_t. Is it true?
Consider this example:
    #include <locale.h>
    #include <ctype.h>

    int main (void)
    {
        setlocale(LC_CTYPE, "fr_FR.ISO-8859-1");
        /* 11111111 */
        char c = 'ÿ';
        if (islower(c)) return 0;
        return 1;
    }
To make it portable, we need the cast to '(unsigned char)'.
This is necessary because char may be equivalent to signed char, in which case a byte where the top bit is set would be sign-extended when converting to int, yielding a value that is outside the range of unsigned char.
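The cast can be wrapped once so that callers cannot forget it. A minimal sketch, assuming the default "C" locale for the values noted afterwards; the helper name is_lower_char is hypothetical and not part of any library:

```c
#include <ctype.h>

/* Hypothetical helper: classify a plain char safely.
   Converting to unsigned char first guarantees that the value passed to
   islower() lies in 0..UCHAR_MAX, whether or not char is signed. */
int is_lower_char(char c)
{
    return islower((unsigned char)c) != 0;
}
```

In the "C" locale, is_lower_char('a') is 1 and is_lower_char('A') is 0, and a byte with the top bit set never reaches islower() as a negative value.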
Now, why is this scenario different from the following example for wide characters?
    #include <locale.h>
    #include <wchar.h>
    #include <wctype.h>

    int main(void)
    {
        setlocale(LC_CTYPE, "");
        wchar_t wc = L'ÿ';
        if (iswlower(wc)) return 0;
        return 1;
    }
We need to use iswlower((unsigned wchar_t)wc) here, but there is no unsigned wchar_t type.
Why are there no unsigned wchar_t and signed wchar_t types?
UPDATE
Are the standards saying that casting to unsigned int and to int in the following two programs is guaranteed to be correct?
(I just replaced wint_t and wchar_t with their actual definitions in glibc)
    #include <locale.h>
    #include <wchar.h>

    int main(void)
    {
        setlocale(LC_CTYPE, "en_US.UTF-8");
        unsigned int wc;
        wc = getwchar();
        putwchar((int) wc);
    }
--
    #include <locale.h>
    #include <wchar.h>
    #include <wctype.h>

    int main(void)
    {
        setlocale(LC_CTYPE, "en_US.UTF-8");
        int wc;
        wc = L'ÿ';
        if (iswlower((unsigned int) wc)) return 0;
        return 1;
    }
TL;DR:
Because C's wide-character handling facilities were defined such that they are not needed.
In more detail,
> The signedness of char is not standardized.

To be precise, "The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char." (C2011, 6.2.5/15)

> Hence there are signed char and unsigned char types.

"Hence" implies causation, which would be hard to argue clearly, but certainly signed char and unsigned char are more appropriate when you want to handle numbers, as opposed to characters.

> Therefore functions which work with a single character must use an argument type which can hold both signed char and unsigned char

No, not at all. Standard library functions that work with individual characters could easily be defined in terms of type char, regardless of whether that type is signed, because the library implementation does know its signedness. If that were a problem then it would apply equally to the string functions, too -- char would be useless.

Your example of getchar() is non-apposite. It returns int rather than a character type because it needs to be able to return an error indicator that does not correspond to any character. Moreover, the code you present does not correspond to the accompanying warning message: it contains a conversion from int to unsigned char, but no conversion from char to unsigned char.

Some other character-handling functions accept int parameters or return values of type int both for compatibility with getchar() and other stdio functions, and for historic reasons. In days of yore, you couldn't actually pass a char at all -- it would always be promoted to int, and that is what the functions would (and must) accept. One cannot later change the argument type, evolution of the language notwithstanding.

> Further, the ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation of wchar_t.

C90 isn't really relevant any longer, but no doubt it says something very similar to C2011 (7.19/2), which describes wchar_t as an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales.

Your quotations from the glibc reference are non-authoritative, except possibly for glibc only. They appear in any case to be commentary, not specification, and it's unclear why you raise them. Certainly, though, at least the first is correct. Referring to the standard, if all the members of the largest extended character set specified among the locales supported by a given implementation could fit in a char then that implementation could define wchar_t as char. Such implementations used to be much more common than they are today.

You ask several questions:
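> Private communication reveals that an implementation is allowed to support wide characters with >=0 value only (independently of signedness of wchar_t). Anybody knows what this means?

I think it means that whoever communicated that to you doesn't know what they are talking about, or perhaps that what they are talking about is something different than the requirements placed by the C standard. You will find that in practice, character sets are defined with only non-negative character codes, but that is not a constraint placed by the C standard.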
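> Does this mean that when wchar_t is a 16-bit type (for example), we can only use 15 bits to store the value of a wide character?

The C standard does not say or imply that. You can store the value of any supported character in a wchar_t. In particular, if an implementation supports a character set containing character codes exceeding 32767, then you can store those in a wchar_t.

> In other words, is it true that a sign-extended wchar_t is a valid value?

The C standard does not say or imply that. It does not even say whether wchar_t is a signed type (if not, then sign extension is meaningless for it). If it is a signed type, then there is no guarantee about whether sign-extending a value representing a character in some supported character set (which value could, in principle, be negative) will produce a value that also represents a character in that character set, or in any other supported character set. The same is true of adding 1 to a wchar_t value.

> Also, private communication reveals that the standard requires that any valid value of wchar_t must be representable by wint_t. Is it true?

It depends what you mean by "valid". The standard says that wint_t is an integer type, unchanged by default argument promotions, that can hold any value corresponding to members of the extended character set, as well as at least one value that does not correspond to any member of the extended character set (C2011, 7.29.1/2).
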
wchar_t must be able to hold any value corresponding to a member of the extended character set, in any supported locale. wint_t must be able to hold all of those values, too. It may be, however, that wchar_t is capable of representing values that do not correspond to any character in any supported character set. Such values are valid in the sense that the type can represent them. wint_t is not required to be able to represent such values.

For example, if the largest extended character set of any supported locale uses character codes up to but not exceeding 32767, then an implementation would be free to implement wchar_t as an unsigned 16-bit integer, and wint_t as a signed 16-bit integer. The values representable by wchar_t that do not correspond to extended characters are then not representable by wint_t (but wint_t still has many candidates for its required value that does not correspond to any character).

With respect to the character and wide-character classification functions, the only answer is that the differences simply arise from different specifications. The char classification functions are defined to work with the same values that getchar() is defined to return -- either EOF (typically -1) or a character value converted, if necessary, to unsigned char. The wide character classification functions, on the other hand, accept arguments of type wint_t, which can represent the values of all wide characters unchanged, therefore there is no need for a conversion.

You claim in this regard that
> We need to use iswlower((unsigned wchar_t)wc) here, but there is no unsigned wchar_t type.

No and maybe. You do not need to convert the wchar_t argument to iswlower() to any other type, and in particular, you do not need to convert it to an explicitly unsigned type. The wide character classification functions are not analogous to the regular character classification functions in this respect, having been designed with the benefit of hindsight. As for unsigned wchar_t, C does not require such a type to exist, so portable code should not use it, but it may exist in some implementations.

Regarding the update appended to the question:
> Are the standards saying that casting to unsigned int and to int in the following two programs is guaranteed to be correct?

The standard says nothing of the sort about conforming implementations in general. I'll suppose, however, that you mean to ask specifically about conforming implementations for which wchar_t is int and wint_t is unsigned int.

On such an implementation, your first program is flawed because it does not account for the possibility that getwchar() returns WEOF. Converting WEOF to type wchar_t, if doing so does not cause a signal to be raised, is not guaranteed to produce a value that corresponds to any wide character. Passing the result of such a conversion to putwchar() therefore does not exhibit defined behavior. Moreover, if WEOF is defined with the same value as UINT_MAX (which is not representable by int) then the conversion of that value to int has implementation-defined behavior independently of the putwchar() call.

On the other hand, I think the key point you are struggling with is that if the value returned by getwchar() in the first program is not WEOF, then it is guaranteed to be one that is unchanged by conversion to wchar_t. Your first program will perform as appears to be intended in that case, but the cast to int (or wchar_t) is unnecessary.

Similarly, the second program is correct provided that the wide-character literal corresponds to a character in the applicable extended character set, but the cast is unnecessary and changes nothing. The wchar_t value of such a literal is guaranteed to be representable by type wint_t, so the cast changes the type of its operand, but not the value. (But if the literal does not correspond to a character in the extended character set then the behavior is implementation-defined.)

On the third hand, if your objective is to write strictly-conforming code then the right thing to do, and indeed the intended usage mode of these particular wide-character functions, would be this:
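Something along these lines for the first program. This is a minimal sketch rather than a definitive listing: the helper name put_unless_weof is made up for illustration, and it takes the stream as a parameter (fputwc() instead of putwchar()) only so that it can be exercised in isolation:

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: write wc to `out` unless it is WEOF.
   The value stays in a wint_t until it is known to be a real wide
   character, so no cast is needed anywhere.
   Returns 1 if a character was written, 0 otherwise. */
int put_unless_weof(wint_t wc, FILE *out)
{
    if (wc == WEOF)
        return 0;                      /* end-of-file or read error: nothing to echo */
    return fputwc(wc, out) != WEOF;    /* implicit, value-preserving conversion to wchar_t */
}
```

Calling put_unless_weof(getwchar(), stdout) then does what the first program appears to intend.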
and this:
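For the second program, a comparable sketch (again with a made-up helper name, is_lower_wide, and assuming the default "C" locale for the values mentioned afterwards):

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: classify a wide character.
   The wchar_t argument is converted implicitly to the wint_t that
   iswlower() expects; no explicit cast is required. */
int is_lower_wide(wchar_t wc)
{
    return iswlower(wc) != 0;
}
```

In the "C" locale, is_lower_wide(L'a') is 1 and is_lower_wide(L'A') is 0; for a character such as L'ÿ' the result depends on the locale selected with setlocale().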