I am working on an internationalization project. Do other languages, such as Arabic or Chinese, use different representations for digits besides 0-9? If so, are there versions of atoi() that will account for these other representations?
I should add that I am mainly concerned with parsing input from the user. If the user types in some other representation, I want to be sure that I recognize it as a number and treat it accordingly.
I may use std::wistringstream and a locale to parse this integer.
#include <sstream>
#include <locale>
using namespace std;

int main()
{
    locale mylocale("");          // Construct locale object with the user's default preferences
    wistringstream wss(L"1");     // your number string
    wss.imbue(mylocale);          // Imbue that locale
    int target_int = 0;
    wss >> target_int;            // target_int now holds the parsed value
    return 0;
}
More info is available in the reference documentation for the stream classes and the locale class.
If you are concerned about international characters, then you need to ensure you use a "Unicode-aware" function such as _wtoi().
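For example, a minimal sketch assuming the Microsoft CRT, where _wtoi() is the wide-character counterpart of atoi() (the input value here is just an assumption):

#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    const wchar_t *input = L"137";   // wide-character user input (assumed value)
    int num = _wtoi(input);          // wide-string counterpart of atoi()
    wprintf(L"%d\n", num);           // prints 137
    return 0;
}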
You can also check whether UNICODE is defined to make the code type independent (from MSDN):
TCHAR tstr[4] = TEXT("137");
int num;

#ifdef UNICODE
size_t cCharsConverted;
CHAR strTmp[SIZE];   // SIZE equals (2*(sizeof(tstr)+1)). This ensures enough
                     // room for the multibyte characters if they are two
                     // bytes long and a terminating null character. See the
                     // caution below.
wcstombs_s(&cCharsConverted, strTmp, sizeof(strTmp), (const wchar_t *)tstr, sizeof(strTmp));
num = atoi(strTmp);
#else
num = atoi(tstr);
#endif
In this example, the standard C library function wcstombs translates Unicode to ASCII. The example relies on the fact that the digits 0 through 9 can always be translated from Unicode to ASCII, even if some of the surrounding text cannot. The atoi function stops at any character that is not a digit.
Your application can use the National Language Support (NLS) LCMapString function to process text that includes the native digits provided for some of the scripts in Unicode.
Caution: Using the wcstombs function incorrectly can compromise the security of your application. Make sure that the application buffer for the string of 8-bit characters is at least of size 2*(char_length + 1), where char_length represents the length of the Unicode string. This restriction is made because, with double-byte character sets (DBCSs), each Unicode character can be mapped to two consecutive 8-bit characters. If the buffer does not hold the entire string, the result string is not null-terminated, posing a security risk. For more information about application security, see Security Considerations: International Features.
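To make the point about native digits concrete, here is a small sketch of my own (not from the quoted documentation). Instead of LCMapString it uses the related NLS routine FoldStringW with the MAP_FOLDDIGITS flag, which folds native digit characters, here Arabic-Indic ١٣٧, to the ASCII digits 0-9 before the usual parsing step; the sample input and buffer size are assumptions, so check the Windows documentation for the exact behavior you need.

#include <windows.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    // "137" written with Arabic-Indic digits (U+0661 U+0663 U+0667) -- assumed sample input
    const wchar_t native[] = L"\u0661\u0663\u0667";
    wchar_t folded[16] = { 0 };

    // MAP_FOLDDIGITS folds native digit characters to the corresponding 0-9
    int written = FoldStringW(MAP_FOLDDIGITS, native, -1, folded, 16);
    if (written == 0)
        return 1;                    // call failed; see GetLastError()

    int num = _wtoi(folded);         // folded now holds L"137"
    wprintf(L"%d\n", num);
    return 0;
}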