I am working on an internationalization project. Do other languages, such as Arabic or Chinese, use different representations for digits besides 0-9? If so, are there versions of atoi() that will account for these other representations?
I should add that I am mainly concerned with parsing input from the user. If the user types in some other representation, I want to be sure that I recognize it as a number and treat it accordingly.
I may use std::wistringstream and a locale to parse this integer.
#include <sstream>
#include <locale>
using namespace std;

int main()
{
    locale mylocale("");          // Construct locale object with the user's default preferences
    wistringstream wss(L"1");     // your number string
    wss.imbue(mylocale);          // Imbue that locale
    int target_int = 0;
    wss >> target_int;            // target_int now holds the parsed value
    return 0;
}
More info is available in the reference documentation for the stream classes and the locale class.
If you are concerned about international characters, then you need to ensure you use a "Unicode-aware" function such as _wtoi().
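For example, a minimal sketch assuming the Microsoft CRT, where _wtoi() is the wide-character counterpart of atoi() (the input value here is just an assumption):

#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    const wchar_t *input = L"137";   // wide-character user input (assumed value)
    int num = _wtoi(input);          // wide-string counterpart of atoi()
    wprintf(L"%d\n", num);           // prints 137
    return 0;
}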
You can also check whether UNICODE is defined to make the code type independent (from MSDN):
TCHAR tstr[4] = TEXT("137");
int num;

#ifdef UNICODE
size_t cCharsConverted;
CHAR strTmp[SIZE];   // SIZE equals (2*(sizeof(tstr)+1)). This ensures enough
                     // room for the multibyte characters if they are two
                     // bytes long and a terminating null character. See the
                     // caution below.
wcstombs_s(&cCharsConverted, strTmp, sizeof(strTmp), (const wchar_t *)tstr, sizeof(strTmp));
num = atoi(strTmp);
#else
num = atoi(tstr);
#endif
In this example, the standard C library function wcstombs translates Unicode to ASCII. The example relies on the fact that the digits 0 through 9 can always be translated from Unicode to ASCII, even if some of the surrounding text cannot. The atoi function stops at any character that is not a digit.
Your application can use the National Language Support (NLS) LCMapString function to process text that includes the native digits provided for some of the scripts in Unicode.
Caution: Using the wcstombs function incorrectly can compromise the security of your application. Make sure that the application buffer for the string of 8-bit characters is at least of size 2*(char_length + 1), where char_length represents the length of the Unicode string. This restriction is made because, with double-byte character sets (DBCSs), each Unicode character can be mapped to two consecutive 8-bit characters. If the buffer does not hold the entire string, the result string is not null-terminated, posing a security risk. For more information about application security, see Security Considerations: International Features.
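To make the point about native digits concrete, here is a small sketch of my own (not from the quoted documentation). Instead of LCMapString it uses the related NLS routine FoldStringW with the MAP_FOLDDIGITS flag, which folds native digit characters, here Arabic-Indic ١٣٧, to the ASCII digits 0-9 before the usual parsing step; the sample input and buffer size are assumptions, so check the Windows documentation for the exact behavior you need.

#include <windows.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    // "137" written with Arabic-Indic digits (U+0661 U+0663 U+0667) -- assumed sample input
    const wchar_t native[] = L"\u0661\u0663\u0667";
    wchar_t folded[16] = { 0 };

    // MAP_FOLDDIGITS folds native digit characters to the corresponding 0-9
    int written = FoldStringW(MAP_FOLDDIGITS, native, -1, folded, 16);
    if (written == 0)
        return 1;                    // call failed; see GetLastError()

    int num = _wtoi(folded);         // folded now holds L"137"
    wprintf(L"%d\n", num);
    return 0;
}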