Unicode character not in range when calling locale

2020-06-03 08:31发布

问题:

I am experiencing an odd behavior when using the locale library with unicode input. Below is a minimum working example:

>>> x = '\U0010fefd'
>>> ord(x)
1113853
>>> ord('\U0010fefd') == 0X10fefd
True
>>> ord(x) <= 0X10ffff
True
>>> import locale
>>> locale.strxfrm(x)
'\U0010fefd'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character U+110000 is not in range [U+0000; U+10ffff]

I have seen this on Python 3.3, 3.4 and 3.5. I do not get an error on Python 2.7.

As far as I can see, my unicode input is within the appropriate unicode range, so it seems that somehow something internal to strxfrm when using the 'en_US.UTF-8' is moving the input out of range.

I am running Mac OS X, and this behavior may be related to http://bugs.python.org/issue23195... but I was under the impression this bug would only manifest as incorrect results, not a raised exception. I cannot replicate on my SLES 11 machine, and others confirm they cannot replicate on Ubuntu, Centos, or Windows. It may be instructive to hear about other OS's in the comments.

Can someone explain what may be happening here under the hood?

回答1:

In Python 3.x, the function locale.strxfrm(s) internally uses the POSIX C function wcsxfrm(), which is based on current LC_COLLATE setting. The POSIX standard define the transformation in this way:

The transformation shall be such that if wcscmp() is applied to two transformed wide strings, it shall return a value greater than, equal to, or less than 0, corresponding to the result of wcscoll() applied to the same two original wide-character strings.

This definition can be implemented in multiple ways, and doesn't even require that the resulting string is readable.

I've created a little C code example to demonstrate how it works:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
  wchar_t buf[10];
  wchar_t *in = L"\x10fefd";
  int i;

  setlocale(LC_COLLATE, "en_US.UTF-8");

  printf("in : ");
  for(i=0;i<10 && in[i];i++)
    printf(" 0x%x", in[i]);
  printf("\n");

  i = wcsxfrm(buf, in, 10);

  printf("out: ");
  for(i=0;i<10 && buf[i];i++)
    printf(" 0x%x", buf[i]);
  printf("\n");
}

It prints the string before and after the transformation.

Running it on Linux (Debian Jessie) this is the result:

in : 0x10fefd
out: 0x1 0x1 0x1 0x1 0x552

while running it on OSX (10.11.1) the result is:

in : 0x10fefd
out: 0x103 0x1 0x110000

You can see that the output of wcsxfrm() on OSX contains the character U+110000 which is not permitted in a Python string, so this is the source of the error.

On Python 2.7 the error is not raised because its locale.strxfrm() implementation is based on strxfrm() C function.

UPDATE:

Investigating further, I see that the LC_COLLATE definition for en_US.UTF-8 on OSX is a link to la_LN.US-ASCII definition.

$ ls -l /usr/share/locale/en_US.UTF-8/LC_COLLATE
lrwxr-xr-x 1 root wheel 28 Oct  1 14:24 /usr/share/locale/en_US.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE

I found the actual definition in the sources from Apple. The content of file la_LN.US-ASCII.src is the following:

order \
    \x00;...;\xff

2nd UPDATE:

I've further tested the wcsxfrm() function on OSX. Using the la_LN.US-ASCII collate, given a sequence of wide character C1..Cn as input, the output is a string with this form:

W1..Wn \x01 U1..Un

where

Wx = 0x103 if Cx > 0xFF else Cx+0x3
Ux = Cx+0x103 if Cx > 0xFF else Cx+0x3

Using this algorithm \x10fefd become 0x103 0x1 0x110000

I've checked and every UTF-8 locale use this collate on OSX, so I'm inclined to say that the collate support for UTF-8 on Apple systems is broken. The resulting ordering is almost the same of the one obtained whith normal byte comparison, with the bonus of the ability to obtain illegal Unicode characters.