Does Python bytearray use signed integers in the C representation?

Published 2019-08-24 03:09

Question:

I have written a small Cython tool for in-place sorting of structures exposing the buffer protocol in Python. It's a work in progress; please forgive any mistakes. This is just for me to learn.

In my set of unit tests, I am working on testing the in-place sort across many different kinds of buffer-exposing data structures, each with many types of underlying data contained in them. I can verify it is working as expected for most cases, but the case of bytearray is very peculiar.

If you take it for granted that my imported module b in the code below is just performing a straightforward heap sort in Cython, in-place on the bytearray, then the following code sample shows the issue:

In [42]: a #NumPy array
Out[42]: array([  9, 148, 115, 208, 243, 197], dtype=uint8)

In [43]: byt = bytearray(a)

In [44]: byt
Out[44]: bytearray(b'\t\x94s\xd0\xf3\xc5')

In [45]: list(byt)
Out[45]: [9, 148, 115, 208, 243, 197]

In [46]: byt1 = copy.deepcopy(byt)

In [47]: b.heap_sort(byt1)

In [48]: list(byt1)
Out[48]: [148, 197, 208, 243, 9, 115]

In [49]: list(bytearray(sorted(byt)))
Out[49]: [9, 115, 148, 197, 208, 243]

What you can see is that when using sorted, the values are iterated and treated like Python integers for the purpose of sorting, then placed back into a new bytearray.

But the in-place sort, in In [47]-[48], shows that the bytes are being interpreted as signed integers and sorted by their two's complement value, putting numbers >= 128, which read as negative, toward the left.
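The signed-interpretation hypothesis can be reproduced in plain Python (my own sketch, not the Cython code): sorting the same values by their signed-char reinterpretation gives exactly the order in Out[48]:

```python
# Reinterpret each unsigned byte (0..255) as a signed char (-128..127)
# and sort by that signed value.
data = [9, 148, 115, 208, 243, 197]

def as_signed_char(b):
    # Two's complement: values >= 128 map to negative numbers.
    return b - 256 if b >= 128 else b

print(sorted(data, key=as_signed_char))
# [148, 197, 208, 243, 9, 115] -- same order as the in-place heap_sort
```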

I can confirm it by running over the whole range 0-255:

In [50]: byt = bytearray(range(0,256))

In [51]: b.heap_sort(byt)

In [52]: list(byt)
Out[52]: [128, 129, 130, ..., 253, 254, 255, 0, 1, 2, ..., 125, 126, 127]  # abbreviated: 128-255 first, then 0-127

I know this is difficult to reproduce. You can build the linked package with Cython if you want, and then import src.buffersort as b to get the same sort functions I am using.

I've tried reading through the source code for bytearray in Objects/bytearrayobject.c, but I see some references to long and a few calls to PyInt_FromLong ...

This makes me suspect that the underlying C-level data of a bytearray is represented as a signed type in C, but that the conversion from raw bytes to Python int treats the values as unsigned, between 0 and 255. I can only assume this is true, though I don't see why Python should interpret the C value as unsigned, unless that is merely a convention for bytearray that I didn't see in the code. But if so, why wouldn't an unsigned type be used on the C side as well, if the bytes are always treated by Python as unsigned?
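The two interpretations of the same raw byte are easy to see with the struct module (my illustration, not taken from the CPython source):

```python
import struct

raw = b'\x94'  # the byte 0x94, i.e. 148 unsigned

(signed,) = struct.unpack('b', raw)    # interpret as signed char
(unsigned,) = struct.unpack('B', raw)  # interpret as unsigned char
print(signed, unsigned)
# -108 148
```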

If true, what should be considered the "right" result of the in-place sort? Since they are "just bytes", either interpretation is valid, I guess, but in the Python spirit I think there should be one way which is considered the standard.

To match output of sorted, will it be sufficient on the C side to cast values to unsigned long when dealing with bytearray?

Answer 1:

Does Python bytearray use signed integers in the C representation?

It uses chars. Whether those are signed depends on the compiler. You can see this in Include/bytearrayobject.h. Here's the 2.7 version:

/* Object layout */
typedef struct {
    PyObject_VAR_HEAD
    /* XXX(nnorwitz): should ob_exports be Py_ssize_t? */
    int ob_exports; /* how many buffer exports */
    Py_ssize_t ob_alloc; /* How many bytes allocated */
    char *ob_bytes;
} PyByteArrayObject;

and here's the 3.5 version:

typedef struct {
    PyObject_VAR_HEAD
    Py_ssize_t ob_alloc; /* How many bytes allocated in ob_bytes */
    char *ob_bytes;      /* Physical backing buffer */
    char *ob_start;      /* Logical start inside ob_bytes */
    /* XXX(nnorwitz): should ob_exports be Py_ssize_t? */
    int ob_exports;      /* How many buffer exports */
} PyByteArrayObject;

If true, what should be considered the "right" result of the in-place sort?

A Python bytearray represents a sequence of integers in the range 0 <= elem < 256, regardless of whether the compiler considers chars to be signed. You should probably sort it as a sequence of integers in the range 0 <= elem < 256, rather than as a sequence of signed chars.
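That 0 <= elem < 256 range is enforced by bytearray at the Python level regardless of whether the platform's char is signed, which can be checked directly:

```python
ba = bytearray([0, 255])
assert all(0 <= x < 256 for x in ba)

try:
    ba[0] = -1    # negative values are rejected
except ValueError as e:
    print(e)

try:
    ba[0] = 256   # as are values >= 256
except ValueError as e:
    print(e)
```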

To match output of sorted, will it be sufficient on the C side to cast values to unsigned long when dealing with bytearray?

I don't know enough about Cython to say what the correct code change would be.
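In plain Python terms, though, the fix amounts to comparing each byte after mapping it back to 0..255; in C or Cython that corresponds to casting to unsigned char before comparing. A sketch of the idea (not the actual buffersort code):

```python
# What a signed char would see for the bytes [9, 148, 115, 208, 243, 197].
signed_values = [9, -108, 115, -48, -13, -59]

# Masking with 0xFF recovers the unsigned 0..255 interpretation,
# the same effect as casting to unsigned char in C.
unsigned_order = sorted(signed_values, key=lambda v: v & 0xFF)
print([v & 0xFF for v in unsigned_order])
# [9, 115, 148, 197, 208, 243] -- matches sorted(byt)
```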