Allocating a dynamic array in a dynamically alloca

2019-06-01 05:02发布

问题:

This question is really about how to use variable-length types in the Python/C API (PyObject_NewVar, PyObject_VAR_HEAD, PyTypeObject.tp_basicsize and .tp_itemsize , but I can ask this question without bothering with the details of the API. Just assume I need to use an array inside a struct.

I can create a list data structure in one of two ways. (I'll just talk about char lists for now, but it doesn't matter.) The first uses a pointer and requires two allocations. Ignoring #includes and error handling:

struct listptr {
    size_t elems;
    char *data;
};
struct listptr *listptr_new(size_t elems) {
    size_t basicsize = sizeof(struct listptr), itemsize = sizeof(char);
    struct listptr *lp;
    lp = malloc(basicsize);
    lp->elems = elems;
    lp->data = malloc(elems * itemsize);
    return lp;
}

The second way to create a list uses array notation and one allocation. (I know this second implementation works because I've tested it pretty thoroughly.)

struct listarray {
    size_t elems;
    char data[1];
};
struct listarray *listarray_new(size_t elems) {
    size_t basicsize = offsetof(struct listarray, data), itemsize = sizeof(char);
    struct listarray *la;
    la = malloc(basicsize + elems * itemsize);
    la->elems = elems;
    return lp;
}

In both cases, you then use lp->data[index] to access the array.

My question is why does the second method work? Why do you declare char data[1] instead of any of char data[], char data[0], char *data, or char data? In particular, my intuitive understanding of how structs work is that the correct way to declare data is char data with no pointer or array notation at all. Finally, are my calculations of basicsize and itemsize correct in both implementations? In particular, is this use of offsetof guaranteed to be correct for all machines?

Update

Apparently this is called a struct hack: In C99, you can use a flexible array member:

struct listarray2 {
    size_t elems;
    char data[];
}

with the understanding that you'll malloc enough space for data at runtime. Before C99, the data[1] declaration was common. So my question now is why declare char data[1] or char data[] instead of char *data or char data?

回答1:

The reason you'd declare char data[1] or char data[] instead of char *data or char data is to keep your structure directly serializable and deserializable. This is important in cases where you'll be writing these sorts of structures to disk or over a network socket, etc.

Take for example your first code snippet that requires two allocations. Your listptr type is not directly serializable. i.e. listptr.elems and the data pointed to by listptr.data are not in a contiguous piece of memory. There is no way to read/write this structure to/from disk with a generic function. You need a custom function that is specific to your struct listptr type to do it. i.e. On serialize you'd have to first write elems to disk, and then write the data pointed to by the data pointer. On deserialization you'd have to read elems, allocate the appropriate space to listptr.data and then read the data from disk.

Using a flexible array member solves this problem because listptr.elem and the listptr.data reside in a contiguous memory space. So to serialize it you can simply write out the total allocated size for the structure and then the structure itself. On deserialize you then first read the allocated size, allocate the needed space and then read your listptr struct into that space.

You may wonder why you'd ever really need this, but it can be an invaluable feature. Consider a data stream of heterogeneous types. Provided you define a header that defines the which heterogeneous type you have and its size and precede each type in the stream with this header, you can generically serialize and deserialize data stream very elegantly and efficiently.

The only reason I know of for choosing char data[1] over char data[] is if you are defining an API that needs to be portable between C99 and C++ since C++ does not have support for flexible array members.

Also, wanted to point out that in the char data[1] you can do the following to get the total needed structure size:

size_t totalsize = offsetof(struct listarray, data[elems]);

You also ask why you wouldn't use char data instead of char data[1] or char data[]. While technically possible to use just plain old char data, it would be (IMHO) morally shunned. The two main issues with this approach are:

  1. You wanted an array of chars, but now you can't access the data member directly as an array. You need to point a pointer to the address of data to access it as an array. i.e.

    char *as_array = &listarray.data;

  2. Your structure definition (and your code's use of the structure) would be totally misleading to anyone reading the code. Why declare a single char when you really meant an array of char?

Given these two things, I don't know why anyone would use char data in favor of char data[1]. It just doesn't benefit anyone given the alternatives.