How to implement a variable-length ‘string’-y in C

2019-02-03 21:12发布

问题:

I’ve googled quite a bit, but I can’t find information on how variable-length strings are generally implemented in higher-level languages. I’m creating my own such language, and am not sure where to start with strings.

I have a struct describing the string type, and then a create function that allocates such a ‘string’:

/* A safer `strcpy()`, using `strncpy()` and `sizeof()` */
#define STRCPY(TO, FROM) \
  strncpy(TO, FROM, sizeof(TO)); TO[sizeof(TO) - 1] = '\0'

struct string {
  // …
  char  native[1024];
};

string String__create(char native[]) {
  string this = malloc(sizeof(struct string));

  // …
  STRCPY(this->native, native);

  return this;
}

However, that would only allow 1kb-long strings. That’s sort of silly, and a huge waste of memory in most cases.

Given that I have to declare the memory to be used somehow… how do I go about implementing a string that can (efficiently) store an (effectively) unbounded number of characters?

回答1:

Many C++ std::string implementations now use a "Small String Optimization". In pseudo-code:

struct string {
    Int32 length
    union {
        char[12] shortString
        struct {
           char* longerString
           Int32 heapReservedSpace
        }
    }
}

The idea is that string up to 12 characters are stored in the shortString array. The entire string will be contiguous and use only a single cache line. Longer strings are stored on the heap. This leaves you with 12 spare bytes in the string object. The pointer doesn't take all of that, so you can also remember how much memory you've allocated on the heap (>=length). That helps to support scenario's in which you grow a string in small increments.



回答2:

The common approach is to have a field for length and a pointer to a dynamically allocated region of memory to hold the characters.

typedef struct string {
    size_t length;
    unsigned char *native;
} string_t;

string_t String__create(char native[]) {
    string_t this;
    this.length = strlen(native);
    this.native = malloc(this.length+1);
    if (this.native) {
        strncpy(this.native, native, this.length+1);
    } else {
        this.length = 0;
    }
    return this;
}

If you want to dynamically allocate the string_t:

string_t* String__create(char native[]) {
    string_t* this;
    if (this = malloc(sizeof(*this))) {
        this->length = strlen(native);
        this->native = malloc(this->length+1);
        if (this->native) {
            strncpy(this->native, native, this->length+1);
        } else {
            free(this);
            this=NULL;
        }
    }
    return this;
}
void String__delete(string_t** str) {
    free((*str)->native);
    free((*str));
    *str=NULL;
}


回答3:

In addition to what others have told you, I'd also include the concept of "capacity": It is not possible to know the size of the allocated vector in memory, you must store it. If you do a sizeof of the String struct, it will return you 4 bytes * 3 numeric fields = 12 bytes (probably bigger due to padding used in structures). Also, you cannot get the length of allocated memory through sizeof.

typedef struct _mystring {
        char * native;
        size_t size;
        size_t capacity;
} String;

This way, capacity always bears the actual size of the chunk in which your string is. Say that your string goes shorter: you don't have to realloc in order to get an exact match between the capacity and the size of your string. Also, you can alloc from the beginning the characters you expect the string to have, and not the characters the initial string has. Finally, you can mimic the C++ string char dynamic vector and double capacity each time the string grows beyond the capacity limit. All of these will keep memory operations to a minimum, which will translate in better performance (you will also waste some memory, but never as much as 1Kb).

String String__create(char native[], size_t capacity) {
  String this;

  this.size = strlen( native );
  if ( capacity < ( this.size + 1 ) )
        this.capacity = this.size + 1;
  else  this.capacity = capacity;

  this.native = (char *) malloc( capacity * sizeof( char ) );
  strcpy( this.native, native );

  return this;
}

String * String__set(String *this, char native[]) {
    this->size = strlen( native );

    if ( this->size >= this->capacity ) {
        do {
            this->capacity <<= 1;
        } while( this->size > this->capacity );

        this->native = realloc( this->native, this->capacity );
    }

    strcpy( this->native, native );

    return this;
}

void String__delete(String *this)
{
    free( this->native );
}


回答4:

realloc will relocate your memory to a bigger buffer -- simply using this command will allow you to resize your string. Use the following struct for your string:

struct mystring
{
    char * string;
    size_t size;
};

The important part being keeping a track of the size of your string, and having every string manipulation function testing if the size makes sense.

You could pre-allocate a large buffer and keep adding to it, and only realloc when said buffer is full -- you have to implement all the functions for this. It is preferable (far less error prone, and more secure) to mutate string by moving from one immutable string to another (using the semantics of realoc).



回答5:

Some people prefer to use the "rope" data structure to store a string of characters of unbounded length, rather than a contiguous string (C string).

A simplified rope can be defined something like:

#include <stdio.h>

struct non_leaf_rope_node{
    char zero;
    union rope* left_substring;
    union rope* right_substring;
    // real rope implementations have a few more items here
};
#define rope_leaf_max ( sizeof( struct non_leaf_rope_node ) )

typedef union rope {
    char rope_data[ rope_leaf_max ];
    struct non_leaf_rope_node pointers;
} rope;

void print( union rope *node ){
    if( node->rope_data[0] != '\0' ){
        // short literal data
        fputs( node->rope_data, stdout );
    }else{
        // non-root node
        print( node->pointers.left_substring );
        print( node->pointers.right_substring );
    };
};
// other details involving malloc() and free() go here

int main(void){
    rope left = { "Hello," };
    rope right = { " World!" };
    rope root = {0,0,0};
    root.pointers.left_substring = &left;
    root.pointers.right_substring = &right;
    print( &root );

    return 0;
};

A rope with less than rope_leaf_max characters is stored the same as a null-terminated C string. A rope containing more than rope_leaf_max characters is stored as a root non_leaf_rope_node pointing to the left and right sub-strings, (which may in turn point to left and right sub-sub-strings), eventually pointing to leaf nodes, and the leaf nodes each contain at least one character of the full string.

A rope always stores at least one character, so we can always tell: If the first byte of a rope node is non-zero, that node is a leaf node storing literal characters. If the first byte of a rope node is zero, that node stores pointers to left and right sub-strings. (Real rope implementations often have a third kind of rope node).

Often using ropes requires less total RAM space than using C strings. (A node containing a phrase such as "New York City" can be re-used multiple times in one rope, or in some implementations shared between two ropes). Sometimes using ropes is faster than using C strings.