I’ve googled quite a bit, but I can’t find information on how variable-length strings are generally implemented in higher-level languages. I’m creating my own such language, and am not sure where to start with strings.
I have a struct describing the string
type, and then a create
function that allocates such a ‘string’:
/* A safer `strcpy()`, using `strncpy()` and `sizeof()` */
#define STRCPY(TO, FROM) \
strncpy(TO, FROM, sizeof(TO)); TO[sizeof(TO) - 1] = '\0'
struct string {
// …
char native[1024];
};
string String__create(char native[]) {
string this = malloc(sizeof(struct string));
// …
STRCPY(this->native, native);
return this;
}
However, that would only allow 1kb-long strings. That’s sort of silly, and a huge waste of memory in most cases.
Given that I have to declare the memory to be used somehow… how do I go about implementing a string that can (efficiently) store an (effectively) unbounded number of characters?
In addition to what others have told you, I'd also include the concept of "capacity": It is not possible to know the size of the allocated vector in memory, you must store it. If you do a sizeof of the String struct, it will return you 4 bytes * 3 numeric fields = 12 bytes (probably bigger due to padding used in structures). Also, you cannot get the length of allocated memory through sizeof.
This way, capacity always bears the actual size of the chunk in which your string is. Say that your string goes shorter: you don't have to realloc in order to get an exact match between the capacity and the size of your string. Also, you can alloc from the beginning the characters you expect the string to have, and not the characters the initial string has. Finally, you can mimic the C++ string char dynamic vector and double capacity each time the string grows beyond the capacity limit. All of these will keep memory operations to a minimum, which will translate in better performance (you will also waste some memory, but never as much as 1Kb).
Many C++
std::string
implementations now use a "Small String Optimization". In pseudo-code:The idea is that string up to 12 characters are stored in the
shortString
array. The entire string will be contiguous and use only a single cache line. Longer strings are stored on the heap. This leaves you with 12 spare bytes in the string object. The pointer doesn't take all of that, so you can also remember how much memory you've allocated on the heap (>=length
). That helps to support scenario's in which you grow a string in small increments.Some people prefer to use the "rope" data structure to store a string of characters of unbounded length, rather than a contiguous string (C string).
A simplified rope can be defined something like:
A rope with less than rope_leaf_max characters is stored the same as a null-terminated C string. A rope containing more than rope_leaf_max characters is stored as a root non_leaf_rope_node pointing to the left and right sub-strings, (which may in turn point to left and right sub-sub-strings), eventually pointing to leaf nodes, and the leaf nodes each contain at least one character of the full string.
A rope always stores at least one character, so we can always tell: If the first byte of a rope node is non-zero, that node is a leaf node storing literal characters. If the first byte of a rope node is zero, that node stores pointers to left and right sub-strings. (Real rope implementations often have a third kind of rope node).
Often using ropes requires less total RAM space than using C strings. (A node containing a phrase such as "New York City" can be re-used multiple times in one rope, or in some implementations shared between two ropes). Sometimes using ropes is faster than using C strings.
realloc will relocate your memory to a bigger buffer -- simply using this command will allow you to resize your string. Use the following struct for your string:
The important part being keeping a track of the size of your string, and having every string manipulation function testing if the size makes sense.
You could pre-allocate a large buffer and keep adding to it, and only realloc when said buffer is full -- you have to implement all the functions for this. It is preferable (far less error prone, and more secure) to mutate string by moving from one
immutable string
to another (using the semantics of realoc).The common approach is to have a field for length and a pointer to a dynamically allocated region of memory to hold the characters.
If you want to dynamically allocate the
string_t
: