I'm starting a new project in plain C (c99) that is going to work primarily with text. Because of external project constraints, this code has to be extremely simple and compact, consisting of a single source-code file without external dependencies or libraries except for libc and similar ubiquitous system libraries.
With that understanding, what are some best-practices, gotchas, tricks, or other techniques that can help make the string handling of the project more robust and secure?
Without any additional information about what your code is doing, I would recommend designing all your interfaces like this:
size_t foobar(char *dest, size_t buf_size, /* operands here */)
with semantics like snprintf:
- dest points to a buffer of size at least buf_size.
- If buf_size is zero, null/invalid pointers are acceptable for dest and nothing will be written.
- If buf_size is non-zero, dest is always null-terminated.
- Each function foobar returns the length of the full non-truncated output; the output has been truncated if buf_size is less than or equal to the return value.
This way, when the caller can easily know the destination buffer size that's required, a sufficiently large buffer can be obtained in advance. If the caller cannot easily know, it can either call the function once with a zero argument for buf_size to learn the required size, or call it with a buffer that's "probably big enough" and retry only if the output was truncated.
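As a minimal sketch of this interface, here is a hypothetical str_repeat function (the name and operands are made up for illustration) that follows the rules above: it never writes past buf_size, always terminates a non-empty buffer, and returns the full untruncated length.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical example: write `count` copies of `s` into dest.
 * Returns the full untruncated length. dest may be NULL when buf_size is 0.
 * Overflow of the total length is not handled in this sketch. */
size_t str_repeat(char *dest, size_t buf_size, const char *s, size_t count)
{
    size_t len = strlen(s);
    size_t n = 0;                      /* length the full output would have */

    for (size_t i = 0; i < count; i++) {
        for (size_t j = 0; j < len; j++) {
            if (n + 1 < buf_size)      /* keep one byte for the terminator */
                dest[n] = s[j];
            n++;
        }
    }
    if (buf_size != 0)
        dest[n < buf_size ? n : buf_size - 1] = '\0';
    return n;
}
```

A caller can size the buffer with str_repeat(NULL, 0, s, count) + 1, or pass a fixed buffer and treat buf_size <= return value as truncation.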
You can also make a wrapped version of such calls analogous to the GNU asprintf function, but if you want your code to be as flexible as possible I would avoid doing any allocation in the actual string functions. Handling the possibility of failure is always easier at the caller level, and many callers can ensure that failure is never a possibility by using a local buffer or a buffer that was obtained much earlier in the program so that the success or failure of a larger operation is atomic (which greatly simplifies error handling).
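One way such an allocating wrapper might look is sketched below. It uses C99 vsnprintf as the underlying function with snprintf semantics; the name str_format_alloc is an assumption made for this example, not an existing API.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical allocating wrapper in the spirit of asprintf.
 * Returns a malloc'd string that the caller must free, or NULL on failure. */
char *str_format_alloc(const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    int need = vsnprintf(NULL, 0, fmt, ap);     /* first pass: measure only */
    va_end(ap);
    if (need < 0)
        return NULL;

    char *out = malloc((size_t)need + 1);
    if (out == NULL)
        return NULL;

    va_start(ap, fmt);
    int written = vsnprintf(out, (size_t)need + 1, fmt, ap);  /* second pass */
    va_end(ap);
    if (written < 0 || written > need) {
        free(out);
        return NULL;
    }
    return out;
}
```

All allocation and failure handling lives in the wrapper, so the underlying formatting function stays allocation-free as recommended above.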
Some thoughts from a long-time embedded developer, most of which elaborate on your requirement for simplicity and are not C-specific:
Decide which string-handling functions you'll need, and keep that set as small as possible to minimize the points of failure.
Follow R.'s suggestion to define a clear interface that is consistent across all string handlers. A strict, small-but-detailed set of rules allows you to use pattern-matching as a debugging tool: you can be suspicious of any code that looks different from the rest.
As Bart van Ingen Schenau noted, track the buffer length independently of the string length. If you'll always be working with text it's safe to use the standard null character to indicate end-of-string, but it's up to you to ensure the text+null will fit in the buffer.
Ensure consistent behavior across all string handlers, particularly where the standard functions are lacking: truncation, null inputs, null-termination, padding, etc.
If you absolutely need to violate any of your rules, create a separate function for that purpose and name it appropriately. In other words, give each function a single unambiguous behavior. So you might use str_copy_and_pad() for a function that always pads its target with nulls.
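A sketch of what str_copy_and_pad() could look like is below. The signature and return convention are assumptions borrowed from the snprintf-style interface suggested earlier in the thread, not something this answer specifies.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: copy src into dest, then fill the rest of dest with '\0'.
 * Returns strlen(src); the copy was truncated if buf_size <= return value. */
size_t str_copy_and_pad(char *dest, size_t buf_size, const char *src)
{
    size_t len = strlen(src);

    if (buf_size == 0)
        return len;                              /* nothing is written */

    size_t n = (len < buf_size - 1) ? len : buf_size - 1;
    memcpy(dest, src, n);
    memset(dest + n, '\0', buf_size - n);        /* terminator plus padding */
    return len;
}
```

Because the padding behavior has its own name, a plain copy function elsewhere in the code base can keep a simpler contract.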
Wherever possible, use safe built-in functions (e.g. memmove() per Jonathan Leffler) to do the heavy lifting. But test them to be sure they're doing what you think they're doing!
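For instance, deleting characters in place shifts the tail of a string over itself, which is exactly the overlapping case memmove() is defined for and memcpy() is not. The helper name str_delete is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: remove up to `count` characters starting at `pos` from the
 * null-terminated string s, in place. Returns the new length. */
size_t str_delete(char *s, size_t pos, size_t count)
{
    size_t len = strlen(s);

    if (pos >= len)
        return len;                    /* nothing to delete */
    if (count > len - pos)
        count = len - pos;             /* clamp to the end of the string */

    /* Source and destination overlap, so memcpy would be undefined here.
     * The +1 moves the terminator along with the tail. */
    memmove(s + pos, s + pos + count, len - pos - count + 1);
    return len - count;
}
```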
Check for errors as soon as possible. Undetected buffer overruns can lead to "ricochet" errors that are notoriously difficult to locate.
Write tests for every function to ensure it satisfies its contract. Be sure to cover the edge cases (off by 1, null/empty strings, source/destination overlap, etc.). This may sound obvious, but be sure you understand how to create and detect a buffer underrun/overrun, then write tests that explicitly generate and check for those problems. (My QA folks are probably sick of hearing me say, "Don't just test to make sure it works; test to make sure it doesn't break.")
Here are some techniques that have worked for me:
Create wrappers for your memory-management routines that allocate "fence bytes" on either end of your buffers during allocation and check them upon deallocation. You can also verify them within your string handlers, perhaps when a STR_DEBUG macro is set. Caveat: you'll need to test your diagnostics thoroughly, lest they create additional points of failure.
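Here is a rough sketch of the fence-byte idea for character buffers. The names fence_malloc, fence_check and fence_free, the fence size, and the fill pattern are all assumptions for illustration; the layout ignores alignment (fine for char data only) and does not guard against size overflow.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FENCE_SIZE 8      /* hypothetical guard size on each side */
#define FENCE_BYTE 0xA5   /* hypothetical fill pattern */

/* Layout: [size_t size][front fence][user data][rear fence] */
void *fence_malloc(size_t size)
{
    unsigned char *raw = malloc(sizeof(size_t) + 2 * FENCE_SIZE + size);
    if (raw == NULL)
        return NULL;
    memcpy(raw, &size, sizeof size);                       /* remember size */
    memset(raw + sizeof(size_t), FENCE_BYTE, FENCE_SIZE);  /* front fence */
    memset(raw + sizeof(size_t) + FENCE_SIZE + size, FENCE_BYTE, FENCE_SIZE);
    return raw + sizeof(size_t) + FENCE_SIZE;
}

/* Returns 1 if both fences are intact, 0 if either was overwritten. */
int fence_check(const void *ptr)
{
    const unsigned char *user = ptr;
    const unsigned char *front = user - FENCE_SIZE;
    size_t size;

    memcpy(&size, front - sizeof(size_t), sizeof size);
    const unsigned char *rear = user + size;
    for (size_t i = 0; i < FENCE_SIZE; i++) {
        if (front[i] != FENCE_BYTE || rear[i] != FENCE_BYTE)
            return 0;
    }
    return 1;
}

void fence_free(void *ptr)
{
    if (ptr == NULL)
        return;
    assert(fence_check(ptr));          /* catch overruns at deallocation */
    free((unsigned char *)ptr - FENCE_SIZE - sizeof(size_t));
}
```

String handlers could also call fence_check() on their buffers when a STR_DEBUG macro is defined, as suggested above.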
Create a data structure that encapsulates both the buffer and its length. (It can also contain the fence bytes if you use them.) Caveat: you now have a non-standard data structure that your entire code base must manage, which may mean a substantial re-write (and therefore additional points of failure).
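A minimal version of such a structure might look like the following. StrBuf, strbuf_init and strbuf_append are hypothetical names, and the choice to reject an append that does not fit (rather than truncate) is just one possible contract.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    char  *data;   /* storage; null-terminated whenever cap > 0 */
    size_t cap;    /* total buffer size in bytes */
    size_t len;    /* current string length, excluding the terminator */
} StrBuf;

/* Attach caller-provided storage and start with an empty string. */
void strbuf_init(StrBuf *sb, char *storage, size_t cap)
{
    sb->data = storage;
    sb->cap = cap;
    sb->len = 0;
    if (cap != 0)
        storage[0] = '\0';
}

/* Append src. Returns 0 on success, -1 if it would not fit, in which
 * case the buffer is left unchanged. */
int strbuf_append(StrBuf *sb, const char *src)
{
    size_t n = strlen(src);

    if (sb->cap == 0 || n > sb->cap - 1 - sb->len)
        return -1;
    memcpy(sb->data + sb->len, src, n + 1);   /* copy includes terminator */
    sb->len += n;
    return 0;
}
```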
Make your string handlers validate their inputs. If a function forbids null pointers, check for them explicitly. If it requires a valid string (as strlen() does) and you know the buffer length, check that the buffer contains a null character. In other words, verify any assumptions you might be making about the code or data.
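When the buffer length is known, the check for a terminator can be done with memchr(), which never reads past the given size. The helper names below are hypothetical, and returning (size_t)-1 for invalid input is only one possible error convention.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if buf is non-null and holds a '\0' within buf_size bytes. */
int str_is_terminated(const char *buf, size_t buf_size)
{
    return buf != NULL && memchr(buf, '\0', buf_size) != NULL;
}

/* Bounded length: returns the string length, or (size_t)-1 if buf is null
 * or no terminator is found inside the buffer. */
size_t str_checked_len(const char *buf, size_t buf_size)
{
    if (buf == NULL)
        return (size_t)-1;

    const char *end = memchr(buf, '\0', buf_size);
    return end != NULL ? (size_t)(end - buf) : (size_t)-1;
}
```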
Write your tests first. That will help you understand each function's contract--exactly what it expects from the caller, and what the caller should expect from it. You'll find yourself thinking about the ways you'll use it, the ways it might break, and about the edge cases it must handle.
Thanks so much for asking this question! I wish more developers would think about these issues--especially before they start coding. Good luck, and best wishes for a robust, successful product!
Have a look at strlcpy and strlcat; see the original paper for details.
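These functions are not part of C99 libc, so a single-file project may need its own copies. The sketch below follows the documented BSD semantics as I understand them, under the hypothetical names my_strlcpy and my_strlcat to avoid clashing with a system-provided version; it is not the original BSD source.

```c
#include <stddef.h>
#include <string.h>

/* Copy src into dst (size bytes total), always terminating if size > 0.
 * Returns strlen(src); truncation occurred if the result >= size. */
size_t my_strlcpy(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);

    if (size != 0) {
        size_t n = (len < size - 1) ? len : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}

/* Append src to dst (size bytes total). Returns the length the combined
 * string would have; truncation occurred if the result >= size. */
size_t my_strlcat(char *dst, const char *src, size_t size)
{
    const char *end = memchr(dst, '\0', size);
    size_t dlen = (end != NULL) ? (size_t)(end - dst) : size;

    if (dlen == size)                  /* dst is not terminated within size */
        return size + strlen(src);
    return dlen + my_strlcpy(dst + dlen, src, size - dlen);
}
```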
Two cents:
- Always use the "n" version of the string functions: strncpy, strncmp, (or wcsncpy, wcsncmp etc.)
- Always allocate using the +1 idiom: e.g. char str[MAX_STR_SIZE+1], and then pass MAX_STR_SIZE as the size for the "n" version of the string functions and finish with str[MAX_STR_SIZE] = '\0'; to make sure all strings are properly terminated.
The final step is important since the "n" version of the string functions won't append '\0' after copying if the maximum size was reached.
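Put together, the idiom might look like this. MAX_STR_SIZE is given an arbitrary value and copy_name is a hypothetical helper name.

```c
#include <string.h>

#define MAX_STR_SIZE 8   /* hypothetical maximum, excluding the terminator */

/* Copy src into a buffer of MAX_STR_SIZE + 1 bytes, always terminated. */
void copy_name(char dest[MAX_STR_SIZE + 1], const char *src)
{
    strncpy(dest, src, MAX_STR_SIZE);   /* may leave dest unterminated */
    dest[MAX_STR_SIZE] = '\0';          /* the +1 slot guarantees termination */
}
```

When src is shorter than MAX_STR_SIZE, strncpy pads the remainder with '\0', so the final assignment is harmless in that case too.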
Some important gotchas are:
- In C, there is no relation at all between string length and buffer size. A string always runs up to (and including) the first '\0' character. It is your responsibility as a programmer to make sure this character can be found within the reserved buffer for that string.
- Always explicitly keep track of buffer sizes. The compiler keeps track of array sizes, but that information will be lost to you before you know it.
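A short illustration of how the size information gets lost: sizeof works on a true array, but once the array is passed to a function it is only a pointer, so the size has to travel as an explicit argument. ARRAY_LEN and fill_dashes are hypothetical names for this sketch.

```c
#include <stddef.h>

/* Valid only when applied to a true array, not to a pointer. */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Inside this function buf is just a pointer: sizeof(buf) would give
 * sizeof(char *), not the array size, hence the buf_size parameter.
 * Fills the buffer with '-' and terminates it; returns the length. */
size_t fill_dashes(char *buf, size_t buf_size)
{
    size_t i;

    if (buf_size == 0)
        return 0;
    for (i = 0; i + 1 < buf_size; i++)
        buf[i] = '-';
    buf[i] = '\0';
    return i;
}
```

A caller with char line[16] would pass fill_dashes(line, sizeof line) at the point where the array type is still visible.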
When it comes to time versus space trade-offs, don't forget the standard bit-twiddling techniques.
During my early firmware projects, I used lookup tables to count the set bits in a value in O(1) time.
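A lookup-table bit count along these lines could be sketched as follows. The function names are hypothetical, and the 256-entry byte table is one common choice that trades 256 bytes of memory for four table lookups per 32-bit value.

```c
#include <stdint.h>

static unsigned char bit_count_table[256];

/* Fill the table once at startup: bits(i) = (i & 1) + bits(i / 2). */
void bit_count_init(void)
{
    bit_count_table[0] = 0;
    for (int i = 1; i < 256; i++)
        bit_count_table[i] = (unsigned char)((i & 1) + bit_count_table[i / 2]);
}

/* Count set bits in a 32-bit value with one lookup per byte.
 * bit_count_init() must have been called first. */
unsigned bit_count32(uint32_t v)
{
    return bit_count_table[v & 0xFF]
         + bit_count_table[(v >> 8) & 0xFF]
         + bit_count_table[(v >> 16) & 0xFF]
         + bit_count_table[v >> 24];
}
```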