I'm starting a new project in plain C (C99) that is going to work primarily with text. Because of external project constraints, this code has to be extremely simple and compact, consisting of a single source-code file without external dependencies or libraries except for libc and similar ubiquitous system libraries.
With that understanding, what are some best practices, gotchas, tricks, or other techniques that can help make the string handling of the project more robust and secure?
Without any additional information about what your code is doing, I would recommend designing all your interfaces as functions (call one `foobar`) that write into a caller-supplied buffer `dest` of size `buf_size`, with semantics like `snprintf`:
`dest` points to a buffer of size at least `buf_size`.
If `buf_size` is zero, null/invalid pointers are acceptable for `dest` and nothing will be written.
If `buf_size` is non-zero, `dest` is always null-terminated.
`foobar` returns the length of the full non-truncated output; the output has been truncated if `buf_size` is less than or equal to the return value.
This way, when the caller can easily know the destination buffer size that's required, a sufficiently large buffer can be obtained in advance. If the caller cannot easily know, it can call the function once with either a zero argument for `buf_size`, or with a buffer that's "probably big enough", and only retry if it ran out of space.
You can also make a wrapped version of such calls analogous to the GNU `asprintf` function, but if you want your code to be as flexible as possible I would avoid doing any allocation in the actual string functions. Handling the possibility of failure is always easier at the caller level, and many callers can ensure that failure is never a possibility by using a local buffer or a buffer that was obtained much earlier in the program so that the success or failure of a larger operation is atomic (which greatly simplifies error handling).
Two cents:
The final step is important since the "n" versions of the copy functions (notably `strncpy`) won't append '\0' after copying if the maximum size was reached.
Some thoughts from a long-time embedded developer, most of which elaborate on your requirement for simplicity and are not C-specific:
Decide which string-handling functions you'll need, and keep that set as small as possible to minimize the points of failure.
Follow R.'s suggestion to define a clear interface that is consistent across all string handlers. A strict, small-but-detailed set of rules allows you to use pattern-matching as a debugging tool: you can be suspicious of any code that looks different from the rest.
As Bart van Ingen Schenau noted, track the buffer length independently of the string length. If you'll always be working with text it's safe to use the standard null character to indicate end-of-string, but it's up to you to ensure the text+null will fit in the buffer.
Ensure consistent behavior across all string handlers, particularly where the standard functions are lacking: truncation, null inputs, null-termination, padding, etc.
If you absolutely need to violate any of your rules, create a separate function for that purpose and name it appropriately. In other words, give each function a single unambiguous behavior. So you might use `str_copy_and_pad()` for a function that always pads its target with nulls.
Wherever possible, use safe built-in functions (e.g. `memmove()` per Jonathan Leffler) to do the heavy lifting. But test them to be sure they're doing what you think they're doing!
Check for errors as soon as possible. Undetected buffer overruns can lead to "ricochet" errors that are notoriously difficult to locate.
Write tests for every function to ensure it satisfies its contract. Be sure to cover the edge cases (off by 1, null/empty strings, source/destination overlap, etc.) And this may sound obvious, but be sure you understand how to create and detect a buffer underrun/overrun, then write tests that explicitly generate and check for those problems. (My QA folks are probably sick of hearing my instructions to "don't just test to make sure it works; test to make sure it doesn't break.")
Here are some techniques that have worked for me:
Create wrappers for your memory-management routines that allocate "fence bytes" on either end of your buffers during allocation and check them upon deallocation. You can also verify them within your string handlers, perhaps when a STR_DEBUG macro is set. Caveat: you'll need to test your diagnostics thoroughly, lest they create additional points of failure.
Create a data structure that encapsulates both the buffer and its length. (It can also contain the fence bytes if you use them.) Caveat: you now have a non-standard data structure that your entire code base must manage, which may mean a substantial re-write (and therefore additional points of failure).
Make your string handlers validate their inputs. If a function forbids null pointers, check for them explicitly. If it requires a valid string (like `strlen()` should) and you know the buffer length, check that the buffer contains a null character. In other words, verify any assumptions you might be making about the code or data.
Write your tests first. That will help you understand each function's contract--exactly what it expects from the caller, and what the caller should expect from it. You'll find yourself thinking about the ways you'll use it, the ways it might break, and about the edge cases it must handle.
Thanks so much for asking this question! I wish more developers would think about these issues--especially before they start coding. Good luck, and best wishes for a robust, successful product!
Some important gotchas are:
A string ends at its terminating `'\0'` character. It is your responsibility as a programmer to make sure this character can be found within the reserved buffer for that string.
When it comes to time vs space, don't forget the standard bit-twiddling hacks.
During my early firmware projects, I used lookup tables to count the set bits in a value in O(1) time.
Have a look at `strlcpy` and `strlcat`; see the original paper for details.