I have a situation where my function continuously receives data of various lengths. The data can be anything. I want to find the best way to hunt for a particular string in this data. The solution will somehow require buffering previous data, but I cannot wrap my head around the problem.
Here is an example of the problem:
DATA IN -> [\x00\x00\x01\x23B][][LABLABLABLABLA\x01TO][KEN][BLA\x01]...
If every [...] represents a data chunk and [] represents a data chunk with no items, what is the best way to scan for the string TOKEN?
UPDATE: I realised the question is a bit more complex. The [] are not separators; I just use them to describe the structure of the chunks in the example above. Also, TOKEN is not a static string per se: it is variable length. I think the best way is to read line by line, but then the question is how to read a streaming buffer of variable length into lines.
The simplest way to search for TOKEN is to try to match it at each position in the stream, one position at a time.
So all you need to buffer is a number of bytes from the stream equal to the length of "TOKEN" (5 bytes, or actually 4 will do). At each position try to match "TOKEN", which might require waiting until you have at least 5 bytes read into your buffer. If the match fails, rewind to where you started matching, plus one. Since you never rewind more than the length of the string you're searching for (minus one) that's all the buffer you really need.
The technical issue then is, how to maintain your 5 bytes of buffered data as you read continuously from the stream. One way is a so-called "circular buffer". Another way, especially if the token is small, is to use a larger buffer and whenever you get too near the end, copy the bytes you need to the beginning and start again.
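The circular-buffer idea could be sketched roughly as follows. This is only an illustration under assumptions: the names `ring_search`, `ring_init`, `ring_feed`, `ring_feed_chunk` and the `NEEDLE_MAX` limit are made up for this example, and the token is assumed to be between 1 and `NEEDLE_MAX` bytes long.

```c
#include <stddef.h>
#include <string.h>

#define NEEDLE_MAX 64  /* assumed upper bound on the token length */

/* Hypothetical state: the last `len` bytes of the stream, kept as a ring. */
struct ring_search {
    unsigned char ring[NEEDLE_MAX];
    const unsigned char *needle;
    size_t len;    /* token length, 1..NEEDLE_MAX */
    size_t count;  /* total number of bytes fed so far */
};

static void ring_init(struct ring_search *s, const void *needle, size_t len)
{
    s->needle = needle;
    s->len = len;
    s->count = 0;
}

/* Feed one byte; returns 1 if the last `len` bytes equal the token. */
static int ring_feed(struct ring_search *s, unsigned char byte)
{
    s->ring[s->count % s->len] = byte;  /* overwrite the oldest byte */
    s->count++;
    if (s->count < s->len)
        return 0;                       /* not enough data yet */

    size_t start = s->count % s->len;   /* index of the oldest byte */
    size_t tail = s->len - start;       /* bytes from `start` to ring end */

    /* Compare the two halves of the ring in stream order. */
    return memcmp(s->ring + start, s->needle, tail) == 0 &&
           memcmp(s->ring, s->needle + tail, start) == 0;
}

/* Feed a whole chunk; returns how many matches end inside it. */
static size_t ring_feed_chunk(struct ring_search *s, const void *data, size_t n)
{
    const unsigned char *p = data;
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        hits += (size_t)ring_feed(s, p[i]);
    return hits;
}
```

Because the state lives in the struct rather than in the chunk, a token split across chunks (such as `TO` followed by `KEN`) is still found, and empty chunks are harmless.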
If your function is a callback, called once for each new chunk of data, then you will need to maintain some state from one call to the next to allow for a match that spans two chunks. If you're lucky then your callback API includes a "user data pointer", and you can set that to point to whatever struct you like that includes the buffer. If not, you'll need global or thread-local variables.
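For the callback case, here is a minimal sketch of the "copy the bytes you need to the beginning" variant, with the state reached through a user-data pointer. The callback signature `on_chunk(data, n, user)`, the struct `carry_search` and the `NEEDLE_MAX` limit are assumptions for illustration, not part of any particular API.

```c
#include <stddef.h>
#include <string.h>

#define NEEDLE_MAX 64  /* assumed upper bound on the token length */

/* Hypothetical state that survives from one callback to the next. */
struct carry_search {
    const unsigned char *needle;
    size_t len;                           /* token length, 1..NEEDLE_MAX */
    unsigned char carry[NEEDLE_MAX - 1];  /* tail of the earlier chunks */
    size_t carry_len;                     /* always less than len */
    size_t matches;                       /* running match count */
};

/* Hypothetical callback: `user` points at a struct carry_search. */
static void on_chunk(const void *data, size_t n, void *user)
{
    struct carry_search *s = user;
    const unsigned char *p = data;
    unsigned char window[2 * NEEDLE_MAX];
    size_t keep = s->len - 1;             /* most bytes worth carrying */

    if (n == 0)
        return;                           /* empty chunk: nothing changes */

    /* 1. Matches that start in the carried tail and end in this chunk. */
    size_t head = n < keep ? n : keep;
    memcpy(window, s->carry, s->carry_len);
    memcpy(window + s->carry_len, p, head);
    size_t wlen = s->carry_len + head;
    for (size_t i = 0; i < s->carry_len && i + s->len <= wlen; i++)
        if (memcmp(window + i, s->needle, s->len) == 0)
            s->matches++;

    /* 2. Matches that lie entirely inside this chunk. */
    for (size_t i = 0; i + s->len <= n; i++)
        if (memcmp(p + i, s->needle, s->len) == 0)
            s->matches++;

    /* 3. Keep the last len-1 bytes of (carry + chunk) for the next call. */
    if (n >= keep) {
        memcpy(s->carry, p + n - keep, keep);
        s->carry_len = keep;
    } else {
        size_t total = s->carry_len + n;
        size_t drop = total > keep ? total - keep : 0;
        memmove(s->carry, s->carry + drop, s->carry_len - drop);
        memcpy(s->carry + s->carry_len - drop, p, n);
        s->carry_len = total - drop;
    }
}
```

Initialise the struct with the token and zeroes for everything else (for example `struct carry_search s = { .needle = ..., .len = ... };`) and pass its address as the user-data pointer.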
If the stream has a high data rate then you might want to think about speeding things up, with the KMP algorithm or otherwise.
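A streaming KMP matcher might look something like the sketch below. The names `kmp_search`, `kmp_init`, `kmp_feed` and the `NEEDLE_MAX` limit are assumptions for this example. The only state carried between chunks is the number of token bytes currently matched, so no stream data needs to be buffered at all.

```c
#include <stddef.h>

#define NEEDLE_MAX 64  /* assumed upper bound on the token length */

struct kmp_search {
    const unsigned char *needle;
    size_t len;               /* token length, 1..NEEDLE_MAX */
    size_t fail[NEEDLE_MAX];  /* fail[i]: length of the longest proper
                                 prefix of needle[0..i] that is also a
                                 suffix of it */
    size_t matched;           /* token bytes matched so far */
};

/* Build the failure table for the token. */
static void kmp_init(struct kmp_search *s, const void *needle, size_t len)
{
    s->needle = needle;
    s->len = len;
    s->matched = 0;
    s->fail[0] = 0;
    for (size_t i = 1, k = 0; i < len; i++) {
        while (k > 0 && s->needle[i] != s->needle[k])
            k = s->fail[k - 1];
        if (s->needle[i] == s->needle[k])
            k++;
        s->fail[i] = k;
    }
}

/* Feed one chunk; returns how many matches end inside it. */
static size_t kmp_feed(struct kmp_search *s, const void *data, size_t n)
{
    const unsigned char *p = data;
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        /* On a mismatch, fall back to the longest prefix that still fits. */
        while (s->matched > 0 && p[i] != s->needle[s->matched])
            s->matched = s->fail[s->matched - 1];
        if (p[i] == s->needle[s->matched])
            s->matched++;
        if (s->matched == s->len) {
            hits++;
            s->matched = s->fail[s->len - 1];  /* allow overlapping matches */
        }
    }
    return hits;
}
```

Each input byte is examined once (amortised), regardless of how the stream happens to be split into chunks.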
If the needle is contained within memory, it could be assumed that you can allocate an equally sized object to read into, e.g.
char input_array[needle_size];
To start the search process, fill that object with bytes from your file, e.g.
size_t sz = fread(input_array, 1, needle_size, input_file);
and attempt a match, e.g.
if (sz == needle_size && memcmp(input_array, needle, needle_size) == 0) { /* matched */ }
If the match fails, or you want to continue searching after a successful match, advance the position forward by one byte, e.g.
memmove(input_array, input_array + 1, needle_size - 1); input_array[needle_size - 1] = fgetc(input_file);
and try again.
A concern was raised in the comments that this idea copies too many bytes around. While I don't believe this concern has significant merit (there is no evidence of a significant cost), the copies can be avoided by using a circular array: we insert new characters at
pos % needle_size
and compare the regions before and after that boundary as though they are the tail and head respectively.
Sorry, I voted to delete my previous answer as my understanding of the question was not correct. I didn't read carefully enough and thought that the [] were token delimiters.
For your problem I'd recommend building a small state machine based on a simple counter. For every character, you do something like the following pseudocode:
This takes a minimum of processor cycles and also a minimum of memory, as you don't need to buffer anything except the chunk just received.
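In C, a counter-based state machine of this kind might look like the sketch below. The names `counter_search` and `counter_feed` are made up for illustration. Note the assumption: a plain "reset to zero on mismatch" counter is only reliable when the first byte of the token does not occur again later in the token (true for TOKEN, not for something like AAB); for other tokens a fallback table as in KMP is needed.

```c
#include <stddef.h>

/* Hypothetical state: just a pointer to the token and a counter. */
struct counter_search {
    const unsigned char *needle;
    size_t len;  /* token length, at least 1 */
    size_t pos;  /* number of token bytes matched so far */
};

/*
 * Feed one chunk; returns how many matches end inside it.
 * Assumes the token's first byte does not repeat inside the token.
 */
static size_t counter_feed(struct counter_search *s, const void *data, size_t n)
{
    const unsigned char *p = data;
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (p[i] != s->needle[s->pos])
            s->pos = 0;          /* mismatch: start over */
        if (p[i] == s->needle[s->pos])
            s->pos++;            /* this byte continues (or starts) a match */
        if (s->pos == s->len) {
            hits++;              /* whole token seen */
            s->pos = 0;
        }
    }
    return hits;
}
```

The counter survives between calls, so a token split over several chunks is still detected without copying any stream data.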