Best approach to continuously scan for a string in

My function continuously receives chunks of data of varying length. The data can be anything. I want to find the best way to hunt for a particular string in this data. The solution will somehow require buffering previous data, but I cannot wrap my head around the problem.

Here is an example of the problem:

DATA IN -> [\x00\x00\x01\x23B][][LABLABLABLABLA\x01TO][KEN][BLA\x01]...

If every [...] represents a data chunk and [] represents a chunk with no items, what is the best way to scan for the string TOKEN?

UPDATE: I realised the question is a bit more complex. The [] are not separators; I just use them to show the chunk structure in the example above. Also, TOKEN is not a static string per se; it is variable length. I think the best way is to read line by line, but then the question is how to read a streaming buffer of variable length into lines.

Tags: c callback stream buffer

3 answers

趁早两清

Post #2 · 2019-08-01 22:58

The simplest way to search for TOKEN is:

try to match "TOKEN" starting from position 0 in the stream
try to match "TOKEN" starting from position 1 in the stream
etc

So all you need to buffer is a number of bytes from the stream equal to the length of "TOKEN" (5 bytes, or actually 4 will do). At each position try to match "TOKEN", which might require waiting until you have at least 5 bytes read into your buffer. If the match fails, rewind to where you started matching, plus one. Since you never rewind more than the length of the string you're searching for (minus one) that's all the buffer you really need.

The technical issue then is, how to maintain your 5 bytes of buffered data as you read continuously from the stream. One way is a so-called "circular buffer". Another way, especially if the token is small, is to use a larger buffer and whenever you get too near the end, copy the bytes you need to the beginning and start again.
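The circular-buffer option could be sketched roughly as follows. This is only an illustration under assumptions: the names ring_matcher, ring_init and ring_feed are made up for this example, and the token is assumed to be at most RING_MAX bytes. Each incoming byte overwrites the oldest slot, and the window is then compared against the token in two pieces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_MAX 64  /* assumed upper bound on token length */

struct ring_matcher {
    const unsigned char *token;
    size_t len;                    /* token length, 1..RING_MAX */
    unsigned char ring[RING_MAX];  /* holds the last `len` bytes seen */
    size_t count;                  /* total number of bytes fed so far */
};

static void ring_init(struct ring_matcher *m, const void *token, size_t len) {
    assert(len > 0 && len <= RING_MAX);
    m->token = token;
    m->len = len;
    m->count = 0;
}

/* Feed one byte; returns 1 if the last `len` bytes equal the token. */
static int ring_feed(struct ring_matcher *m, unsigned char byte) {
    m->ring[m->count % m->len] = byte;  /* overwrite the oldest slot */
    m->count++;
    if (m->count < m->len)
        return 0;                       /* not enough data yet */

    size_t start = m->count % m->len;   /* slot of the oldest byte */
    /* Oldest bytes run from `start` to the end of the ring;
       the newest bytes have wrapped around to the front. */
    return memcmp(m->ring + start, m->token, m->len - start) == 0
        && memcmp(m->ring, m->token + (m->len - start), start) == 0;
}
```

Calling ring_feed for every byte of every chunk, in arrival order, makes chunk boundaries irrelevant: an empty chunk simply feeds nothing.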

If your function is a callback, called once for each new chunk of data, then you will need to maintain some state from one call to the next to allow for a match that spans two chunks. If you're lucky then your callback API includes a "user data pointer", and you can set that to point to whatever struct you like that includes the buffer. If not, you'll need global or thread-local variables.
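As a rough sketch of keeping state between calls, assume the callback has the shape on_data(data, len, user_data); the real API is not shown in the question, so that signature, the struct scan_state, and the names scan_init, count_in and TOKEN_MAX are all hypothetical. The state keeps the last token_len - 1 bytes seen, so a match that straddles two chunks can still be found.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOKEN_MAX 64  /* assumed upper bound on token length */

struct scan_state {
    const char *token;
    size_t token_len;
    char carry[TOKEN_MAX];  /* tail of the data seen so far */
    size_t carry_len;       /* at most token_len - 1 */
    unsigned long matches;
};

static void scan_init(struct scan_state *st, const char *token, size_t len) {
    assert(len > 0 && len <= TOKEN_MAX);
    st->token = token;
    st->token_len = len;
    st->carry_len = 0;
    st->matches = 0;
}

/* Count occurrences of needle in hay (naive, binary-safe search). */
static unsigned long count_in(const char *hay, size_t hay_len,
                              const char *needle, size_t n) {
    unsigned long hits = 0;
    for (size_t i = 0; i + n <= hay_len; i++)
        if (memcmp(hay + i, needle, n) == 0)
            hits++;
    return hits;
}

/* Hypothetical callback: invoked once per chunk with the user data pointer. */
static void on_data(const void *data, size_t len, void *user_data) {
    struct scan_state *st = user_data;
    const char *chunk = data;
    size_t keep = st->token_len - 1;  /* longest possible partial match */
    if (len == 0)
        return;                       /* empty chunk: nothing changes */

    /* 1. Matches straddling the boundary: carry + head of the new chunk. */
    char joined[2 * TOKEN_MAX];
    size_t head = len < keep ? len : keep;
    memcpy(joined, st->carry, st->carry_len);
    memcpy(joined + st->carry_len, chunk, head);
    st->matches += count_in(joined, st->carry_len + head,
                            st->token, st->token_len);

    /* 2. Matches lying entirely inside the new chunk. */
    st->matches += count_in(chunk, len, st->token, st->token_len);

    /* 3. Remember the last `keep` bytes for the next call. */
    if (len >= keep) {
        memcpy(st->carry, chunk + len - keep, keep);
        st->carry_len = keep;
    } else {
        /* Short chunk: `joined` is carry + whole chunk; keep its tail. */
        size_t total = st->carry_len + len;
        size_t drop = total > keep ? total - keep : 0;
        memcpy(st->carry, joined + drop, total - drop);
        st->carry_len = total - drop;
    }
}
```

Because the carried tail and the chunk head are each shorter than the token, step 1 can only find matches that cross the boundary, so nothing is counted twice.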

If the stream has a high data rate then you might want to think about speeding things up, with the KMP algorithm or otherwise.
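For reference, a streaming form of KMP might look like the sketch below. It is a minimal illustration, not a tuned implementation; the names kmp, kmp_init, kmp_feed and the KMP_MAX limit are assumptions made for this example. The only state carried between bytes (and therefore between chunks) is pos, the number of token bytes currently matched, so no input needs to be buffered at all.

```c
#include <assert.h>
#include <stddef.h>

#define KMP_MAX 64  /* assumed upper bound on token length */

struct kmp {
    const char *token;
    size_t len;
    /* fail[i]: length of the longest proper prefix of token[0..i]
       that is also a suffix of it */
    size_t fail[KMP_MAX];
    size_t pos;  /* number of token bytes currently matched */
};

static void kmp_init(struct kmp *k, const char *token, size_t len) {
    assert(len > 0 && len <= KMP_MAX);
    k->token = token;
    k->len = len;
    k->pos = 0;
    k->fail[0] = 0;
    for (size_t i = 1, j = 0; i < len; i++) {
        while (j > 0 && token[i] != token[j])
            j = k->fail[j - 1];
        if (token[i] == token[j])
            j++;
        k->fail[i] = j;
    }
}

/* Feed one byte; returns 1 when it completes an occurrence of the token. */
static int kmp_feed(struct kmp *k, char c) {
    /* On a mismatch, fall back to the longest shorter partial match
       instead of restarting from zero. */
    while (k->pos > 0 && c != k->token[k->pos])
        k->pos = k->fail[k->pos - 1];
    if (c == k->token[k->pos])
        k->pos++;
    if (k->pos == k->len) {
        k->pos = k->fail[k->len - 1];  /* allows overlapping matches */
        return 1;
    }
    return 0;
}
```

Each byte is examined once and the fallback loop is amortised constant time, which is where the speed-up over re-matching at every position comes from.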


闹够了就滚

Post #3 · 2019-08-01 23:13

If the needle is contained within memory, it could be assumed that you can allocate an equally-sized object to read into (e.g. char input_array[needle_size];).

To start the search process, fill that object with bytes from your file (e.g. size_t sz = fread(input_array, 1, needle_size, input_file);) and attempt a match (e.g. if (sz == needle_size && memcmp(input_array, needle, needle_size) == 0) { /* matched */ }).

If the match fails or you want to continue searching after a successful match, advance the position forward by one byte (e.g. memmove(input_array, input_array + 1, needle_size - 1); input_array[needle_size - 1] = fgetc(input_file);, after checking the result of fgetc for EOF) and try again.

A concern was raised in the comments that this idea copies too many bytes around. While I don't believe this concern has significant merit (no evidence was given that the cost matters), the copies can be avoided by using a circular array: we insert new characters at pos % needle_size and compare the regions before and after that boundary as though they are the tail and head respectively. For example:

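#include <stdio.h>
#include <string.h>

void find_match(FILE *input_file, char const *needle, size_t needle_size) {
    if (needle_size == 0) {
        // An empty needle would make the array size and the modulus below invalid
        return;
    }

    // Note: setvbuf() may only be called before the first operation on a
    // stream, so any buffering setup belongs in the caller, not after fread().
    char input_array[needle_size];
    size_t sz = fread(input_array, 1, needle_size, input_file);
    if (sz != needle_size) {
        // No matches possible
        return;
    }

    unsigned long long pos = 0;
    for (;;) {
        // cursor is the slot holding the oldest byte of the current window
        size_t cursor = pos % needle_size;
        // Slots before cursor hold the newest bytes (the tail of the window);
        // slots from cursor onward hold the oldest bytes (the head)
        int tail_compare = memcmp(input_array, needle + needle_size - cursor, cursor),
            head_compare = memcmp(input_array + cursor, needle, needle_size - cursor);
        if (head_compare == 0 && tail_compare == 0) {
            printf("Match found at offset %llu\n", pos);
        }
        int c = fgetc(input_file);
        if (c == EOF) {
            break;
        }
        // Overwrite the oldest byte with the new one
        input_array[cursor] = (char)c;
        pos++;
    }
}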


贪生不怕死

Post #4 · 2019-08-01 23:17

Sorry, I voted to delete my previous answer as my understanding of the question was not correct. I didn't read carefully enough and thought that the [] were token delimiters.

For your problem I'd recommend building a small state machine based on a simple counter. For every character, you do something like the following pseudocode:

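if (received_character == token[pos]) {
    ++pos;
    if (pos >= token_length) {
        token_received = 1;
        pos = 0; // Start over so token[pos] stays in bounds
    }
}
else {
    // Start over, but let the current character begin a new match.
    // Note: a token whose prefix repeats inside it (e.g. "AAB" in "AAAB")
    // can still be missed by this simple reset.
    pos = (received_character == token[0]) ? 1 : 0;
}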

This takes a minimum of processor cycles and also a minimum of memory, as you don't need to buffer anything except the chunk just received.
