I have a file that's 21056 bytes.
I've written a program in C that reads the entire file into a buffer, and then uses multiple search algorithms to search the file for a token that's 82 chars.
I've used all the implementations of the algorithms from the “Exact String Matching Algorithms” page. I've used: KMP, BM, TBM, and Horspool. And then I used strstr
and benchmarked each one.
What I'm wondering is, each time the strstr
outperforms all the other algorithms. The only one that is faster sometimes is BM.
Shouldn't strstr
be the slowest?
Here's my benchmark code with an example of benchmarking BM:
double get_time()
{
LARGE_INTEGER t, f;
QueryPerformanceCounter(&t);
QueryPerformanceFrequency(&f);
return (double)t.QuadPart/(double)f.QuadPart;
}
before = get_time();
BM(token, strlen(token), buffer, len);
after = get_time();
printf("Time: %f\n\n", after - before);
Could someone explain to me why strstr
is outperforming the other search algorithms? I'll post more code on request if needed.
Imagine you want to get something cleaned. You could just clean it yourself, or you could hire ten professional cleaners to clean it. If the cleaning job is an office building, the latter solution would be preferable. If the cleaning job was one window, the former would be preferable.
You never get any payback for the time spent setting up to do the job efficiently because the job doesn't take very long.
Horspool, KMP et al are optimal at minimizing the number of byte-comparisons.
However, that's not the bottleneck on a modern processor. On an x86/64 processor, your string is being loaded into L1 cache in cache-line-width chunks (typically 64 bytes). No matter how clever your algorithm is, unless it gives you strides that are larger than that, you gain nothing; and the more complicated Horspool code is (at least one table lookup) can't compete.
Furthermore, you're stuck with the "C" string constraint of null-termination: SOMEWHERE the code has to examine every byte.
strstr()
is expected to be optimal for a wide range of cases; e.g. searching for tiny strings like"\r\n"
in a short string, as well as much longer ones where some smarter algorithm might have a hope. The basic strchr/memcmp loop is pretty hard to beat across the whole range of likely inputs.Pretty much all x86-compatible processors since 2003 have supported SSE2. If you disassembled
strlen()
/x86 for glibc, you may have noticed that it uses some SSE2 PCMPEQ and MOVMASK operations to search for the null terminator 16 bytes at a time. The solution is so efficient that it beats the obvious super-simple loop, for anything longer than the empty string.I took that idea and came up with a
strstr()
that beats glibc'sstrstr()
for all cases greater than 1 byte --- where the relative difference is pretty much moot. If you're interested, check out:Convergence SSE2 and
strstr()
A better
strstr()
with no ASM codeIf you care to see a non-SSE2 solution that dominates
strstr()
for target strings of more than 15 bytes, check out:which makes use of multibyte comparisons rather than
strchr()
, to find a point at which to do a memcmp.BTW you've probably figured out by now that the x86 REP SCASB/REP CMPSB ops fall on their ass for anything longer than 32 bytes, and are not much improvement for shorter strings. Wish Intel had devoted a little more attention to that, than to adding SSE4.2 "string" ops.
For strings large enough to matter, my perf tests show that BNDM is better than Horspool, across the board. BNDM is more tolerant of "pathological" cases, such as targets that heavily repeat the last byte of a pattern. BNDM can also make use of SSE2 (128bit registers) in a way that competes with 32bit registers in efficiency and start-up cost. Source code here.
Why do you think
strstr
should be slower than all the others? Do you know what algorithmstrstr
uses? I think it's quite likely thatstrstr
uses a fine-tuned, processor-specific, assembly-coded algorithm of theKMP
type or better. In which case you don't stand a chance of out-performing it inC
for such small benchmarks.(The reason I think this is likely is that programmers love to implement such things.)
Without seeing your code, it's hard to say exactly.
strstr
is heavily optimized, and usually written in assembly language. It does things like reading data 4 bytes at a time and comparing them (bit-twiddling if necessary if the alignment isn't right) to minimize memory latency. It can also take advantage of things like SSE to load 16 bytes at a time. If your code is only loading one byte at a time, it's probably getting killed by memory latency.Use your debugger and step through the disassembly of
strstr
-- you'll probably find some interesting things in there.