I've seen second one in another's code and I suppose this length comparison have been done to increase code productivity. It was used in a parser for a script language with a specific dictionary: words are 4 to 24 letters long with the average of 7-8 lettets, alphabet includes 26 latin letters plus "@","$" and "_".
Length comparison were used to escape == operator working with STL strings, which obviously takes more time then simple integer comparison. But in the same time first letter distribution in the given dictionary is simply wider than a distribution of words size, so two first letters of comparing strings will be generally more often different, than the sizes of that strings. That makes length comparison unnecessary.
I've ran some tests and that is what I've found out: While testing two random strings comparison million times, second way is much faster, so length comparison seems to be helpful. But in a working project it works even slower in a debug mode and insufficiantly faster in a release mode.
So, my question is: why length comparison can fasten the comparison and why can it slow it down?
UPD: I don't like that second way either, but it had been done for a reason, I suppose, and I wonder, what is this reason.
UPD2: Seriously, the question is not how to do best. I'm not even using STL strings in this case anymore. There's no wonder that length comparison is unnecessary and wrong etc. The wonder is - it really tends to work slightly better in one certain test. How is this possible?
If it mattered, assume that your library did it already. Don't mess up your code this way for micro-optimisations unless it really matters.
length comparison doesn't make any sense to me .. using the comparison operator is enough
In your random test the strings might have been long enough to show the gain while in your real case you may deal with shorter strings and the constant factor of two comparison is not offset by any gain in not performing the string comparison part of the test.
When can Short Circuiting be beneficial
Short circuit optimizations can be helpful only when:
Mathematically, let S be cost of Short Circuiting condition, F cost of full condition, and P be percent of cases where Short Circuiting happens (full condition is not necessary).
The average cost of original case (no Short Circuiting) is F
The average cost of Short Circuiting optimization is S + F * (1-P)
Therefore, if the optimization is to have any benefit at all, following must apply:
S + F * (1-P) < F
i.e.
S < F*P
String comparison cost
Further you wrote:
which obviously takes more time then simple integer comparison.
This is not obvious at all. The string comparison terminates when first difference is found, therefore depending on what strings you process, it may terminate on first or second character in vast majority of cases. Moreover, the comparison may be optimized even for longer strings by first comparing DWORDS (4 characters at once) as long as there is enough data in both strings.
Your case
The major difference between random test data and scripting parsing is the real data are far from random. The parser is most likely deterministic, and once it matches, it does not compare any more. Even the script data are not random - some keywords are likely to be used a lot more than other ones. If the parser is constructed in such a way it checks most commonly used keyword first, a surprisingly high number of comparisons may need the full compare to be done, as full compare always needs to be done when string are matching.
The implementation of the std::string operator== has no way of knowing whether it would be faster to check the length first or start checking characters. Clearly checking the length is a waste for strings of the same length. Therefore, different implementations of STL are likely to perform differently.
Only put the explicit length check in as a final optimisation (clearly commented as such), and only if your profiler confirms the benefit.
The length of the std::string is quite likely a member of the std::string object. In comparison, the first character might very well be on the heap. That means that comparing the string length improves locality of reference. Of course, with the short-string optimization this becomes even more complex -
Lhs[0]
might be on the heap whileRhs[0]
is on the stack.