Given a set of strings, say "String1", "String2",..., "StringN"
, what is the most efficient way in C++ to determine (return true
or false
) whether given string s
matches any of the strings in the above set?
Can Boost.Regex be used for this task?
As an alternative you can define an ordered array of character arrays or strings and use standard algorithms std::binary_search, std::lower_bound, std::upper_bound or std::equal_range to check whether the target string is present in the array.
I will use the example already shown here.
std::unordered_set
would provide the most efficient look-up (amortized constant time).In some situations in which you frequently need to perform these checks, a great performance improvement can be obtained from Interning.
Interning still requires us to have some string lookup data structure, such as a tree or hash table. However, we do these heavy lookups more rarely: specificaly, we do them only whenever some raw textual input arrives from the environment into our software system.
At that time, we take the input text as a character string and intern it: look it up in the existing set of strings, and convert it to an atom. An atom is small unit of data, typically a single-word quantity such as a machine pointer. If the string doesn't exist, the interning function gives us a new, unique atom. Otherwise, it gives us the same atom that it gave us previously for that string.
Once we have interned input strings to atoms, we always use the atoms in their place. So instead of comparing whether two strings are the same, we compare whether two atoms are the same, which is a "blindingly fast" single-word comparison operation. (We still use the the strings when we need to print atoms in a readable way: again at the boundary between our system and the outside world).
Interning comes from Lisp: in Lisp symbols are atoms. In the raw source code, they are textual, and so when code is read into memory, symbol names are interned to produce atoms which are basically pointers to symbol objects.
Interning crops up in other places such as the X Window system (
XInternAtom
function):and in the Microsoft Windows API where the term "intern" is not used, but the function returns something called an
ATOM
: Lisp terminology. What is interned is not a simple string but a "class" strucure:In both systems, these atoms are system-wide (server-wide in the case of X) and can be compared for equality in place of the objects they represent. In Windows if you have two
ATOM
values which are equal, they are the same class.The Flyweight Design Pattern from the GoF book is essentially a reinvention of interning, extended to structures other than strings (like
WNDCLASS
in the above Win32 API); so if you want to "sell" the idea to your boss, you can take it from that angle.You can put all your strings in a std::set and then check if that string is in the set.
Assuming the strings are not entirely random and my have prefixes in common, it may be more efficient to use a Trie: initially building the data structure may be more expensive than creating other containers but if there are many queries made against the set of strings this could pay off. The main disadvantage is that there is no trie implementation in the standard C++ library.
An alternative is to construct a N-ary tree to store all strings.
Node look like:
So, with
"String1"
,"String10"
,"String2"
,"StringN"
, Tree is:Once you have your tree, look into it to see if the string matches.
Search Complexity: size of the string to search.