I have a string s of length n. What is the most efficient data structure / algorithm to use for finding the most frequent character in the range i..j?

The string doesn't change over time; I just need to repeat queries that ask for the most frequent character among s[i], s[i + 1], ..., s[j].
The fastest would be to use an unordered_map or similar to tally character frequencies in a single pass. The lightest, memory-wise, would require a non-constant input that could be sorted, so that find_first_not_of or similar could be used to measure each run of equal characters.

It should be noted that both of these functions have the precondition of a non-empty string. Also, if there is a tie for the most frequent character, both functions will return the character that is lexicographically first.
The most time-efficient algorithm, as has been suggested, is to store the frequencies of each character in an array. Note, however, that if you simply index the array with the characters, you may invoke undefined behaviour. Namely, if you are processing text that contains code points outside of the range 0x00-0x7F, such as text encoded with UTF-8, you may end up with a segmentation violation at best, and stack data corruption at worst.
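The snippet illustrating the problem is missing; a sketch of the broken pattern being described (my illustration, not the original code) would be:

```cpp
#include <cstddef>
#include <string>

// BROKEN: on platforms where plain char is signed, bytes >= 0x80 -- which
// appear in every UTF-8 multi-byte sequence -- become *negative* indices,
// so counts[c] reads and writes out of bounds: undefined behaviour.
char most_frequent_broken(const std::string& s) {
    std::size_t counts[128] = {};      // only covers 0x00-0x7F
    for (char c : s)
        ++counts[c];                   // UB whenever c < 0

    char best = 0;
    std::size_t best_count = 0;
    for (int c = 0; c < 128; ++c)
        if (counts[c] > best_count) {
            best = static_cast<char>(c);
            best_count = counts[c];
        }
    return best;
}
```

This appears to work on pure ASCII input, which is what makes the bug easy to miss.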
A solution that properly accounts for this would look something like the following:
If you want to iterate over the same string multiple times, modify the above algorithm (as construct_array) to use a std::array<size_t, numeric_limits<charT>::max() - numeric_limits<charT>::lowest() + 1>. Then return that array instead of the max character after the first for loop, and omit the part of the algorithm that finds the most frequent character. Construct a std::map<std::string, std::array<...>> in your top-level code and store the returned array in that. Then move the code for finding the most frequent character into that top-level code and use the cached count array.

Now, this only works for whole strings. If you want to process relatively small substrings repeatedly, you should use the first version instead. Otherwise, I'd say that your best bet is probably to do something like the second solution, but partitioning the string into manageable chunks; that way, you can fetch most of the information from your cache, only having to recalculate the frequencies in the chunks in which your iterators reside.