Most frequent character in range

Published 2019-06-22 07:50

Question:

I have a string s of length n. What is the most efficient data structure / algorithm to use for finding the most frequent character in range i..j?

The string doesn't change over time, I just need to repeat queries that ask for the most frequent char among s[i], s[i + 1], ... , s[j].

Answer 1:

An array in which you hold the number of occurrences of each character. You increase the respective value while iterating through the string once. While doing this, you can remember the current maximum in the array; alternatively, look for the highest value in the array at the end.

Pseudocode

arr = array of zeros, one entry per character
for ( char in string )
   arr[char]++
mostFrequent = highest(arr)
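As a quick sketch of this idea in Python (note the original question asks about a range i..j, so this version counts only over that slice; `Counter` is from the standard library):

```python
from collections import Counter

def most_frequent_in_range(s, i, j):
    # One pass over the inclusive range i..j: count every character,
    # then take the one with the highest count. Counter breaks ties
    # by first occurrence within the slice.
    counts = Counter(s[i:j + 1])
    return counts.most_common(1)[0][0]

print(most_frequent_in_range("abcdabc", 0, 6))  # prints "a"
```

This is O(j - i) per query, which is fine for occasional queries but not for many repeated ones.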


Answer 2:

Do a single pass over the string and, for each position, remember how many occurrences of each character there are up to that position. So something like this:

"abcdabc"

for index 0:

count['a'] = 1
count['b'] = 0
etc...

for index 1:

....
count['a'] = 1
count['b'] = 1
count['c'] = 0
etc...

for index 2:

....
count['a'] = 1
count['b'] = 1
count['c'] = 1
....

And so on. For index 6:

....
count['a'] = 2
count['b'] = 2
count['c'] = 2
count['d'] = 1
... all others are 0

After you compute this array you can get the number of occurrences of a given letter in an interval (i, j) in constant time: simply compute count[j] - count[i-1] (be careful with i = 0!).

So for each query you will have to iterate over all letters, not over all characters in the interval; thus instead of iterating over 10^6 characters you will only pass over at most 128 entries (assuming you only have ASCII symbols).

A drawback: you need more memory - one counter per alphabet letter at each position, so memory grows with both the string length and the alphabet size.
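A minimal Python sketch of this prefix-count scheme, assuming ASCII input (the function names are my own):

```python
def build_prefix_counts(s):
    # prefix[k][c] holds the number of occurrences of chr(c) in s[0:k],
    # so prefix has len(s) + 1 rows of 128 counters each.
    prefix = [[0] * 128]
    for ch in s:
        row = prefix[-1][:]   # copy the previous row...
        row[ord(ch)] += 1     # ...and bump this character's count
        prefix.append(row)
    return prefix

def most_frequent_in_range(prefix, i, j):
    # Counts for the inclusive range i..j are prefix[j+1] - prefix[i];
    # the query loops over the 128 letters, not over the interval.
    best_char, best_count = None, 0
    for c in range(128):
        n = prefix[j + 1][c] - prefix[i][c]
        if n > best_count:
            best_char, best_count = chr(c), n
    return best_char

prefix = build_prefix_counts("abcdabc")
print(most_frequent_in_range(prefix, 0, 6))  # prints "a"
```

Build time and memory are O(n * 128); each query is O(128) regardless of the interval length.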



Answer 3:

If you wish to get efficient results on intervals, you can build an integral (cumulative) distribution vector at each index of your sequence. Then by subtracting the integral distributions at two indices you obtain the distribution over the characters between them.

Some Python code follows. Note that getIntervalDistribution(IDs, i, j) below covers the half-open interval from s[i] up to but not including s[j], matching Python's slice s[i:j]. I assume your characters are single-byte chars, hence 256 distribution entries.

def buildIntegralDistributions(s):
    IDs=[]        # integral distribution
    D=[0]*256
    IDs.append(D[:])
    for x in s:
        D[ord(x)]+=1
        IDs.append(D[:])
    return IDs

def getIntervalDistribution(IDs, i,j):
    D=[0]*256        
    for k in range(256):
        D[k]=IDs[j][k]-IDs[i][k]
    return D

s='abababbbb'
IDs=buildIntegralDistributions(s)
Dij=getIntervalDistribution(IDs, 2,4)

>>> s[2:4]
'ab'
>>> Dij[ord('a')]  # how many 'a'-s in s[2:4]?
1
>>> Dij[ord('b')]  # how many 'b'-s in s[2:4]?
1


Answer 4:

You need to specify your algorithmic requirements in terms of space and time complexity.

If you insist on O(1) space complexity, just sorting (e.g. using lexicographic ordering of bits if there is no natural comparison operator available) and then scanning for the longest run of equal elements will give you O(N log N) time complexity.

If you insist on O(N) time complexity, use @Luchian Grigore's solution, which also takes O(N) space complexity (well, O(K) for a K-letter alphabet).
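For illustration, the sort-and-scan approach might look like this in Python (a sketch only: `sorted` copies the slice, so this version trades the O(1) space claim for simplicity; sorting a mutable buffer in place would restore it):

```python
def most_frequent_sorted(s, i, j):
    # Sort the characters of the inclusive range i..j, then scan for
    # the longest run of equal characters: O(m log m) for m = j - i + 1.
    chunk = sorted(s[i:j + 1])
    best_char, best_count = None, 0
    run_char, run_len = None, 0
    for ch in chunk:
        if ch == run_char:
            run_len += 1
        else:
            run_char, run_len = ch, 1
        if run_len > best_count:
            best_char, best_count = run_char, run_len
    return best_char

print(most_frequent_sorted("abcdabc", 0, 6))  # prints "a"
```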



Answer 5:

string = "something"
arrCount = new int[string.length()]

after each access of the string, call freq():

freq(char accessedChar){
    arrCount[string.indexOf(accessedChar)] += 1
}

to get the most frequent char, call string.charAt(indexOfMax(arrCount))
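The intent seems to be to accumulate each character's count at the index of its first occurrence. A Python sketch of that trick (hedged: every `index` call is a linear scan, so this is O(n^2) overall):

```python
def most_frequent(string):
    # arr_count[k] accumulates occurrences of string[k] only when k is
    # the index of that character's first occurrence; other slots stay 0.
    arr_count = [0] * len(string)
    for ch in string:
        arr_count[string.index(ch)] += 1
    # the slot with the largest count identifies the character
    return string[arr_count.index(max(arr_count))]

print(most_frequent("mississippi"))  # prints "i"
```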



Answer 6:

Assuming the string is constant, and different i and j will be passed to query occurrences:

If you want to minimize processing time you can make a

struct occurences{
    char c;
    std::list<int> positions;
};

and keep a std::list<occurences>, one entry per character. For fast searching you can keep the positions ordered; two binary searches then give the count of one character within i..j.

And if you want to minimize memory, you can just keep an incrementing counter and loop through i..j.
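The ordered-positions idea can be sketched in Python with the standard `bisect` module (names are my own; two binary searches count one character's occurrences in i..j):

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

def build_positions(s):
    # One sorted list of indices per character (sorted for free,
    # since we scan the string left to right).
    positions = defaultdict(list)
    for idx, ch in enumerate(s):
        positions[ch].append(idx)
    return positions

def count_in_range(positions, ch, i, j):
    # Occurrences of ch in the inclusive range i..j, via two
    # binary searches over its position list.
    pos = positions.get(ch, [])
    return bisect_right(pos, j) - bisect_left(pos, i)

positions = build_positions("abcdabc")
print(count_in_range(positions, "a", 1, 6))  # prints 1
```

Each per-character count costs O(log n); finding the most frequent character still means querying every character that occurs in the string.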



Answer 7:

The most time-efficient algorithm, as has been suggested, is to store the frequencies of each character in an array. Note, however, that if you simply index the array with the characters, you may invoke undefined behaviour. Namely, if you are processing text that contains code points outside of the range 0x00-0x7F, such as text encoded with UTF-8, you may end up with a segmentation violation at best, and stack data corruption at worst:

char frequencies [256] = {};
frequencies ['á'] = 9; // Oops. If our implementation represents char using a
                       // signed eight-bit integer, we just referenced memory
                       // outside of our array bounds!

A solution that properly accounts for this would look something like the following:

#include <limits>
#include <string>

using namespace std;

template <typename charT>
charT most_frequent (const basic_string <charT>& str)
{
    constexpr auto charT_max = numeric_limits <charT>::max ();
    constexpr auto charT_min = numeric_limits <charT>::lowest ();
    size_t frequencies [charT_max - charT_min + 1] = {};

    for (auto c : str)
        ++frequencies [c - charT_min];

    charT most_frequent {};
    size_t count = 0;
    for (charT c = charT_min; c < charT_max; ++c)
        if (frequencies [c - charT_min] > count)
        {
            most_frequent = c;
            count = frequencies [c - charT_min];
        }

    // We have to check charT_max outside of the loop,
    // as otherwise it will probably never terminate
    if (frequencies [charT_max - charT_min] > count)
        return charT_max;

    return most_frequent;
}

If you want to iterate over the same string multiple times, modify the above algorithm (as construct_array) to use a std::array <size_t, numeric_limits <charT>::max () - numeric_limits <charT>::lowest () + 1>. Then return that array instead of the max character after the first for loop and omit the part of the algorithm that finds the most frequent character. Construct a std::map <std::string, std::array <...>> in your top-level code and store the returned array in that. Then move the code for finding the most frequent character into that top-level code and use the cached count array:

char most_frequent (string s)
{
    static map <string, array <...>> cache;

    if (cache.count (s) == 0)
        cache [s] = construct_array (s);
    // find the most frequent character, as above, replacing `frequencies`
    // with cache [s], then return it
}

Now, this only works for whole strings. If you want to process relatively small substrings repeatedly, you should use the first version instead. Otherwise, I'd say that your best bet is probably to do something like the second solution, but partitioning the string into manageable chunks; that way, you can fetch most of the information from your cache, only having to recalculate the frequencies in the chunks in which your iterators reside.
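One way to sketch that chunking idea in Python (the block size and function names are my own invention): precompute per-block counts, then answer a query by adding cached whole blocks and scanning only the partial blocks at the edges.

```python
BLOCK = 4  # illustrative block size; tune to taste

def build_block_counts(s, block=BLOCK):
    # counts[b][c] = occurrences of chr(c) inside block b of the string
    counts = []
    for start in range(0, len(s), block):
        row = [0] * 128
        for ch in s[start:start + block]:
            row[ord(ch)] += 1
        counts.append(row)
    return counts

def most_frequent_in_range(s, counts, i, j, block=BLOCK):
    # Sum cached whole blocks inside i..j; scan partial edge blocks.
    total = [0] * 128
    k = i
    while k <= j:
        if k % block == 0 and k + block - 1 <= j:
            for c in range(128):
                total[c] += counts[k // block][c]
            k += block
        else:
            total[ord(s[k])] += 1
            k += 1
    return chr(max(range(128), key=lambda c: total[c]))

s = "abcdabc"
counts = build_block_counts(s)
print(most_frequent_in_range(s, counts, 0, 6))  # prints "a"
```

A query touches at most 2 * BLOCK edge characters plus (j - i) / BLOCK cached rows, which is the usual sqrt-decomposition trade-off when BLOCK is around sqrt(n).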



Answer 8:

The fastest would be to use an unordered_map or similar:

#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>

using namespace std;

pair<char, int> fast(const string& s) {
    unordered_map<char, int> result;

    for(const auto i : s) ++result[i];
    return *max_element(cbegin(result), cend(result),
                        [](const auto& lhs, const auto& rhs) {
                            return lhs.second < rhs.second;
                        });
}

The lightest, memory-wise, would require a non-constant input which could be sorted, such that find_first_not_of or similar could be used:

pair<char, int> light(string& s) {
    pair<char, int> result;
    int start = 0;

    sort(begin(s), end(s));
    for(auto finish = s.find_first_not_of(s.front()); finish != string::npos;
        start = finish, finish = s.find_first_not_of(s[start], start)) {
        if(const int second = finish - start; second > result.second)
            result = make_pair(s[start], second);
    }
    if(const int second = size(s) - start; second > result.second)
        result = make_pair(s[start], second);
    return result;
}

It should be noted that both of these functions have the precondition of a non-empty string. Also, if there is a tie for the most frequent character, light will return the lexicographically first one (since the string is sorted), while fast's tie-breaking depends on the unordered_map's iteration order, which is unspecified.
