Rolling hash in Rabin-Karp

2019-07-06 19:30发布

问题:

I am trying to implement the Rabin-Karp for finding the substring; and I got stuck at the rolling hash(trying to use the formula suggested in Wikipedia).

#define MOD 1000000007
unsigned long long rolling_hash(const char *str)
{
        unsigned long long hash = 0;
        size_t str_len = strlen(str);
        for(int i = 0, k = str_len -1; i < str_len; i++, k--) {
                hash = hash + str[i] * pow(257, k);
        //      hash = hash % MOD;
        }
        return hash;
}

int main(void)
{
        printf("%llu\n", rolling_hash("TestString"));
        printf("%llu\n", rolling_hash("estStringh"));
        unsigned long long old = rolling_hash("TestString");
        // Add a character to the end
        // since the last char in old was multiplied by 1, now multiply it by
        // the base and then add the _new_ character to the end
        old = old * 257 + 'h';
        //old = old % MOD;
        // Remove a char from the start
        // Simply, remove the hash value of the first character
        old = old - 'T' * pow(257, 10);;

        printf("\n%llu\n", old);
        return 0;
}

The code above works perfectly fine as long as I do not introduce any remainder operations; once I uncomment my % operations, things break down and the answer I get from the changes over the rolling hash won't equal that which's being printed by the second print.

janisz's answer:
The suggestion to change the hash generator as in janisz's answer got the remainder to work when adding new characters but NOT when removing the old ones.
Note: I am using my own pow function to work with unsigned long long

回答1:

Hash genrator code is wrong. It should be

hash = (hash*257 + str[i]) % MOD;

and unncoment old_hash = old_hash % MOD;. Also change the way you generate new hash from previous

(old_hash - to_delete_char * pow(257, str_len-1)) % MOD;

Take a look at your code. First 2 lines are perfectly good. What happen in the loop. First of all you are doing as much multiplies as you can. In my approach I use Horner scheme of computing hash becouse hash is a polynomial.

Why it works when without modulus and with not. I think it's a coincidence becouse you overflow integer with 8 characters (log(2^64)/log(257) = 8).

Now what's wrong with removing characters. to_delete_char * pow(257, str_len); should be to_delete_char * pow(257, str_len-1); index should start from 0 not 1 to mach your generator.

EDIT: I think problem was in pow function. As I wrote above it overflow just with 8 characters. In your example you have 10 so it can't work.

EDIT: It turns out that adding and removing character must be done as a one operation. Probably due to equivalents but I'm not sure.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MOD 787

unsigned long long pow(int x, int y)
{
    unsigned long long ret = 1;
    for (int i=0;i<y;i++)
        ret = (ret*x)%MOD;
    return ret;
}
unsigned long long rolling_hash(const char *str)
{
        unsigned long long hash = 0;
        size_t str_len = strlen(str);
        for(int i = 0, k = str_len -1; i < str_len; i++, k--) {
                hash = hash + (str[i] * pow(257, k))%MOD;
                hash = hash % MOD;
        }
        return hash;
}

int main(void)
{
        char input[] = "TestString";
        printf("Input: %llu\n", rolling_hash(input));
        printf("Expected: %llu\n", rolling_hash("estStringh"));
        unsigned long long old = rolling_hash(input);
        // Add a character to the end
        // and Remove a char from the start

        unsigned long long  h = (input[0] * pow(257, strlen(input)))%MOD;
        old = ((old * 257) + 'h' - h) % MOD;

        printf("Actual: %llu\n", old);
        return 0;
}


回答2:

I think, using pow() is slow and dangerous, because of it returns double value, and for long strings there can be computation error (double precision error), and subtraction value is not exactly as same as addition. This can result elusive bug, when string cannot be matched.

I suggest you to use cyclic shift and XOR instead. These operation are quick and do not have "float/double precision error"

uint32_t hash = 0;
// This is not changed during cycle, so can be computed once before search.
int rols = str_len & 31; 

Adding to hash:

hash ^= ch;
hash = (hash << 1) | (hash >> 31);

Remove from hash:

uint32_t x = ch;
x = (x << rols) | (x >> (32 - rols));
hash ^= x;

Important: Need to apply Remove after Adding.