I am trying to implement the Rabin-Karp for finding the substring; and I got stuck at the rolling hash(trying to use the formula suggested in Wikipedia).
#define MOD 1000000007
unsigned long long rolling_hash(const char *str)
{
unsigned long long hash = 0;
size_t str_len = strlen(str);
for(int i = 0, k = str_len -1; i < str_len; i++, k--) {
hash = hash + str[i] * pow(257, k);
// hash = hash % MOD;
}
return hash;
}
int main(void)
{
printf("%llu\n", rolling_hash("TestString"));
printf("%llu\n", rolling_hash("estStringh"));
unsigned long long old = rolling_hash("TestString");
// Add a character to the end
// since the last char in old was multiplied by 1, now multiply it by
// the base and then add the _new_ character to the end
old = old * 257 + 'h';
//old = old % MOD;
// Remove a char from the start
// Simply, remove the hash value of the first character
old = old - 'T' * pow(257, 10);;
printf("\n%llu\n", old);
return 0;
}
The code above works perfectly fine as long as I do not introduce any remainder operations; once I uncomment my %
operations, things break down and the answer I get from the changes over the rolling hash won't equal that which's being printed by the second print.
janisz's answer:
The suggestion to change the hash generator as in janisz's answer got the remainder to work when adding new characters but NOT when removing the old ones.
Note: I am using my own pow
function to work with unsigned long long
I think, using pow() is slow and dangerous, because of it returns double value, and for long strings there can be computation error (double precision error), and subtraction value is not exactly as same as addition. This can result elusive bug, when string cannot be matched.
I suggest you to use cyclic shift and XOR instead. These operation are quick and do not have "float/double precision error"
Adding to hash:
Remove from hash:
Important: Need to apply Remove after Adding.
Hash genrator code is wrong. It should be
and unncoment
old_hash = old_hash % MOD;
. Also change the way you generate new hash from previousTake a look at your code. First 2 lines are perfectly good. What happen in the loop. First of all you are doing as much multiplies as you can. In my approach I use Horner scheme of computing hash becouse hash is a polynomial.
Why it works when without modulus and with not. I think it's a coincidence becouse you overflow integer with 8 characters (log(2^64)/log(257) = 8).
Now what's wrong with removing characters.
to_delete_char * pow(257, str_len);
should beto_delete_char * pow(257, str_len-1);
index should start from 0 not 1 to mach your generator.EDIT: I think problem was in pow function. As I wrote above it overflow just with 8 characters. In your example you have 10 so it can't work.
EDIT: It turns out that adding and removing character must be done as a one operation. Probably due to equivalents but I'm not sure.