I am trying to implement the Rabin-Karp for finding the substring; and I got stuck at the rolling hash(trying to use the formula suggested in Wikipedia).
#define MOD 1000000007
unsigned long long rolling_hash(const char *str)
{
unsigned long long hash = 0;
size_t str_len = strlen(str);
for(int i = 0, k = str_len -1; i < str_len; i++, k--) {
hash = hash + str[i] * pow(257, k);
// hash = hash % MOD;
}
return hash;
}
int main(void)
{
printf("%llu\n", rolling_hash("TestString"));
printf("%llu\n", rolling_hash("estStringh"));
unsigned long long old = rolling_hash("TestString");
// Add a character to the end
// since the last char in old was multiplied by 1, now multiply it by
// the base and then add the _new_ character to the end
old = old * 257 + 'h';
//old = old % MOD;
// Remove a char from the start
// Simply, remove the hash value of the first character
old = old - 'T' * pow(257, 10);;
printf("\n%llu\n", old);
return 0;
}
The code above works perfectly fine as long as I do not introduce any remainder operations; once I uncomment my %
operations, things break down and the answer I get from the changes over the rolling hash won't equal that which's being printed by the second print.
janisz's answer:
The suggestion to change the hash generator as in janisz's answer got the remainder to work when adding new characters but NOT when removing the old ones.
Note: I am using my own pow
function to work with unsigned long long
Hash genrator code is wrong. It should be
hash = (hash*257 + str[i]) % MOD;
and unncoment old_hash = old_hash % MOD;
. Also change the way you generate new hash from previous
(old_hash - to_delete_char * pow(257, str_len-1)) % MOD;
Take a look at your code. First 2 lines are perfectly good. What happen in the loop.
First of all you are doing as much multiplies as you can. In my approach I use Horner scheme of computing hash becouse hash is a polynomial.
Why it works when without modulus and with not. I think it's a coincidence becouse you overflow integer with 8 characters (log(2^64)/log(257) = 8).
Now what's wrong with removing characters. to_delete_char * pow(257, str_len);
should be to_delete_char * pow(257, str_len-1);
index should start from 0 not 1 to mach your generator.
EDIT:
I think problem was in pow function. As I wrote above it overflow just with 8 characters. In your example you have 10 so it can't work.
EDIT: It turns out that adding and removing character must be done as a one operation. Probably due to equivalents but I'm not sure.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define MOD 787
unsigned long long pow(int x, int y)
{
unsigned long long ret = 1;
for (int i=0;i<y;i++)
ret = (ret*x)%MOD;
return ret;
}
unsigned long long rolling_hash(const char *str)
{
unsigned long long hash = 0;
size_t str_len = strlen(str);
for(int i = 0, k = str_len -1; i < str_len; i++, k--) {
hash = hash + (str[i] * pow(257, k))%MOD;
hash = hash % MOD;
}
return hash;
}
int main(void)
{
char input[] = "TestString";
printf("Input: %llu\n", rolling_hash(input));
printf("Expected: %llu\n", rolling_hash("estStringh"));
unsigned long long old = rolling_hash(input);
// Add a character to the end
// and Remove a char from the start
unsigned long long h = (input[0] * pow(257, strlen(input)))%MOD;
old = ((old * 257) + 'h' - h) % MOD;
printf("Actual: %llu\n", old);
return 0;
}
I think, using pow() is slow and dangerous, because of it returns double value, and for long strings there can be computation error (double precision error), and subtraction value is not exactly as same as addition. This can result elusive bug, when string cannot be matched.
I suggest you to use cyclic shift and XOR instead. These operation are quick and do not have "float/double precision error"
uint32_t hash = 0;
// This is not changed during cycle, so can be computed once before search.
int rols = str_len & 31;
Adding to hash:
hash ^= ch;
hash = (hash << 1) | (hash >> 31);
Remove from hash:
uint32_t x = ch;
x = (x << rols) | (x >> (32 - rols));
hash ^= x;
Important: Need to apply Remove after Adding.