Need help understanding rolling hash computation

Posted 2020-06-12 04:51

Question:

I've been trying to implement the Rabin-Karp algorithm in Java. I'm having a hard time computing the rolling hash value in constant time. I found one implementation at http://algs4.cs.princeton.edu/53substring/RabinKarp.java.html, but I still can't see how these two lines work.

txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
txtHash = (txtHash*R + txt.charAt(i)) % Q;  

I looked at a couple of articles on modular arithmetic, but none of them managed to penetrate my thick skull. Please give me some pointers for understanding this.

Answer 1:

This is the "rolling" aspect of the hash. It eliminates the contribution of the oldest character (txt.charAt(i-M)) and incorporates the contribution of the newest character (txt.charAt(i)).

The hash function is defined as:

            M-1
hash[i] = ( SUM { input[i-j] * R^j } ) % Q
            j=0

(where I'm using ^ to denote "to the power of".)

But this can be computed efficiently as a recurrence on the previous window's hash:

hash[i] = (hash[i-1]*R - input[i-M]*(R^M) + input[i]) % Q

Your reference code is doing this, but it's using various techniques to ensure that the result is always computed correctly (and efficiently).

So, for instance, the + Q in the first expression has no effect on the result modulo Q, but it keeps the intermediate value non-negative. In Java, % takes the sign of its left operand, so a negative difference would produce a negative hash. The code also breaks the calculation into stages, presumably to prevent numerical overflow.
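As a rough check of the explanation above, the following sketch compares the two-line rolling update with a direct evaluation of the definition on every window. The class name, the helper names, the sample text, and the small prime 997 are assumptions chosen for illustration; they are not taken from the reference implementation.

```java
class RollingUpdateSketch {
    static final int R = 256;   // radix (alphabet size)
    static final long Q = 997;  // small prime, assumed here only for illustration

    // Direct evaluation of the definition for the window of length m ending at index i.
    static long directHash(String txt, int i, int m) {
        long h = 0;
        for (int k = i - m + 1; k <= i; k++) {
            h = (h * R + txt.charAt(k)) % Q;  // Horner's rule, reduced at every step
        }
        return h;
    }

    // R^(m-1) % Q: the weight of the oldest character in the window.
    static long leadingWeight(int m) {
        long rm = 1;
        for (int k = 1; k <= m - 1; k++) {
            rm = (rm * R) % Q;
        }
        return rm;
    }

    // One rolling step: remove txt[i-m], then shift and append txt[i].
    static long roll(long prev, String txt, int i, int m, long rm) {
        long h = (prev + Q - rm * txt.charAt(i - m) % Q) % Q;  // + Q keeps h non-negative
        return (h * R + txt.charAt(i)) % Q;
    }

    public static void main(String[] args) {
        String txt = "abracadabra";
        int m = 4;
        long rm = leadingWeight(m);
        long h = directHash(txt, m - 1, m);  // hash of the first window
        for (int i = m; i < txt.length(); i++) {
            h = roll(h, txt, i, m, rm);
            System.out.println(i + ": rolled=" + h + " direct=" + directHash(txt, i, m));
        }
    }
}
```

Each printed line should show the same value twice, since the rolled hash and the directly computed hash describe the same window.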



Answer 2:

First you need to understand how the hash is computed.

Let's take the simple case of base 10 strings. How would you guarantee that the hash code of a string is unique? Base 10 is what we use to represent numbers, and there we have no collisions!

"523" = 5*10^2 + 2*10^1 + 3*10^0 = 523

Using the above hash function, you are guaranteed to get distinct hashes for distinct digit strings of the same length.

Given the hash of "523", if you want to calculate the hash of "238", i.e. by dropping the leftmost digit 5 and bringing in a new digit 8 from the right, you would have to do the following:

1) remove the effect of the 5 from the hash: hash = hash - 5*10^2 (523-500 = 23)

2) adjust the hash of the remaining chars by shifting by 1, hash = hash * 10

3) add the hash of the new character: hash = hash + 8 (230 + 8 = 238, which as we expected is the base 10 hash of "238")
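The three steps above can be sketched in a few lines of Java. The class and method names are made up for this illustration, and the window length is fixed at 3 digits so that the leading digit's weight is 10^2.

```java
class BaseTenRoll {
    // Slide a 3-digit window one position: drop oldDigit, append newDigit.
    static int roll(int hash, int oldDigit, int newDigit) {
        hash = hash - oldDigit * 100;  // step 1: 523 - 500 = 23
        hash = hash * 10;              // step 2: 23 * 10 = 230
        hash = hash + newDigit;        // step 3: 230 + 8 = 238
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(roll(523, 5, 8));  // prints 238
    }
}
```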

Now let's extend this to all ASCII characters. This takes us to the base 256 world. Therefore the hash of the same string "523" (still using the values 5, 2 and 3 for brevity, rather than the ASCII codes 53, 50 and 51) now is

= 5*256^2 + 2*256^1 + 3*256^0 = 327680 + 512 + 3 = 328195.

You can imagine that, as the string length increases, you will exceed the range of an integer/long in most programming languages relatively quickly.

How can we solve this? The routine solution is to do all the arithmetic modulo a large prime number. The drawback of this method is that different strings can now share a hash, so we will get false positives, which is a small price to pay if it takes the runtime of your algorithm from quadratic to linear!
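To make the false-positive point concrete, here is a small sketch in which two different digit strings share a hash once everything is reduced modulo a prime. The modulus 7 is deliberately tiny so that a collision is easy to find by hand; the class and method names are assumptions for this example.

```java
class CollisionSketch {
    // Base-10 hash of a digit string, reduced modulo q at every step.
    static long hash(String digits, long q) {
        long h = 0;
        for (int k = 0; k < digits.length(); k++) {
            h = (h * 10 + (digits.charAt(k) - '0')) % q;
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("523", 7));          // prints 5
        System.out.println(hash("530", 7));          // prints 5
        System.out.println("523".equals("530"));     // prints false
    }
}
```

Because equal hashes do not imply equal strings, a matcher built on this idea would typically compare the actual characters whenever the hashes agree.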

The complicated equation you quoted is nothing but steps 1-3 above done with modular arithmetic. The two properties of the modulus that it relies on are ->

a) (a*b) % p = ((a % p) * (b % p)) % p

b) a % p = (a + p) % p
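A quick way to convince yourself of both properties is to try them in Java. This is only a sketch: the class name, method names, and sample numbers are assumptions. It also shows why property b matters in practice, since Java's % returns a negative result for a negative left operand.

```java
class ModProperties {
    // Property a: reduce each factor first, then reduce the product.
    static long mulMod(long a, long b, long p) {
        return ((a % p) * (b % p)) % p;
    }

    // Property b applied to a subtraction: adding p keeps the result non-negative.
    // Assumes 0 <= x < p and y >= 0.
    static long subMod(long x, long y, long p) {
        return (x + p - y % p) % p;
    }

    public static void main(String[] args) {
        long p = 997;
        long a = 123456, b = 7890;
        System.out.println((a * b) % p == mulMod(a, b, p));  // prints true
        System.out.println((3 - 5) % p);                     // prints -2
        System.out.println(subMod(3, 5, p));                 // prints 995
    }
}
```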

Let's go back to steps 1-3 mentioned above ->

1) (expanded using property a) hash = hash - (((5 % p) * (10^2 % p)) % p)

vs. what you quoted

txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;

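Here is how the two are related!

  • RM = R^(M-1) % Q, which is 10^2 % p in our example (the weight of the leading digit)
  • RM*txt.charAt(i-M) % Q = (10^2 * 5) % p, which by property a is the same as ((5 % p) * (10^2 % p)) % p. In Java, * and % have equal precedence and group left to right, so the product is formed first and then reduced.
  • The additional + Q you see is just to ensure that the hash is not negative. See property b above.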

2 & 3) hash = hash*10 + 8 vs. txtHash = (txtHash*R + txt.charAt(i)) % Q; these are the same, except that the quoted line takes the modulus of the final result!

Looking at properties a & b more closely should help you figure it out!
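Putting the pieces together, the base 10 example can be replayed with the two quoted lines. This is a sketch under stated assumptions: R = 10, a window of 3 digits, and the small prime 97 standing in for Q; the class and method names are invented for the illustration.

```java
class ModularBaseTenRoll {
    static final long R = 10;  // radix of the toy example
    static final long Q = 97;  // small prime, assumed only for illustration

    // Same shape as the two quoted lines, with digits instead of characters.
    static long roll(long hash, int oldDigit, int newDigit, long rm) {
        hash = (hash + Q - rm * oldDigit % Q) % Q;  // step 1, modulo Q
        hash = (hash * R + newDigit) % Q;           // steps 2 and 3, modulo Q
        return hash;
    }

    public static void main(String[] args) {
        long rm = (R * R) % Q;                   // 10^2 % 97 = 3
        long start = 523 % Q;                    // 38
        System.out.println(roll(start, 5, 8, rm));  // prints 44
        System.out.println(238 % Q);                // prints 44
    }
}
```

The rolled value matches 238 % 97, so the hash of "238" is obtained without ever recomputing it from scratch.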