I've seen this Rabin Karp string matching algorithm in the forums on the website and I'm interested in trying to implement it but I was wondering If anyone could tell me why the variables ulong Q and ulong D are 100007 and 256 respectively :S? What significance do these values carry with them?
static void Main(string[] args)
{
string A = "String that contains a pattern.";
string B = "pattern";
ulong siga = 0;
ulong sigb = 0;
ulong Q = 100007;
ulong D = 256;
for (int i = 0; i < B.Length; i++)
{
siga = (siga * D + (ulong)A[i]) % Q;
sigb = (sigb * D + (ulong)B[i]) % Q;
}
if (siga == sigb)
{
Console.WriteLine(string.Format(">>{0}<<{1}", A.Substring(0, B.Length), A.Substring(B.Length)));
return;
}
ulong pow = 1;
for (int k = 1; k <= B.Length - 1; k++)
pow = (pow * D) % Q;
for (int j = 1; j <= A.Length - B.Length; j++)
{
siga = (siga + Q - pow * (ulong)A[j - 1] % Q) % Q;
siga = (siga * D + (ulong)A[j + B.Length - 1]) % Q;
if (siga == sigb)
{
if (A.Substring(j, B.Length) == B)
{
Console.WriteLine(string.Format("{0}>>{1}<<{2}", A.Substring(0, j),
A.Substring(j, B.Length),
A.Substring(j + B.Length)));
return;
}
}
}
Console.WriteLine("Not copied!");
}
About the magic numbers Paul's answer is pretty clear.
As far as the code is concerned, Rabin Karp's principal idea is to perform an hash comparison between a sliding portion of the string and the pattern.
The hash cannot be computed each time on the whole substrings, otherwise the computation complexity would be quadratic
O(n^2)
instead of linearO(n)
.Therefore, a rolling hash function is applied, such as at each iteration only one character is needed to update the hash value of the substring.
So, let's comment your code:
^
This piece computes the hash of patternB
(sigb
), and the hashcode of the initial substring ofA
of the same length ofB
. Actually it's not completely correct because hash can collide¹ and so, it is necessary to modify the if statement :if (siga == sigb && A.Substring(0, B.Length) == B)
.^
Here's computedpow
that is necessary to perform the rolling hash.^
Finally, the remaining string (i.e. from the second character to end), is scanned updating the hash value of the A substring and compared with the hash of B (computed at the beginning).If the two hashes are equal, the substring and the pattern are compared¹ and if they're actually equal a message is returned.
¹ Hash values can collide; hence, if two strings have different hash values they're definitely different, but if the two hashes are equal they can be equal or not.
The algorithm uses hashing for fast string comparison. Q and D are magic numbers that the coder probably arrived at with a little bit of trial and error and give a good distribution of hash values for this particular algorithm.
You can see these types of magic numbers used for hashing many places. The example below is the decompiled definition of the GetHashCode function of a .NET 2.0 string type:
Here is another example from a R# generated GetHashcode override function for a sample type: