I want to implement a string matching (Boyer-Moore) algorithm using Hadoop. I have only just started using Hadoop, so I have no idea how to write a Hadoop program in Java.
All the sample programs that I have seen so far are word counting examples, and I couldn't find any sample programs for string matching.
I tried searching for tutorials that teach how to write Hadoop applications in Java but couldn't find any. Can you suggest some tutorials where I can learn how to write Hadoop applications in Java?
Thanks in advance.
I haven't tested the code below, but it should get you started. I have used the BoyerMoore implementation available here.
What the code below does:
The goal is to search for a pattern in an input document. The BoyerMoore class is initialized in the setup method using the pattern set in the configuration.
The mapper receives one line at a time and uses the BoyerMoore instance to search it for the pattern. If a match is found, we write it out using the context.
There is no need for a reducer here. If the pattern is found multiple times in different mappers, the output will have multiple offsets (one per mapper).
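To make the per-line search concrete, here is a minimal plain-Java sketch of what the mapper could run on each line. It is not the linked BoyerMoore implementation: the class name BadCharSearch and the method findAll are made up for illustration, and it uses only the bad-character rule of Boyer-Moore (no good-suffix rule).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration: Boyer-Moore with the bad-character rule only.
class BadCharSearch {
    // right[c] = rightmost index of char c in the pattern, or -1 if c does not occur.
    private final int[] right = new int[Character.MAX_VALUE + 1];
    private final String pattern;

    BadCharSearch(String pattern) {
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        this.pattern = pattern;
        Arrays.fill(right, -1);
        for (int j = 0; j < pattern.length(); j++) {
            right[pattern.charAt(j)] = j;
        }
    }

    // Returns the start index of every occurrence of the pattern within one line.
    List<Integer> findAll(String line) {
        List<Integer> hits = new ArrayList<>();
        int m = pattern.length();
        int n = line.length();
        int i = 0;
        while (i <= n - m) {
            int skip = 0;
            // Compare right to left; on a mismatch, shift using the bad-character table.
            for (int j = m - 1; j >= 0; j--) {
                char c = line.charAt(i + j);
                if (pattern.charAt(j) != c) {
                    skip = Math.max(1, j - right[c]);
                    break;
                }
            }
            if (skip == 0) {
                hits.add(i); // full match at position i
                skip = 1;    // move on by one so overlapping matches are also found
            }
            i += skip;
        }
        return hits;
    }
}
```

For example, new BadCharSearch("abc").findAll("abcabc") returns [0, 3]. In a mapper, one such object would be built once in setup and findAll called on each incoming line.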
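A rough sketch of how such a map-only job could be wired up is shown below. It is untested and needs the Hadoop client libraries on the classpath. The class names, the configuration key "search.pattern" and the argument order are assumptions made for this example, and String.indexOf is only a stand-in for the Boyer-Moore search so that the block stays self-contained.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PatternSearchJob {

    // Made-up configuration key for this sketch.
    static final String PATTERN_KEY = "search.pattern";

    public static class SearchMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        private String pattern;

        @Override
        protected void setup(Context context) {
            // Runs once per map task; a BoyerMoore instance would be built here from the pattern.
            pattern = context.getConfiguration().get(PATTERN_KEY);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // With the default TextInputFormat, key is the byte offset of this line in the file.
            // Stand-in search: replace indexOf with the Boyer-Moore search.
            if (value.toString().indexOf(pattern) >= 0) {
                context.write(key, value);
            }
        }
    }

    // Assumed usage: hadoop jar search.jar PatternSearchJob <input> <output> <pattern>
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set(PATTERN_KEY, args[2]);

        Job job = Job.getInstance(conf, "pattern search");
        job.setJarByClass(PatternSearchJob.class);
        job.setMapperClass(SearchMapper.class);
        job.setNumReduceTasks(0); // map-only job: mapper output is written directly
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With the number of reduce tasks set to zero, each map task writes its own output file, which is why matches found by different mappers show up as separate entries.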
I don't know if this is the correct way to run an algorithm in parallel, but this is what I figured out:
I'm using AWS (Amazon Web Services), so I can select from the console the number of nodes that I want my program to run on simultaneously. So I'm assuming that the map and reduce methods I have used should be enough to run the Boyer-Moore string matching algorithm in parallel.