I have a simple CSV like this, but it's got 1 million records:
```
Name, Age
Waldo, 5
Emily, 7
John, 4
Amy Johns, 2
Kevin, 4
...
```
I want to find someone with the name "Amy Johns". I have a Spark cluster of 10 machines. Assuming `rdd` holds the RDD of the CSV, how can I take advantage of the cluster so that I can:

- Split up the work so that each of the 10 machines works on 1/10th of the original gigantic set.
- Finish the job as soon as the FIRST occurrence of "Amy Johns" is found and printed to the console (e.g., if machine #4 finds "Amy Johns", all the other machines should stop looking and the result is output; see the sketch after this list).
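For reference, this is the sort of thing I'm picturing, as far as I understand it. This is a PySpark sketch, not code I've verified; the path `people.csv` and the column handling are placeholders from my side:

```python
from pyspark import SparkContext

sc = SparkContext(appName="find-amy-johns")  # or reuse the sc I already have

rdd = sc.textFile("people.csv")  # placeholder path

# Keep only rows whose Name column is exactly "Amy Johns".
matches = rdd.filter(lambda line: line.split(",")[0].strip() == "Amy Johns")

# I've read that take(1) stops scanning once it has a result,
# but I don't know whether it really cancels work on the other 9 machines.
print(matches.take(1))
```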
My code right now just does this (PySpark):

```python
rdd = sc.textFile("people.csv")  # placeholder path
# Checks whether the Name field is "Amy Johns"; I have no way to exit from here.
rdd.foreach(lambda line: print(line) if line.split(",")[0].strip() == "Amy Johns" else None)
```
The problem I have with this is that `rdd` seems to contain ALL the records (if that's not the case, speak up), so I don't think the work is actually being distributed. Also, I don't know how to finish/stop the job once "Amy Johns" is found.
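In case it matters, here is how I've been poking at the partitioning. From the docs, I gather that `textFile()` already splits the data into partitions spread across the cluster (is that right?), and that `repartition()` lets me control the count; the 10 below just mirrors my 10 machines:

```python
# How many partitions did textFile() give me by default?
print(rdd.getNumPartitions())

# Force roughly one partition per machine if the default split is too coarse.
rdd = rdd.repartition(10)
print(rdd.getNumPartitions())  # 10
```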