How to find a specific record in Spark on a cluster

Published 2019-09-18 05:08

Question:

I have a simple CSV like this, but it's got 1 million records:

Name, Age
Waldo, 5
Emily, 7
John, 4
Amy Johns, 2
Kevin, 4
...

I want to find someone with the name "Amy Johns". I have a Spark cluster of 10 machines. Assuming rdd contains the RDD of the CSV, how can I take advantage of the cluster so that I can...

  1. Split up the work so that each of the 10 machines is working on 1/10th of the original gigantic set.
  2. When the FIRST occurrence of "Amy Johns" is found and output to the console, the job is done. (e.g. if machine #4 finds "Amy Johns", all the other machines should stop looking and the result is output.)

My code right now just does: rdd = sc.textFile(...)

Then it does rdd.foreach(...), where the closure checks whether the name field is "Amy Johns" and, if so, exits.

The problem I have with this is that the rdd contains ALL the records (if this is not the case, speak up), so I don't think the work is being distributed. Also, I don't know how to finish/stop the job once "Amy Johns" is found.
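
In sketch form, the code is something like this (simplified; the file path is just a placeholder, and the commented part is exactly what I don't know how to do):

val rdd = sc.textFile("people.csv")   // placeholder path to the CSV

rdd.foreach { line =>
  if (line.split(",")(0) == "Amy Johns") {
    println(line)
    // ...and now somehow stop? This closure runs on the executors,
    // so returning or throwing here does not cleanly end the whole job.
  }
}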

Answer 1:

Your RDD does, by definition, contain all of the records, but that does not mean a single machine processes them: the RDD is divided into partitions, and each partition is handled by a separate task on the cluster. You can control the number of partitions when you read the file to increase the parallelism of the computation, and then apply a filter transformation to keep only the elements that match your criteria.

You may want to try something like this:

val myRDD = sc.textFile(inputPath, 10)   // ask for at least 10 partitions when reading
val filteredRDD = myRDD.filter(line => line.split(",")(0).equals("Amy Johns"))
filteredRDD.take(1).foreach(println)     // prints the first match, if there is one
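
A couple of notes on how this maps back to your two requirements (this is standard Spark behaviour, nothing specific to the snippet): the second argument to textFile is a minimum number of partitions, so the read is split into at least 10 pieces that can be processed in parallel, and take(1) evaluates the job incrementally, scanning one partition first and only launching tasks on further partitions if no match has been found yet, so the scan effectively stops once "Amy Johns" turns up. If you want to confirm how the data was actually split, a quick check (using the myRDD value from above) is:

println(s"Number of partitions: ${myRDD.getNumPartitions}")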