Finding duplicates in a dataset in Scala

Posted 2019-09-15 06:44

Question:

I have a dataset, a DataSet of String, that contains the following data:

12348,5,233,234559,4
12348,5,233,234559,4
12349,6,233,234560,5
12350,7,233,234561,6

I want to find the duplicate rows in the dataset and remove them. In the example, the duplicated row is 12348,5,233,234559,4, and I want to output just a single instance of it.

How do I go about doing it?

Answer 1:

Dima's answer should work. Here is another solution.

I think (though I'm not positive) that groupBy holds all of the data in memory, so perhaps this would work better for you.

import scala.io.Source

val source = Source.fromFile("data.txt") // Assuming the data is in a file
val rows = try {
  source.getLines()  // Iterator over the lines in the file
    .foldLeft(Map.empty[String, Int]) { (acc, row) =>  // Start the fold from an empty Map
      acc + (row -> (acc.getOrElse(row, 0) + 1)) }  // Increment this row's count in the accumulator
    .filter { case (_, count) => count > 1 }  // Keep only the rows that occur more than once
} finally source.close()  // Release the file handle once the fold is done

I'm new to Scala myself; I actually spent a while answering this as practice, haha. It's confusing at first, but it makes sense!

Think of a Map like a dictionary: it stores key/value pairs. In Scala, adding a pair to an immutable Map returns a new Map with that pair added or updated. For example, Map("b" -> 4) + ("c" -> 2) returns Map(b -> 4, c -> 2), and Map("b" -> 4, "c" -> 2) + ("b" -> 1) returns Map(b -> 1, c -> 2).

Here acc (renamed from count for clarity) is the accumulator: the Map that grows as the iterator is folded. For each row, the fold checks whether that row is already a key in the Map (again, think dictionary). If it is, getOrElse returns the previous count and 1 is added to it; if it is not, getOrElse returns the default 0, so the count starts at 1 because this is the first time the row has been seen. Either way, acc is then updated with the new pair.
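To make the Map behaviour concrete, here is a small sketch that runs the same fold over the four sample rows from the question, held in an in-memory List instead of a file. The names sample, base, added, updated, counts, and duplicates are mine, chosen only for illustration, and the snippet is meant to be run as a script or pasted into the REPL.

```scala
// Sample rows from the question; `sample` is an illustrative name.
val sample = List(
  "12348,5,233,234559,4",
  "12348,5,233,234559,4",
  "12349,6,233,234560,5",
  "12350,7,233,234561,6"
)

// Adding a pair to an immutable Map returns a new Map.
val base = Map("b" -> 4)
val added = base + ("c" -> 2)     // new key is added
val updated = added + ("b" -> 1)  // existing key gets its value replaced

// Same fold as in the answer: count how many times each row occurs.
val counts = sample.foldLeft(Map.empty[String, Int]) { (acc, row) =>
  acc + (row -> (acc.getOrElse(row, 0) + 1))
}

// Rows that were seen more than once.
val duplicates = counts.filter { case (_, n) => n > 1 }.keySet

println(duplicates) // prints Set(12348,5,233,234559,4)
```

If the goal is to drop the duplicates and keep one copy of each row, counts.keySet already holds every distinct row exactly once, although the original order is not guaranteed.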

Here is the best blog I found for learning folding. The author describes it succinctly and accurately: https://coderwall.com/p/4l73-a/scala-fold-foldleft-and-foldright



Answer 2:

dataSet.groupBy(identity).collect { case (k,v) if v.size > 1 => k }
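As a rough sketch of what this one-liner does, the snippet below assumes dataSet is an ordinary Scala collection (a List here, which is my assumption; a distributed dataset type may expose a different groupBy). The names grouped, duplicated, and unique are illustrative.

```scala
// The sample rows from the question, assumed here to be a plain List[String].
val dataSet = List(
  "12348,5,233,234559,4",
  "12348,5,233,234559,4",
  "12349,6,233,234560,5",
  "12350,7,233,234561,6"
)

// groupBy(identity) maps each distinct row to the list of its occurrences.
val grouped: Map[String, List[String]] = dataSet.groupBy(identity)

// collect keeps only the keys whose group holds more than one element.
val duplicated = grouped.collect { case (k, v) if v.size > 1 => k }.toList

// The keys of the grouped Map are the rows with duplicates removed.
val unique = grouped.keySet

println(duplicated) // prints List(12348,5,233,234559,4)
```

Note that grouped keeps every occurrence of every row, so this approach trades memory for brevity.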



Answer 3:

If you use Scala collections (like Seq or List), you have a method called .distinct. Otherwise, you can convert the data to a Set, which removes duplicates by default (but doesn't preserve the order).
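A minimal sketch of both options on the sample rows from the question, assuming they are held in a List (the names input, deduped, and asSet are illustrative):

```scala
// The sample rows from the question, assumed here to be a plain List[String].
val input = List(
  "12348,5,233,234559,4",
  "12348,5,233,234559,4",
  "12349,6,233,234560,5",
  "12350,7,233,234561,6"
)

// distinct keeps the first occurrence of each element and preserves order.
val deduped = input.distinct

// toSet also removes duplicates, but a Set makes no ordering promise in general.
val asSet = input.toSet

// Prints the three remaining rows, one per line, in their original order.
println(deduped.mkString("\n"))
```

distinct is usually the simpler choice when the output should keep the order of the input.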