Fetch all the duplicate records from a list withou

Below is my list, whose columns are country, gender and age.

scala> funList
res1: List[(String, String, String)] = List((india,M,15), (usa,F,25), (australia,M,35), (kenya,M,55), (russia,M,75), (china,T,95), (england,F,65), (germany,F,25), (finland,M,45), (australia,F,35))

My goal is to find the records that are duplicated on the combination (country, age). Please note that I want to fetch only the duplicate records and ignore the others. The result should also contain the other column values of those duplicate records.

Output should be like this:

australia,M,35
australia,F,35

It would be nice to do this without a groupBy operation and without O(n²) complexity. groupBy is fine as long as it doesn't mess up my output.

Answer 1:

Without groupBy(). Not sure about the complexity.

val keys = funList.map{case (a,b,c) => (a,c)}  //isolate elements of interest
val dups = keys diff keys.distinct             //find the duplicates
funList.filter{case (a,b,c) => dups.contains((a,c))}
//res0: List[(String, String, String)] = List((australia,M,35), (australia,F,35))
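On the complexity question: dups is a List, so each dups.contains call in the filter is a linear scan. A minimal sketch of the same idea with the duplicated keys held in a Set is shown below, so each membership test is effectively constant time. The object name SetLookupSketch and the val names are only illustrative.

```scala
object SetLookupSketch {
  // Sample data from the question (ages kept as strings, as in funList)
  val funList: List[(String, String, String)] = List(
    ("india", "M", "15"), ("usa", "F", "25"), ("australia", "M", "35"),
    ("kenya", "M", "55"), ("russia", "M", "75"), ("china", "T", "95"),
    ("england", "F", "65"), ("germany", "F", "25"), ("finland", "M", "45"),
    ("australia", "F", "35"))

  // Project each record onto its (country, age) key
  val keys: List[(String, String)] =
    funList.map { case (country, _, age) => (country, age) }

  // Keys that occur more than once, stored in a Set for fast membership tests
  val dupKeys: Set[(String, String)] = (keys diff keys.distinct).toSet

  // Keep every record whose key is duplicated; the original order is preserved
  val duplicates: List[(String, String, String)] =
    funList.filter { case (country, _, age) => dupKeys.contains((country, age)) }
}
```

Because the final step is a filter over funList, the records come back in their original order.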

Answer 2:

I'm not sure what you mean by groupBy possibly messing up the output. You can use it as follows to get back the list of duplicates you're looking for:

// input
val items = List(("india","M",15), ("usa","F",25), ("australia","M",35), ("kenya","M",55), ("russia","M",75), ("china","T",95), ("england","F",65), ("germany","F",25), ("finland","M",45), ("australia","F",35))

items.
  groupBy { case (nation, _, age) => nation -> age }. // group by relevant items
  filter(_._2.length > 1).                            // keep only duplicates
  flatMap(_._2)                                       // get them and flatten the result

Alternatively, you may be interested in using the idea behind groupBy as the basis for your own function that buckets values by a key and filters the resulting groups by some predicate, like the following (it relies on CanBuildFrom, so it targets Scala 2.12 and earlier):

implicit class FilterGroups[A, CC[X] <: Iterable[X]](self: CC[A]) {
  import scala.collection.mutable
  import scala.collection.mutable.Builder
  import scala.collection.generic.CanBuildFrom
  def filterGroups[K, That](f: A => K)(p: CC[A] => Boolean)(implicit bfs: CanBuildFrom[CC[A], A, CC[A]], bf: CanBuildFrom[CC[A], A, That]): That = {
    val m = mutable.Map.empty[K, Builder[A, CC[A]]]
    for (elem <- self) {
      val key = f(elem)
      val bldr = m.getOrElseUpdate(key, bfs())
      bldr += elem
    }
    val b = bf()
    for {
      (_, v) <- m
      group = v.result if p(group)
      elem <- group
    } b += elem
    b.result
  }
}

You'd then invoke it as follows:

// bucket by the first function, filter by the second one
items.filterGroups(tuple => (tuple._1, tuple._3))(_.length > 1)

And, just as above, get back the desired list of items:

List((australia,M,35), (australia,F,35))

The main advantage of the alternative solution is that the output type is the same as the input type, whereas using groupBy forces your result type to be an Iterable[(String, String, Int)]. Maybe this is what you meant by messing up the output.

Either way, I'd say the time complexity is effectively linear: we make one pass to bucket and one to filter, and constant factors drop out in big-O notation. This of course means that the space complexity is bound to the collection size as well, since we bucket the original elements.
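The two-pass idea can also be sketched without building groups at all: count the keys in one pass, then filter the original list in a second pass. This is only an illustration under the assumption that a mutable counting map is acceptable; duplicatesBy and TwoPassSketch are hypothetical names, not library functions.

```scala
import scala.collection.mutable

object TwoPassSketch {
  // Returns the elements whose key occurs more than once, in their original order.
  def duplicatesBy[A, K](xs: List[A])(key: A => K): List[A] = {
    // Pass 1: count how often each key occurs
    val counts = mutable.Map.empty[K, Int]
    xs.foreach { x =>
      val k = key(x)
      counts.update(k, counts.getOrElse(k, 0) + 1)
    }
    // Pass 2: keep the elements whose key was seen more than once
    xs.filter(x => counts(key(x)) > 1)
  }

  val items: List[(String, String, Int)] = List(
    ("india", "M", 15), ("usa", "F", 25), ("australia", "M", 35),
    ("kenya", "M", 55), ("russia", "M", 75), ("china", "T", 95),
    ("england", "F", 65), ("germany", "F", 25), ("finland", "M", 45),
    ("australia", "F", 35))

  // Bucket by (country, age), as in the question
  val result: List[(String, String, Int)] =
    duplicatesBy(items)(t => (t._1, t._3))
}
```

Time and space both stay linear in the size of the list, the result is still a List, and the records keep their original relative order.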

One final note: you may want to avoid lists when you are measuring size, since computing the length of a List is linear. Both my solution and the one using groupBy build groups of the same type as the original collection, so you may want to use a Vector or some other collection that computes its length in O(1).
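If you would rather stay with List, one option is to replace the length check with lengthCompare, which stops walking a list as soon as the comparison is decided. A minimal sketch of the groupBy version with that change (the object name LengthCompareSketch is only illustrative):

```scala
object LengthCompareSketch {
  val items: List[(String, String, Int)] = List(
    ("india", "M", 15), ("usa", "F", 25), ("australia", "M", 35),
    ("kenya", "M", 55), ("russia", "M", 75), ("china", "T", 95),
    ("england", "F", 65), ("germany", "F", 25), ("finland", "M", 45),
    ("australia", "F", 35))

  val duplicates: List[(String, String, Int)] =
    items
      .groupBy { case (nation, _, age) => (nation, age) } // group by relevant items
      .filter(_._2.lengthCompare(1) > 0)                  // "more than one element", without a full traversal
      .flatMap(_._2)                                      // flatten the surviving groups
      .toList
}
```

Note that the groups come out of a Map, so the order between different duplicate groups is not guaranteed; only the order within each group follows the input.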

But the right answer is probably to just use groupBy, which is simple and clear to any other Scala developer (although you probably also want to iterate over the groups through a lazy view to avoid unnecessary extra passes over the data).
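One way to read the lazy-iteration suggestion, sketched here as an assumption rather than as the only approach, is to walk the groups with valuesIterator so that the filter step does not build a second intermediate Map. The object name LazyGroupsSketch is only illustrative.

```scala
object LazyGroupsSketch {
  val items: List[(String, String, Int)] = List(
    ("india", "M", 15), ("usa", "F", 25), ("australia", "M", 35),
    ("kenya", "M", 55), ("russia", "M", 75), ("china", "T", 95),
    ("england", "F", 65), ("germany", "F", 25), ("finland", "M", 45),
    ("australia", "F", 35))

  val duplicates: List[(String, String, Int)] =
    items
      .groupBy { case (nation, _, age) => (nation, age) }
      .valuesIterator                  // iterate the groups lazily, no filtered Map is materialized
      .filter(_.lengthCompare(1) > 0)  // keep only groups with more than one record
      .flatMap(group => group)         // emit the records of each surviving group
      .toList                          // the single point where the result is built
}
```

The grouping itself is still eager; only the filtering and flattening are fused into one traversal of the groups.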