Map key, value pairs based on the similarity of their values

Published 2019-09-20 02:29

Question:

I have been learning Spark for several weeks. Currently I am trying to group several items or people based on their connections, using Spark and Hadoop in Scala. For example, I want to see how football players are connected based on their club history. My "players" RDD would be:

(John, FC Sion)
(Mike, FC Sion)
(Bobby, PSV Eindhoven)
(Hans, FC Sion)

I want to have an RDD like this:

(John, <Mike, Hans>)
(Mike, <John, Hans>)
(Bobby, <>)
(Hans, <Mike, John>)

I plan to use map to accomplish this.

val splitClubs = players.map(player => (player._1, parseTeammates(player._2, players)))

Here, parseTeammates is a function that finds the players who also play for the same club (player._2):

// RDD is not a type, how can I insert rdd into a function?
def parseTeammates(club: String, rdd: RDD) : List[String] = {
    // will generate a list of players that have the same "club" value
    val playerList = rdd.filter(_._2 == club)
    return playerList.values;
}

I get a compilation error (type mismatch): the function is expected to return List[String], but playerList.values instead returns an org.apache.spark.rdd.RDD. Can anybody help me get the value of an RDD in its simple form (in my case, List[String])?

Also, I think there is a more elegant way to solve this problem than creating a separate RDD, looking up a certain key in that new RDD, and then returning the value as a list.

Answer 1:

I think your parseTeammates approach is a little off in the world of RDDs. When it comes to dealing with RDDs and potentially really, REALLY large amounts of data, you don't want to do this kind of nested looping (and Spark will not let you reference one RDD from inside a transformation on another anyway). Try instead to re-organize your data.

The code below will get you what you want:

players.map { case (player, club) => (club, List(player)) }
  .reduceByKey(_ ++ _)
  .flatMap { case (_, list) =>
    list.zipWithIndex.map { case (player, index) => (player, list.take(index) ++ list.drop(index + 1)) }
  }

Note that I first organize the data according to the club they played for and then afterwards combine the players to yield the result in the format you are looking for.
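For concreteness, here is a minimal, self-contained sketch of that pipeline run on the sample data from your question (the local SparkContext setup is just illustrative, and the ordering of rows and teammates may vary):

import org.apache.spark.{SparkConf, SparkContext}

// Local setup, for illustration only.
val sc = new SparkContext(new SparkConf().setAppName("teammates").setMaster("local[*]"))

val players = sc.parallelize(Seq(
  ("John", "FC Sion"), ("Mike", "FC Sion"),
  ("Bobby", "PSV Eindhoven"), ("Hans", "FC Sion")))

val teammates = players.map { case (player, club) => (club, List(player)) }
  .reduceByKey(_ ++ _)
  .flatMap { case (_, list) =>
    list.zipWithIndex.map { case (player, index) => (player, list.take(index) ++ list.drop(index + 1)) }
  }

teammates.collect().foreach(println)
// e.g. (Bobby,List()), (John,List(Mike, Hans)), (Mike,List(John, Hans)), (Hans,List(John, Mike))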

I hope this helps.



Answer 2:

A different take on @Glennie's solution (who IMO is right about your initial approach being unfit).

TL;DR:

import scala.collection.mutable

players.map { case (player, team) => (team, mutable.HashSet[String](player)) }
  .reduceByKey(_ ++= _)
  .flatMap {
      case (team, players) => {
        for (player <- players)
          yield (player, players - player)
      }
  }

The basic idea is the same (build a list of teammates keyed by team, and flatMap this result), but I suggest using other building blocks for the same result. Whether or not this is a win depends on taste and on the performance characteristics of your dataset.

Different take on the reduceByKey

Reducing by key, here, involves concatenating a collection (of players) with one or more additional players. If we take the original code:

players.map { case (player, club) => (club, List(player)) }
  .reduceByKey(_ ++ _)

Internally, we will end up calling something like this (as of Spark 1.4):

def add: (List[String], List[String]) => List[String] = _ ++ _

players.map { case (player, team) => (team, List(player)) }
       .combineByKey(
           // the first time we see a new team on each partition
           (list: List[String]) => list,
           // invoked each time we merge a player into its team's list
           // (i.e. map-side combine)
           add,
           // invoked each time we merge a team's partial lists
           // (i.e. reduce-side combine)
           add)

The takeaway here is that the add operation (_ ++ _) is called a lot of times, so it had better be optimized.
In this respect, we know that List performs poorly, because each concatenation entails copying the whole accumulated list into a new one. Please note: "poorly" may actually be irrelevant. If you have millions of teams and only 20 players per team, then the ++ performance might be dwarfed by the other Spark computations involved in the reduce.

(Off the top of my head, there's worse: if the List becomes really big, then, since some of the operations involved in its serialization are implemented recursively, we might hit a stack overflow. I'd have to check on that.)
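To make that cost concrete, here is a tiny plain-Scala sketch (no Spark involved) of the merge the reducer effectively performs for a single team; every ++ copies the accumulated left-hand list into a fresh one, so building a k-player list one player at a time does on the order of k² element copies:

// Plain-Scala illustration of the per-team merge.
val teamPlayers = Seq("John", "Mike", "Hans")
val merged = teamPlayers.map(List(_)).reduce(_ ++ _)
// merged == List("John", "Mike", "Hans"), but every ++ allocated and filled a new list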

So we might benefit from switching to a mutable collection, like so:

import scala.collection.mutable

players.map { case (player, team) => (team, mutable.ArrayBuffer[String](player)) }
  .reduceByKey(_ ++= _)

1. We now have a mutable collection, for which concatenation is optimized.
2. We use ++= instead of ++, so that each time we do not even have to allocate a brand-new collection when merging two existing ones.
3. If we know our dataset well, we may be able to pre-size the buffer so as to have predictable memory allocation and avoid as much buffer resizing as possible, or switch implementations accordingly (see the sketch below).
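As a sketch of point 3 (the helper name and the pre-size of 30 players per team are assumptions for illustration, not something from the original answer), the reduceByKey can also be spelled as an explicit combineByKey that creates every buffer with an initial capacity:

import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Hypothetical helper: assumes roughly 30 players per team and pre-sizes each buffer
// accordingly, so that map-side and reduce-side merges rarely trigger a resize.
def teammateBuffers(players: RDD[(String, String)]): RDD[(String, mutable.ArrayBuffer[String])] =
  players
    .map { case (player, team) => (team, player) }
    .combineByKey(
      // createCombiner: first player seen for a team on a partition
      (player: String) => { val buf = new mutable.ArrayBuffer[String](30); buf += player },
      // mergeValue: append one more player to an existing buffer (map-side combine)
      (buf: mutable.ArrayBuffer[String], player: String) => buf += player,
      // mergeCombiners: fuse two partial buffers (reduce-side combine)
      (a: mutable.ArrayBuffer[String], b: mutable.ArrayBuffer[String]) => a ++= b)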

Different take on the flatMap

Gains of using a mutable collection

The original implementation uses, once more, extensive list operations like take and drop, combined with a zip with index.

The use of a mutable collection serves us better in terms of readability here, as we can replace three immutable list copies (take, drop, ++):

list.take(index) ++ list.drop(index+1)

with a single one (the - operator performs a clone):

list - list(index)

Alternative syntax

We can also provide a totally different implementation, avoiding the zipping with index altogether by leveraging for comprehensions:

  .flatMap {
      case (team, players) => {
        for (player <- players)
          yield (player, players - player)
      }
    }

Note that the players - player step involves looking up the player in the collection. With an ArrayBuffer this is an O(n) operation, so we may, once again depending on the dataset, prefer a mutable.HashSet over an array buffer as the mutable collection if we go down this path.
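A tiny, purely illustrative comparison of the two removal costs (not from the original answer):

import scala.collection.mutable

val buf = mutable.ArrayBuffer("John", "Mike", "Hans")
val set = mutable.HashSet("John", "Mike", "Hans")

val bufWithoutMike = buf.clone() -= "Mike"  // O(n): scans the buffer to find "Mike" before removing it
val setWithoutMike = set.clone() -= "Mike"  // effectively O(1): hash lookup, then removal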

(I was going to add "provided we have no duplicate player names", but it does not actually matter: if you have two "John"s in a team, having two lines in your RDD for the two Johns holds no more meaning than a single one.)