Given a list of documents, I want to obtain the pairs that share at least one token. To do this I wrote the code below, which builds the pairs through an inverted index.
object TestFlatMap {
  case class Document(id: Int, tokens: List[String])

  def main(args: Array[String]): Unit = {
    val documents = List(
      Document(1, List("A", "B", "C", "D")),
      Document(2, List("A", "B", "E", "F", "G")),
      Document(3, List("E", "G", "H")),
      Document(4, List("A", "L", "M", "N"))
    )

    // Expected (token, id) tuples
    val expectedTokensIds = List(("A",1), ("A",2), ("A",4), ("B",1), ("B",2), ("C",1), ("D",1), ("E",2), ("E",3), ("F",2), ("G",2), ("G",3), ("H",3), ("L",4), ("M",4), ("N",4))
    // Expected resulting pairs
    val expectedCouples = Set((1, 2), (1, 4), (2, 3), (2, 4))

    /**
     * For each token, emit the id of every document that contains it
     */
    val tokensIds = documents.flatMap { document =>
      document.tokens.map { token =>
        (token, document.id)
      }
    }

    // Check that the tuples are right
    assert(tokensIds.length == expectedTokensIds.length && tokensIds.intersect(expectedTokensIds).length == expectedTokensIds.length, "Error: tokens-ids do not match")

    // Group the tuples by token and keep only the tokens shared by more than one document
    val docIdsByToken = tokensIds.groupBy(_._1).filter(_._2.size > 1)

    /**
     * For each group of documents, generate the pairs
     */
    val couples = docIdsByToken.map { case (token, docs) =>
      docs.combinations(2).map { c =>
        val d1 = c.head._2
        val d2 = c.last._2
        if (d1 < d2) {
          (d1, d2)
        }
        else {
          (d2, d1)
        }
      }
    }.flatten.toSet

    /**
     * Same operation, but with flatMap:
     * for each group of documents, generate the pairs
     */
    val couples1 = docIdsByToken.flatMap { case (token, docs) =>
      docs.combinations(2).map { c =>
        val d1 = c.head._2
        val d2 = c.last._2
        if (d1 < d2) {
          (d1, d2)
        }
        else {
          (d2, d1)
        }
      }
    }.toSet

    // The results obtained with map + flatten pass the test
    assert(couples.size == expectedCouples.size && couples.intersect(expectedCouples).size == expectedCouples.size, "Error: couples do not match")
    // The results obtained with flatMap do not pass the test: they are wrong
    assert(couples1.size == expectedCouples.size && couples1.intersect(expectedCouples).size == expectedCouples.size, "Error: couples1 do not match")
  }
}
The problem is that the flatMap that should generate the final results does not work properly: it returns only two couples, (2,3) and (1,2). I do not understand why it does not work; moreover, IntelliJ suggests that I use flatMap instead of map followed by flatten.
Can someone explain where the problem is? I cannot figure it out, and I have run into this problem in the past as well.
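A reduced sketch that seems to isolate the difference. The names idsByToken, pairs, viaFlatten, viaFlatMap and viaIterable are hypothetical and not part of the code above. It assumes the relevant detail is that the grouped value is a Map: when flatMap is called on a Map with a function that yields tuples, the result is again a Map, so pairs that share the same first id would overwrite each other.

```scala
object FlatMapOnMapSketch {
  // Hypothetical reduced input: token -> ids of the documents that contain it
  val idsByToken: Map[String, List[Int]] = Map(
    "A" -> List(1, 2, 4),
    "E" -> List(2, 3)
  )

  // All id pairs of one group; the ids are already sorted, so head < last
  def pairs(ids: List[Int]): List[(Int, Int)] =
    ids.combinations(2).map(c => (c.head, c.last)).toList

  // map + flatten: the intermediate result is an Iterable of lists,
  // so every pair survives until toSet
  val viaFlatten: Set[(Int, Int)] =
    idsByToken.map { case (_, ids) => pairs(ids) }.flatten.toSet

  // flatMap on a Map with a tuple-producing function: the result is a
  // Map[Int, Int], so each first id keeps only one of its pairs
  val viaFlatMap: Map[Int, Int] =
    idsByToken.flatMap { case (_, ids) => pairs(ids) }

  // Possible workaround (an assumption, not the only option):
  // leave the Map before calling flatMap
  val viaIterable: Set[(Int, Int)] =
    idsByToken.toList.flatMap { case (_, ids) => pairs(ids) }.toSet

  def main(args: Array[String]): Unit = {
    println(s"map + flatten:  ${viaFlatten.size} pairs")   // 4 pairs
    println(s"flatMap on Map: ${viaFlatMap.size} entries") // 2 entries
    println(s"toList first:   ${viaIterable.size} pairs")  // 4 pairs
  }
}
```

If this is indeed what happens, calling .toList (or .iterator) on docIdsByToken before flatMap should give the same result as map followed by flatten.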
Thanks
Luca