How to get item id from cosine similarity matrix?

I am using Spark Scala to calculate cosine similarity between the Dataframe rows.

Dataframe schema is below:

    root
     |-- itemId: string (nullable = true)
     |-- features: vector (nullable = true)

Sample of the dataframe below

    +-------+--------------------+
    | itemId|            features|
    +-------+--------------------+
    | ab    |[4.7143,0.0,5.785...|
    | cd    |[5.5,0.0,6.4286,4...|
    | ef    |[4.7143,1.4286,6....|
    ........
    +-------+--------------------+

Code to compute the cosine similarities:

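import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// IndexedRow takes the index first, then the vector
val irm = new IndexedRowMatrix(myDataframe.rdd.zipWithIndex().map {
      case (row, index) => IndexedRow(index, row.getAs[Vector]("features"))
}).toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities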

The irm matrix contains entries (i, j, score), where i and j are the row indexes of two items in my original dataframe. What I would like to get is (itemIdA, itemIdB, score), where itemIdA and itemIdB are the ids of the items at indexes i and j respectively. Should I join irm with the initial dataframe, or is there a better option?

Create a row index before converting the dataframe to a matrix, and build a mapping from index to id. After the computation, use that map to convert each column index (originally a row index, before the transpose) back to the id.

import spark.implicits._  // needed for .as[...]; assumes the SparkSession is named spark

val rdd = myDataframe.as[(String, org.apache.spark.mllib.linalg.Vector)].rdd.zipWithIndex()
val indexMap = rdd.map{case ((id, vec), index) => (index, id)}.collectAsMap()

Calculate the cosine similarities as before, using the indexed rdd:

val irm = new IndexedRowMatrix(rdd.map{case ((id, vec), index) => IndexedRow(index, vec)})
  .toCoordinateMatrix().transpose().toRowMatrix().columnSimilarities()

Convert column indices back to the ids:

irm.entries.map(e => (indexMap(e.i), indexMap(e.j), e.value))

This should give you what you are looking for.
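
Since the question also asks about joining, here is a minimal sketch of that variant. It may be worth considering when the index-to-id map is too large to collect on the driver, because collectAsMap() brings every id into driver memory. The object name SimilarityWithIds, the function similaritiesWithIds, the local SparkSession and the toy vectors are all made up for illustration; only the Spark calls themselves come from the snippets above or the RDD API.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SimilarityWithIds {

  // Hypothetical helper: returns (itemIdA, itemIdB, score) without collecting a map.
  def similaritiesWithIds(items: RDD[(String, Vector)]): RDD[(String, String, Double)] = {
    // Cache so the index assignment is computed once and reused by both branches.
    val indexed = items.zipWithIndex().cache()

    // (index, itemId) pairs kept as an RDD instead of a driver-side Map.
    val idByIndex = indexed.map { case ((id, _), index) => (index, id) }

    val sims = new IndexedRowMatrix(
      indexed.map { case ((_, vec), index) => IndexedRow(index, vec) }
    ).toCoordinateMatrix().transpose().toRowMatrix().columnSimilarities()

    sims.entries
      .map(e => (e.i, (e.j, e.value)))                          // key by i
      .join(idByIndex)                                          // attach id of item i
      .map { case (_, ((j, score), idA)) => (j, (idA, score)) } // re-key by j
      .join(idByIndex)                                          // attach id of item j
      .map { case (_, ((idA, score), idB)) => (idA, idB, score) }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("similarity-with-ids")
      .getOrCreate()

    // Toy stand-in for the (itemId, features) rows; the values are invented.
    val items = spark.sparkContext.parallelize(Seq(
      ("ab", Vectors.dense(1.0, 0.0, 2.0)),
      ("cd", Vectors.dense(2.0, 0.0, 4.0)),
      ("ef", Vectors.dense(0.0, 3.0, 0.0))
    ))

    similaritiesWithIds(items).collect().foreach(println)
    spark.stop()
  }
}
```

In the toy data the vectors for ab and cd are parallel, so their score should come out very close to 1.0. Note that columnSimilarities() returns an upper-triangular matrix, so each pair appears only once, with i < j. The two joins cost a shuffle each, so for a small number of items the collected map from the answer above is likely the simpler choice.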
