I am trying to use the Spark implementation of the ALS algorithm for recommendation systems, so I built the DataFrame below as training data:
| userId | itemId | rating |
|--------|--------|--------|
Now I would like to create a sparse matrix that represents the interactions between every user and every item. The matrix will be sparse because wherever there is no interaction between a user and an item, the corresponding value in the matrix is zero; in the end, most values will be zero.
But how can I achieve this using a CoordinateMatrix? I say CoordinateMatrix because I'm using Spark 2.1.1 with Python, and in the documentation I saw that a CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.
In other words, how can I get from this DataFrame to a CoordinateMatrix, where the rows would be users, the columns would be items and the ratings would be the values in the matrix?
A CoordinateMatrix is just a wrapper for an RDD of `MatrixEntry`s, and a `MatrixEntry` is just a wrapper over a `(long, long, float)` tuple. PySpark allows you to create a CoordinateMatrix from an RDD of such tuples. If the `userId` and `itemId` fields are both IntegerType and the `rating` is something like FloatType, then creating the desired matrix is very straightforward.
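For that case, a minimal sketch (assuming the training DataFrame is named `df` and has the columns shown in the question):

```python
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# Each DataFrame row becomes one (row, column, value) entry of the sparse matrix
mat = CoordinateMatrix(
    df.rdd.map(lambda row: MatrixEntry(row.userId, row.itemId, row.rating))
)
```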
It is only slightly more complicated if you have StringType `userId` and `itemId` fields: you would need to index those strings first and then pass the indices to the CoordinateMatrix, as sketched below.
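One way to do that indexing is with `StringIndexer` from `pyspark.ml`; a sketch, again assuming a DataFrame `df`, this time with string-typed `userId` and `itemId` columns:

```python
from pyspark.ml.feature import StringIndexer
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# Map each distinct string id to a numeric index (0.0, 1.0, ...)
indexed = StringIndexer(inputCol="userId", outputCol="userIndex").fit(df).transform(df)
indexed = StringIndexer(inputCol="itemId", outputCol="itemIndex").fit(indexed).transform(indexed)

# Use the numeric indices as the matrix coordinates
mat = CoordinateMatrix(
    indexed.rdd.map(
        lambda row: MatrixEntry(int(row.userIndex), int(row.itemIndex), row.rating)
    )
)
```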
With Spark 2.4.0, here is a whole example that I hope meets your need. Create the dataframe using a dictionary and pandas:
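The original code block was not preserved here, so the following is a hypothetical reconstruction with made-up ratings:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("als-example").getOrCreate()

# Made-up interaction data; only observed (user, item) pairs are stored
data = {
    "userId": [0, 0, 1, 1, 2],
    "itemId": [0, 1, 1, 2, 2],
    "rating": [4.0, 2.0, 3.0, 5.0, 1.0],
}
df = spark.createDataFrame(pd.DataFrame(data))
```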
See the dataframe:
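With the made-up data above, `df.show()` prints:

```python
df.show()
# +------+------+------+
# |userId|itemId|rating|
# +------+------+------+
# |     0|     0|   4.0|
# |     0|     1|   2.0|
# |     1|     1|   3.0|
# |     1|     2|   5.0|
# |     2|     2|   1.0|
# +------+------+------+
```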
Create a CoordinateMatrix from the dataframe:
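A sketch of the conversion; this one relies on the fact that CoordinateMatrix also accepts plain `(row, column, value)` tuples instead of explicit `MatrixEntry` objects:

```python
from pyspark.mllib.linalg.distributed import CoordinateMatrix

# Tuples of (row, column, value) are wrapped into MatrixEntry automatically
mat = CoordinateMatrix(
    df.rdd.map(lambda row: (row.userId, row.itemId, row.rating))
)
```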
Now check the data type of the result:
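With the sample data above, the type and dimensions come out as:

```python
print(type(mat))
# <class 'pyspark.mllib.linalg.distributed.CoordinateMatrix'>

# Dimensions are inferred from the largest row/column indices (here 3 x 3)
print(mat.numRows(), mat.numCols())
# 3 3
```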