I am attempting to use long user/product IDs in the ALS model in PySpark MLlib (1.3.1) and have run into an issue. A simplified version of the code is given here:
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating
sc = SparkContext("local", "test")
# Load and parse the data
d = [ "3661636574,1,1","3661636574,2,2","3661636574,3,3"]
data = sc.parallelize(d)
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(long(l[0]), long(l[1]), float(l[2])) )
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
Running this code yields a java.lang.ClassCastException, because the code is attempting to convert the longs to integers. Looking through the source code, the ml ALS class in Spark allows long user/product IDs, while the mllib ALS class forces the use of ints.
Question: Is there a workaround to use long user/product IDs in PySpark ALS?
This is a known issue (https://issues.apache.org/jira/browse/SPARK-2465), but it is unlikely to be fixed soon, because changing the interface to long userIds would slow down computation.
There are a few workarounds:
you can hash the userId to an int with the hash() function; since this only merges a few rows at random, collisions shouldn't really affect the accuracy of your recommender. There is a discussion in the first link. A sketch is given after this list.
you can generate unique int userIds with RDD.zipWithUniqueId(), or with the slower RDD.zipWithIndex(), just as in this thread: How to assign unique contiguous numbers to elements in a Spark RDD. A sketch follows the hashing one below.
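For the hashing route, here is a minimal sketch, assuming the raw IDs arrive as strings and that folding hash() into the non-negative signed 32-bit range is acceptable (the helper name to_int_id is illustrative only):

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("local", "test")

def to_int_id(raw_id):
    # fold an arbitrary ID into the non-negative signed 32-bit range;
    # occasional collisions merge a few rows but rarely hurt accuracy
    return hash(raw_id) % (2 ** 31 - 1)

d = ["3661636574,1,1", "3661636574,2,2", "3661636574,3,3"]
ratings = (sc.parallelize(d)
             .map(lambda l: l.split(','))
             .map(lambda l: Rating(to_int_id(l[0]), int(l[1]), float(l[2]))))

model = ALS.train(ratings, 10, 20)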
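And a minimal sketch of the zipWithUniqueId() route, assuming the set of distinct raw user IDs is small enough to collect as a lookup map on the driver (the names parsed and id_map are illustrative only):

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("local", "test")

d = ["3661636574,1,1", "3661636574,2,2", "3661636574,3,3"]
parsed = sc.parallelize(d).map(lambda l: l.split(','))

# assign each distinct raw user ID a unique small integer
user_ids = parsed.map(lambda l: l[0]).distinct().zipWithUniqueId()
id_map = user_ids.collectAsMap()  # assumes the distinct users fit on the driver

ratings = parsed.map(lambda l: Rating(id_map[l[0]], int(l[1]), float(l[2])))
model = ALS.train(ratings, 10, 20)

If the set of users is too large to collect, you can instead key parsed by the raw ID and join it against the zipWithUniqueId() RDD, which keeps the mapping distributed.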