I am attempting to use long user/product IDs in the ALS model in PySpark MLlib (1.3.1) and have run into an issue. A simplified version of the code is given here:
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating
sc = SparkContext("","test")
# Load and parse the data
d = [ "3661636574,1,1","3661636574,2,2","3661636574,3,3"]
data = sc.parallelize(d)
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(long(l[0]), long(l[1]), float(l[2])) )
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
Running this code yields a java.lang.ClassCastException
because the code is attempting to convert the longs to integers. Looking through the source code, the ml ALS class in Spark allows for long user/product IDs but then the mllib ALS class forces the use of ints.
Question: Is there a workaround to use long user/product IDs in PySpark ALS?