I have the following Python test code (the arguments to ALS.train
are defined elsewhere):
r1 = (2, 1)
r2 = (3, 1)
test = sc.parallelize([r1, r2])
model = ALS.train(ratings, rank, numIter, lmbda)
predictions = model.predictAll(test)
print test.take(1)
print predictions.count()
print predictions
This works: predictions has a count of 1, and the output is:
[(2, 1)]
1
ParallelCollectionRDD[2691] at parallelize at PythonRDD.scala:423
However, when I try to use an RDD I created myself with the following code, it no longer appears to work:
model = ALS.train(ratings, rank, numIter, lmbda)
validation_data = validation.map(lambda xs: tuple(int(x) for x in xs))
predictions = model.predictAll(validation_data)
print validation_data.take(1)
print predictions.count()
print validation_data
Which outputs:
[(61, 3864)]
0
PythonRDD[4018] at RDD at PythonRDD.scala:43
As you can see, predictAll comes back empty when passed the mapped RDD. The values going in are both of the same format. The only noticeable difference that I can see is that the first example uses parallelize and produces a ParallelCollectionRDD, whereas the second example just uses a map, which produces a PythonRDD. Does predictAll only work if passed a certain type of RDD? If so, is it possible to convert between RDD types? I'm not sure how to get this working.
There are two basic conditions under which MatrixFactorizationModel.predictAll may return an RDD with fewer items than the input: the user is missing from the training data, or the product is missing from the training data.
You can easily reproduce this behavior and check that it does not depend on how the RDD was created. First, let's use example data to build a model:
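A minimal sketch of this step, assuming a live SparkContext `sc` and a comma-separated `user,product,rating` file such as the sample that ships with Spark. The default path, the `rank` and `iterations` values, and the helper names are assumptions for illustration, not taken from the question:

```python
def parse_rating(line):
    """Parse one 'user,product,rating' line into an (int, int, float) tuple."""
    user, product, rating = line.split(",")
    return int(user), int(product), float(rating)


def train_example_model(sc, path="data/mllib/als/test.data", rank=10, iterations=10):
    """Load ratings from `path` and fit an MLlib ALS model (hypothetical helper)."""
    # Imported here so parse_rating stays usable without Spark on the path.
    from pyspark.mllib.recommendation import ALS, Rating

    ratings = sc.textFile(path).map(parse_rating).map(lambda t: Rating(*t))
    model = ALS.train(ratings, rank, iterations)
    return ratings, model
```

Returning `ratings` alongside the model keeps the training data at hand for the checks that follow.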
Next, let's see which products and users are present in the training data:
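One way to collect those ids, sketched under the assumption that `ratings` is an RDD of `Rating` objects or plain `(user, product, rating)` tuples and that the distinct ids fit in driver memory. The helper name `known_ids` is hypothetical:

```python
def known_ids(ratings):
    """Return the sets of user ids and product ids present in a ratings RDD.

    Works for Rating objects and plain (user, product, rating) tuples,
    since both can be indexed by position.
    """
    users = set(ratings.map(lambda r: r[0]).distinct().collect())
    products = set(ratings.map(lambda r: r[1]).distinct().collect())
    return users, products
```

Calling `users, products = known_ids(ratings)` gives two ordinary Python sets that any test pair can be checked against.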
Now let's create test data and check predictions:
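A small sketch of such a check, assuming `model` is a trained `MatrixFactorizationModel` and `pairs` is an RDD of `(user, product)` tuples. The helper name `prediction_coverage` is hypothetical:

```python
def prediction_coverage(model, pairs):
    """Return (number of input pairs, number of predictions returned).

    `model` is expected to be a MatrixFactorizationModel and `pairs`
    an RDD of (user, product) tuples.
    """
    return pairs.count(), model.predictAll(pairs).count()
```

When every user and product in `pairs` appeared in the training data, the two numbers should match; a smaller second number points at ids the model has never seen.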
So far so good. Next, let's map it using the same logic as in your code:
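The mapping itself can be checked in plain Python, without Spark. The sketch below applies the same conversion as the lambda in the question to a few hypothetical rows (the input values are an assumption about what `validation` might hold) and shows that the result is an ordinary tuple of ints, no different from one written by hand:

```python
def to_int_pair(xs):
    """Same conversion as the lambda passed to validation.map in the question."""
    return tuple(int(x) for x in xs)


# Hypothetical raw rows, e.g. fields split from a text file or float ids.
raw_rows = [("2", "1"), ["3", "1"], (61.0, 3864.0)]

converted = [to_int_pair(row) for row in raw_rows]
print(converted)  # [(2, 1), (3, 1), (61, 3864)]
```

Since the tuples are identical in type and value, the RDD class that carries them (ParallelCollectionRDD or PythonRDD) is unlikely to be what decides whether predictAll returns anything.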
Still fine. Next, let's create invalid data and repeat the experiment:
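To tell valid from invalid pairs up front, something along these lines could be used. It is a sketch that assumes `users` and `products` are plain Python sets of the ids seen in training and are small enough to ship inside the closure (for large sets a broadcast variable would be the usual choice). The helper name `split_by_coverage` is hypothetical:

```python
def split_by_coverage(pairs, users, products):
    """Split an RDD of (user, product) pairs by whether both ids were seen.

    `users` and `products` are plain Python sets of ids from the training data.
    Returns a (covered, uncovered) tuple of RDDs.
    """
    def is_covered(pair):
        return pair[0] in users and pair[1] in products

    covered = pairs.filter(is_covered)
    uncovered = pairs.filter(lambda pair: not is_covered(pair))
    return covered, uncovered
```

Applied to the data in the question, a check like this would show whether pairs such as (61, 3864) refer to a user or product that is absent from `ratings`.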
As expected, there are no predictions for the invalid input.
Finally, you can confirm this by using the ML model, whose training and prediction are completely independent of the Python code:
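A hedged sketch of that cross-check with the DataFrame-based `pyspark.ml.recommendation.ALS`. It assumes a `SparkSession` named `spark` (on older releases a `SQLContext` would play that role), and the column names and helper names are hypothetical. With its default settings, this API does not drop a pair with an unseen user or item but assigns it a NaN prediction, so "no prediction" shows up as NaN rather than as a missing row:

```python
import math


def ml_predictions(spark, train_rows, test_rows, rank=10, max_iter=10):
    """Fit pyspark.ml ALS on (user, item, rating) rows and score (user, item) rows.

    Hypothetical helper; `spark` is an existing SparkSession.
    """
    # Imported lazily so the NaN helper below works without Spark installed.
    from pyspark.ml.recommendation import ALS

    train = spark.createDataFrame(train_rows, ["user", "item", "rating"])
    test = spark.createDataFrame(test_rows, ["user", "item"])
    als = ALS(rank=rank, maxIter=max_iter,
              userCol="user", itemCol="item", ratingCol="rating")
    scored = als.fit(train).transform(test)
    return [(r["user"], r["item"], r["prediction"]) for r in scored.collect()]


def without_prediction(scored):
    """Return the (user, item) pairs whose prediction is NaN, i.e. missing."""
    return [(u, i) for u, i, p in scored if math.isnan(p)]
```

Passing the output of `ml_predictions` to `without_prediction` lists exactly the pairs for which the model had no matching user or item factors.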
As you can see, no corresponding user / item in the training data means no prediction.