I know that we can combine (like cbind in R) two RDDs in pyspark as below:
rdd3 = rdd1.zip(rdd2)
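For example, a quick sketch of what I mean (assuming an existing SparkContext sc):

from pyspark import SparkContext

sc = SparkContext(appName="zip-example")
rdd1 = sc.parallelize(["a", "b", "c"])
rdd2 = sc.parallelize([1, 2, 3])
rdd3 = rdd1.zip(rdd2)
print(rdd3.collect())   # [('a', 1), ('b', 2), ('c', 3)]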
I want to do the same for two DStreams in pyspark. Is it possible, or are there any alternatives?
In fact, I am using an MLlib random forest model to make predictions with Spark Streaming.
In the end, I want to combine the feature DStream and the prediction DStream for further downstream processing.
Thanks in advance.
-Obaid
In the end, I am using the approach below.
The trick is to use a native Python map together with Spark Streaming's transform.
It may not be an elegant way, but it works :).
def predictScore(texts, modelRF):
    # texts: DStream of raw text records; modelRF: trained MLlib RandomForestModel
    predictions = texts \
        .map(lambda txt: (txt, getFeatures(txt))) \
        .map(lambda tf: (tf[0], tf[1].split(','))) \
        .map(lambda tf: (tf[0], [float(i) for i in tf[1]])) \
        .transform(lambda rdd: sc.parallelize(
            # score each batch on the driver, then pair every score with its original text
            list(map(lambda score, txt: (score, txt),
                     modelRF.predict(rdd.map(lambda tf: tf[1])).collect(),
                     rdd.map(lambda tf: tf[0]).collect()))
        ))
    # inside the transform: tf[0] = original text, tf[1] = feature vector
    # each element of the returned DStream is a tuple of (score, original text)
    return predictions
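For reference, a minimal usage sketch (assuming an existing getFeatures function as above, a hypothetical socket text source on localhost:9999, and a toy training set just so a model exists; adapt the source and the training data to your own setup):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="rf-streaming-scoring")
ssc = StreamingContext(sc, 10)   # 10-second batches

# toy training data so that modelRF exists; replace with your real labeled features
training = sc.parallelize([LabeledPoint(0.0, [0.0, 1.0]),
                           LabeledPoint(1.0, [1.0, 0.0])])
modelRF = RandomForest.trainClassifier(training, numClasses=2,
                                       categoricalFeaturesInfo={}, numTrees=3)

lines = ssc.socketTextStream("localhost", 9999)   # hypothetical text source
scored = predictScore(lines, modelRF)             # DStream of (score, original text)
scored.pprint()

ssc.start()
ssc.awaitTermination()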
Hope it will help somebody who is facing the same problem.
If anybody has a better idea, please post it here.
-Obaid
Note: I also submitted the problem to the Spark user list and posted my answer there as well.