I've got a Spark Streaming job whose goal is to:
- read a batch of messages
- predict a variable Y given these messages using a pre-trained ML pipeline
The problem is, I'd like to be able to update the model used by the executors without restarting the application.
Simply put, here's what it looks like:
from pyspark.streaming.kafka import KafkaUtils

model = ...  # model initialization

def preprocess(keyValueList):
    ...  # do some preprocessing

def predict(preprocessedRDD):
    if not preprocessedRDD.isEmpty():
        df = ...  # create a DataFrame from the RDD
        df = model.transform(df)
        # more things to do

stream = KafkaUtils.createDirectStream(ssc, [kafkaTopic], kafkaParams)
stream.mapPartitions(preprocess).foreachRDD(predict)
In this case, the model is simply used, never updated.
I've thought of several possibilities, but I've crossed them all out:
- broadcasting the model every time it changes (a Broadcast variable is read-only, so it cannot be updated in place; see the sketch after this list)
- reading the model from HDFS on the executors (loading it needs the SparkContext, which isn't available there, so not possible)
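
For reference, the broadcast idea looks roughly like this (a minimal sketch, not code from the actual job; load_model() is a made-up placeholder for however the pipeline gets loaded, and it assumes the model object can even be serialized to the executors):

model_bc = sc.broadcast(load_model())  # broadcast once at startup

def predict(preprocessedRDD):
    if not preprocessedRDD.isEmpty():
        df = ...  # create a DataFrame from the RDD
        df = model_bc.value.transform(df)  # model_bc.value is read-only
        # more things to do

# The only option later is to unpersist and build a brand-new Broadcast;
# the value itself can never be mutated once it has been shipped:
model_bc.unpersist()
model_bc = sc.broadcast(load_model())

The HDFS option runs into the SparkContext issue mentioned above: PipelineModel.load() needs it, and it only exists on the driver.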
Any ideas?
Thanks a lot!