How to update an ML model during Spark Streaming

Posted 2019-07-23 02:47

I've got a Spark Streaming job whose goal is to:

  • read a batch of messages
  • predict a variable Y given these messages using a pre-trained ML pipeline

The problem is, I'd like to be able to update the model used by the executors without restarting the application.

Simply put, here's what it looks like:

from pyspark.streaming.kafka import KafkaUtils

model = # model initialization

def preprocess(keyValueList):
    # do some preprocessing, return an iterator of preprocessed records

def predict(preprocessedRDD):
    if not preprocessedRDD.isEmpty():
        df = # create df from rdd
        df = model.transform(df)
        # more things to do

stream = KafkaUtils.createDirectStream(ssc, [kafkaTopic], kafkaParams)

stream.mapPartitions(preprocess).foreachRDD(predict)

In this case, the model is simply used. Not updated.

I've considered several possibilities, but I've ruled them all out:

  • broadcasting the model every time it changes (a broadcast variable is read-only, so it cannot be updated in place)
  • reading the model from HDFS on the executors (loading it requires the SparkContext, which isn't available on executors)

Any ideas?

Thanks a lot!

2 Answers
Animai°情兽
#2 · 2019-07-23 03:03

The function you pass to foreachRDD is executed on the driver; only the RDD operations themselves are performed by the executors, so you don't need to serialize the model. This assumes you're using a Spark ML pipeline, which operates on DataFrames (as far as I know, they all do). Spark distributes the prediction work for you; you don't need to distribute it manually.
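For illustration, a minimal sketch of this idea (assuming the model is a pyspark.ml PipelineModel saved to HDFS; MODEL_PATH and the DataFrame creation are hypothetical placeholders to adapt to your job) could look like:

from pyspark.ml import PipelineModel

MODEL_PATH = "hdfs:///models/latest"  # hypothetical path written by the training job

def predict(preprocessedRDD):
    # foreachRDD calls this on the driver for every micro-batch, so the model
    # can be (re)loaded here; only transform()'s work is shipped to executors.
    model = PipelineModel.load(MODEL_PATH)
    if not preprocessedRDD.isEmpty():
        df = spark.createDataFrame(preprocessedRDD)  # build the DataFrame as in your job
        predictions = model.transform(df)
        # more things to do

Reloading on every batch is the simplest variant; the TTL approach in the next answer avoids hitting HDFS for every micro-batch.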

smile是对你的礼貌
#3 · 2019-07-23 03:25

I've solved this issue before in two different ways:

  • a TTL on the model
  • rereading the model on each batch

Both solutions assume a separate job that regularly retrains the model on the data you've accumulated (e.g. once a day).
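As an illustration of the TTL variant (a sketch only, assuming a pyspark.ml PipelineModel written to HDFS by the periodic training job; MODEL_PATH and the TTL value are hypothetical):

import time

from pyspark.ml import PipelineModel

MODEL_PATH = "hdfs:///models/latest"   # hypothetical output of the periodic training job
MODEL_TTL_SECONDS = 3600               # reload at most once per hour

_model = None
_loaded_at = 0.0

def get_model():
    # Called on the driver (e.g. at the start of predict()), where the
    # SparkContext is available; reloads the model once the TTL has expired.
    global _model, _loaded_at
    if _model is None or time.time() - _loaded_at > MODEL_TTL_SECONDS:
        _model = PipelineModel.load(MODEL_PATH)
        _loaded_at = time.time()
    return _model

Rereading the model on each batch is the same idea with the TTL set to zero; the trade-off is model freshness versus the cost of loading from HDFS.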
