How to update an ML model during Spark Streaming

Posted 2019-07-23 02:47

I've got a Spark Streaming job whose goal is to:

  • read a batch of messages
  • predict a variable Y given these messages using a pre-trained ML pipeline

The problem is, I'd like to be able to update the model used by the executors without restarting the application.

Simply put, here's what it looks like:

from pyspark.streaming.kafka import KafkaUtils

model = # model initialization

def preprocess(keyValueList):
    # do some preprocessing, return an iterator of preprocessed records

def predict(preprocessedRDD):
    if not preprocessedRDD.isEmpty():
        df = # create df from rdd
        df = model.transform(df)
        # more things to do

stream = KafkaUtils.createDirectStream(ssc, [kafkaTopic], kafkaParams)

stream.mapPartitions(preprocess).foreachRDD(predict)

In this case, the model is simply used. Not updated.

I've considered several possibilities, but I've ruled them all out:

  • broadcasting the model every time it changes (a broadcast variable is read-only, so it cannot be updated in place)
  • reading the model from HDFS on the executors (loading it requires the SparkContext, which isn't available on executors)

Any ideas?

Thanks a lot!

2 Answers
Animai°情兽
#2 · 2019-07-23 03:03

The function you pass to foreachRDD is executed on the driver; only the RDD operations themselves are performed by the executors, so you don't need to serialize the model. This assumes you're using a Spark ML pipeline, which operates on DataFrames (as far as I know, they all do). Spark distributes the prediction work for you; you don't need to distribute it manually.
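For illustration, a minimal sketch of this idea (assuming the model is a pyspark.ml PipelineModel saved to HDFS; MODEL_PATH and the DataFrame creation are hypothetical placeholders to adapt to your job) could look like:

from pyspark.ml import PipelineModel

MODEL_PATH = "hdfs:///models/latest"  # hypothetical path written by the training job

def predict(preprocessedRDD):
    # foreachRDD calls this on the driver for every micro-batch, so the model
    # can be (re)loaded here; only transform()'s work is shipped to executors.
    model = PipelineModel.load(MODEL_PATH)
    if not preprocessedRDD.isEmpty():
        df = spark.createDataFrame(preprocessedRDD)  # build the DataFrame as in your job
        predictions = model.transform(df)
        # more things to do

Reloading on every batch is the simplest variant; the TTL approach in the next answer avoids hitting HDFS for every micro-batch.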

smile是对你的礼貌
#3 · 2019-07-23 03:25

I've solved this issue before in two different ways:

  • a TTL on the model
  • rereading the model on each batch

Both solutions assume a separate job that regularly retrains the model on the data you've accumulated (e.g. once a day).
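As an illustration of the TTL variant (a sketch only, assuming a pyspark.ml PipelineModel written to HDFS by the periodic training job; MODEL_PATH and the TTL value are hypothetical):

import time

from pyspark.ml import PipelineModel

MODEL_PATH = "hdfs:///models/latest"   # hypothetical output of the periodic training job
MODEL_TTL_SECONDS = 3600               # reload at most once per hour

_model = None
_loaded_at = 0.0

def get_model():
    # Called on the driver (e.g. at the start of predict()), where the
    # SparkContext is available; reloads the model once the TTL has expired.
    global _model, _loaded_at
    if _model is None or time.time() - _loaded_at > MODEL_TTL_SECONDS:
        _model = PipelineModel.load(MODEL_PATH)
        _loaded_at = time.time()
    return _model

Rereading the model on each batch is the same idea with the TTL set to zero; the trade-off is model freshness versus the cost of loading from HDFS.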
