I've got a Spark Streaming job whose goal is to:
- read a batch of messages
- predict a variable Y given these messages using a pre-trained ML pipeline
The problem is, I'd like to be able to update the model used by the executors without restarting the application.
Simply put, here's what it looks like:
from pyspark.streaming.kafka import KafkaUtils

model = ...  # model initialization

def preprocess(keyValueList):
    ...  # do some preprocessing

def predict(preprocessedRDD):
    if not preprocessedRDD.isEmpty():
        df = ...  # create a DataFrame from the RDD
        df = model.transform(df)
        # more things to do

stream = KafkaUtils.createDirectStream(ssc, [kafkaTopic], kafkaParams)
stream.mapPartitions(preprocess).foreachRDD(predict)
In this case, the model is simply used, never updated.
I've thought of several possibilities, but I've crossed them all out:
- broadcasting the model every time it changes (a Broadcast variable is read-only, so it cannot be updated in place; see the sketch after this list)
- reading the model from HDFS on the executors (loading it needs the SparkContext, which isn't available there, so not possible)
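
For reference, the broadcast idea looks roughly like this (a minimal sketch, not code from the actual job; load_model() is a made-up placeholder for however the pipeline gets loaded, and it assumes the model object can even be serialized to the executors):

model_bc = sc.broadcast(load_model())  # broadcast once at startup

def predict(preprocessedRDD):
    if not preprocessedRDD.isEmpty():
        df = ...  # create a DataFrame from the RDD
        df = model_bc.value.transform(df)  # model_bc.value is read-only
        # more things to do

# The only option later is to unpersist and build a brand-new Broadcast;
# the value itself can never be mutated once it has been shipped:
model_bc.unpersist()
model_bc = sc.broadcast(load_model())

The HDFS option runs into the SparkContext issue mentioned above: PipelineModel.load() needs it, and it only exists on the driver.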
Any ideas?
Thanks a lot!