Spark Structured Streaming and Spark ML Regression

Posted 2019-01-15 23:00

Question:

Is it possible to apply Spark ML regression to streaming sources? I see there is StreamingLogisticRegressionWithSGD, but it's built on the older RDD API and I couldn't use it with Structured Streaming sources.

  1. How am I supposed to apply regression to Structured Streaming sources?
  2. (A little OT) If I cannot use the streaming API for regression, how can I commit offsets (or the like) back to the source when processing in batches? (Kafka sink)

Answer 1:

Today (Spark 2.2 / 2.3) there is no support for machine learning in Structured Streaming, and there is no ongoing work in this direction. Please follow SPARK-16424 to track future progress.

You can however:

  • Train iterative, non-distributed models using a foreach sink and some form of external state storage. At a high level, a regression model could be implemented like this (see the sketch after this list):

    • Fetch the latest model when ForeachWriter.open is called and initialize a loss accumulator (not in the Spark sense, just a local variable) for the partition.
    • Compute the loss for each record in ForeachWriter.process and update the accumulator.
    • Push the losses to the external store when ForeachWriter.close is called.
    • This leaves the external storage in charge of computing the gradient and updating the model, with the implementation depending on the store.
  • Try to hack SQL queries (see https://github.com/holdenk/spark-structured-streaming-ml by Holden Karau)
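Here is a minimal Scala sketch of the foreach-sink recipe from the first bullet, for linear regression with squared loss. The ModelStore object and its fetchWeights / pushGradient methods are hypothetical placeholders for whatever external storage you use (Redis, a database, ...), and the row layout (a features array plus a label column) is an assumption about your schema; only the ForeachWriter API itself is Spark's.

```scala
import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical client for the external state store; these two methods are
// assumptions standing in for your storage layer, not a real library API.
object ModelStore {
  def fetchWeights(): Array[Double] = ???                         // latest model parameters
  def pushGradient(grad: Array[Double], count: Long): Unit = ???  // partial gradient + record count
}

class RegressionWriter extends ForeachWriter[Row] {
  private var weights: Array[Double]  = _
  private var gradient: Array[Double] = _
  private var count: Long             = _

  override def open(partitionId: Long, version: Long): Boolean = {
    weights = ModelStore.fetchWeights()        // step 1: fetch the latest model
    gradient = Array.fill(weights.length)(0.0) // local accumulator, not a Spark accumulator
    count = 0L
    true
  }

  override def process(row: Row): Unit = {
    // step 2: accumulate the squared-loss gradient for one record
    val features   = row.getAs[Seq[Double]]("features").toArray
    val label      = row.getAs[Double]("label")
    val prediction = (features, weights).zipped.map(_ * _).sum
    val error      = prediction - label
    for (i <- gradient.indices) gradient(i) += error * features(i)
    count += 1
  }

  override def close(errorOrException: Throwable): Unit = {
    // step 3: hand the partial gradient to the store, which owns the
    // actual gradient step and model update
    if (errorOrException == null && count > 0) ModelStore.pushGradient(gradient, count)
  }
}

// Wire it into a streaming query:
// df.writeStream.foreach(new RegressionWriter).start()
```

Note that open/process/close run once per partition per trigger, so each micro-batch contributes one partial gradient per partition; the store has to aggregate them and decide when to take the update step.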