Is it possible to apply Spark ML regression to streaming sources? I see there is StreamingLogisticRegressionWithSGD, but it's for the older RDD API and I couldn't use it with Structured Streaming sources.
- How am I supposed to apply regressions on Structured Streaming sources?
- (A little OT) If I cannot use the streaming API for regression, how can I commit offsets or the like back to the source in a batch-processing fashion? (Kafka sink)
Today (Spark 2.2 / 2.3) there is no support for machine learning in Structured Streaming, and there is no ongoing work in this direction. Please follow SPARK-16424 to track future progress.
You can, however:
Train iterative, non-distributed models using the foreach sink and some form of external state storage. At a high level, a regression model could be implemented like this (see the sketch after this list):
- Fetch the latest model when calling ForeachWriter.open and initialize a loss accumulator (not in the Spark sense, just a local variable) for the partition.
- Compute the loss for each record in ForeachWriter.process and update the accumulator.
- Push losses to the external store when calling ForeachWriter.close.
- This would leave the external store in charge of computing the gradient and updating the model, with the implementation depending on the store.
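A minimal sketch of such a writer, assuming a hypothetical ModelStore client (stubbed out below) and an input Dataset with an array&lt;double&gt; "features" column and a double "label" column. As one variation on the steps above, the partition accumulates its partial gradient of the squared loss locally and pushes it in close, leaving the store to average the partials and apply the SGD step:

```scala
import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical client for the external state store; any key-value store
// with an atomic update primitive would do.
object ModelStore {
  def fetchLatestWeights(dim: Int): Array[Double] = Array.fill(dim)(0.0) // stub
  def pushPartialGradient(grad: Array[Double], count: Long): Unit = ()   // stub
}

class SGDRegressionWriter(dim: Int) extends ForeachWriter[Row] {
  private var weights: Array[Double] = _
  private var gradient: Array[Double] = _ // local accumulator, not a Spark Accumulator
  private var count: Long = 0L

  override def open(partitionId: Long, version: Long): Boolean = {
    weights = ModelStore.fetchLatestWeights(dim) // fetch the latest model
    gradient = Array.fill(dim)(0.0)
    count = 0L
    true
  }

  override def process(record: Row): Unit = {
    val features = record.getAs[Seq[Double]]("features").toArray
    val label = record.getAs[Double]("label")
    val error = weights.zip(features).map { case (w, x) => w * x }.sum - label
    // accumulate this partition's partial gradient of the squared loss
    for (i <- gradient.indices) gradient(i) += error * features(i)
    count += 1
  }

  override def close(errorOrNull: Throwable): Unit = {
    // push the partition's contribution; the store averages the partials
    // and applies the model update
    if (errorOrNull == null && count > 0)
      ModelStore.pushPartialGradient(gradient, count)
  }
}
```

Attaching it to a streaming query is then just `labeled.writeStream.foreach(new SGDRegressionWriter(dim = 3)).start()`.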
Try to hack SQL queries (see https://github.com/holdenk/spark-structured-streaming-ml by Holden Karau).
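As a toy illustration of that idea (not code from the linked repo): for a model with a closed-form solution, such as simple linear regression, you can fold the sufficient statistics into a plain streaming aggregation and recover the coefficients on every trigger. The rate source and the synthetic label below are placeholders for a real source such as Kafka:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("streaming-ols").getOrCreate()
import spark.implicits._

// Placeholder stream: built-in rate source with a synthetic label y = 2x + 1.
val input = spark.readStream.format("rate").load()
  .select($"value".cast("double").as("x"),
          ($"value" * 2.0 + 1.0).as("y"))

// Sufficient statistics of ordinary least squares as a streaming aggregation.
val stats = input.agg(
  count($"x").as("n"),
  sum($"x").as("sx"), sum($"y").as("sy"),
  sum($"x" * $"x").as("sxx"), sum($"x" * $"y").as("sxy"))

// Closed-form slope and intercept derived from the running statistics.
val model = stats
  .withColumn("slope", ($"n" * $"sxy" - $"sx" * $"sy") / ($"n" * $"sxx" - $"sx" * $"sx"))
  .withColumn("intercept", ($"sy" - $"slope" * $"sx") / $"n")
  .select($"slope", $"intercept")

// Complete mode re-emits the refreshed model on each trigger.
model.writeStream.outputMode("complete").format("console").start()
```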