Real-time data standardization / normalization wit

Posted 2019-05-28 16:25

Standardizing or normalizing data is an essential step when implementing machine learning algorithms. Doing so in real time with Spark Structured Streaming is a problem I have been trying to tackle for the past couple of weeks.

Using the StandardScaler estimator, (value(i) - mean) / standard deviation, on historical data worked well, and in my use case it gives the most reasonable clustering results. However, I'm not sure how to fit a StandardScaler model on real-time data, since Structured Streaming does not allow it. Any advice would be highly appreciated!
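
For reference, the standardization quoted above can be sketched in plain Python. This is only an illustration of the formula, not Spark's implementation: the names `fit_scaler` and `standardize` and the sample values are hypothetical, and the sketch assumes the sample standard deviation (n - 1 denominator).

```python
from statistics import mean, stdev

def fit_scaler(values):
    """Compute the statistics a standard scaler needs from a finite (historical) batch."""
    return mean(values), stdev(values)  # stdev uses the n - 1 denominator

def standardize(value, mu, sigma):
    """Apply (value - mean) / standard deviation with already-fitted statistics."""
    return (value - mu) / sigma

# Hypothetical historical readings, used only for illustration.
historical = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = fit_scaler(historical)
scaled = [standardize(v, mu, sigma) for v in historical]
```

Note that `fit_scaler` needs a full pass over a finite dataset before any value can be transformed, which is presumably why the fitting step does not map directly onto an unbounded stream.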

In other words, how can models be fitted in Spark Structured Streaming?

1 answer
This account has been banned
Reply #2 · 2019-05-28 17:03

I got an answer for this. At the moment it is not possible to do real-time machine learning with Spark Structured Streaming, including normalization; however, for some algorithms, making real-time predictions is possible if a model was built/fitted offline.

Check:

JIRA - Add support for Structured Streaming to the ML Pipeline API

Google DOC - Machine Learning on Structured Streaming
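
The "fit offline, predict in real time" pattern described in this answer can be sketched in plain Python. This is a rough illustration under stated assumptions, not Spark code: the function names, the historical values, and the two cluster centres are all hypothetical, and a plain iterator stands in for the stream.

```python
from statistics import mean, stdev

def fit_offline(history):
    """Fit scaling statistics once, on a finite historical dataset."""
    return mean(history), stdev(history)

def nearest_centroid(x, centroids):
    """Return the index of the closest 1-D centroid (in standardized units)."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))

def predict_stream(records, mu, sigma, centroids):
    """Lazily standardize each incoming record with frozen statistics and assign a cluster."""
    for value in records:
        z = (value - mu) / sigma
        yield value, z, nearest_centroid(z, centroids)

# Offline phase: hypothetical historical data and hypothetical cluster centres.
history = [10.0, 12.0, 11.0, 30.0, 31.0, 29.0]
mu, sigma = fit_offline(history)
centroids = [(11.0 - mu) / sigma, (30.0 - mu) / sigma]

# Online phase: an iterator stands in for the stream; nothing is re-fitted here.
incoming = iter([10.5, 29.5, 13.0])
assignments = [cluster for _, _, cluster in predict_stream(incoming, mu, sigma, centroids)]
```

In Spark terms, this would roughly correspond to fitting on a static DataFrame and then applying the fitted model's `transform` to the streaming one, for those algorithms that support it. The trade-off is that the statistics stay frozen until the model is re-fitted offline.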
