My current setup is:
- Spark 2.3.0 with pyspark 2.2.1
- a streaming service using Azure IoT Hub / Event Hub
- some custom Python functions based on pandas, matplotlib, etc.
I'm using https://github.com/Azure/azure-event-hubs-spark/blob/master/docs/PySpark/structured-streaming-pyspark-jupyter.md as an example of how to read the data, but:
- I can't use the foreach sink, as it is not implemented in Python
- when I try to call .rdd, .map or .flatMap I get the exception: "Queries with streaming sources must be executed with writeStream.start()"
What is the correct way to get each element of the stream and pass it through a Python function?
Thanks,
Ed
In the first step, you define a DataFrame that reads the data as a stream from your Event Hub or IoT Hub:
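A minimal sketch following the linked example; the connection string is a placeholder you have to replace with your own (note that newer versions of the connector expect it to be encrypted via EventHubsUtils.encrypt):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EventHubStream").getOrCreate()

# Placeholder -- substitute the connection string of your own
# Event Hub, or the Event Hub-compatible endpoint of your IoT Hub.
connectionString = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<key-name>;SharedAccessKey=<key>;EntityPath=<event-hub-name>"

ehConf = {"eventhubs.connectionString": connectionString}

# Streaming DataFrame: each row is one Event Hubs message,
# with the payload in the binary 'body' column.
df = (spark.readStream
      .format("eventhubs")
      .options(**ehConf)
      .load())
```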
The data is stored in binary form in the body attribute. To get at the elements of the body, you have to define its structure:
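For example, for a hypothetical payload with a name and two sensor readings (adjust the fields to whatever your devices actually send):

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical message structure -- replace with your own fields.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("temperature", DoubleType(), True),
    StructField("humidity", DoubleType(), True),
])
```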
and apply the schema to the body, cast as a string:
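Assuming the body is JSON, a sketch using from_json (parsed_df is just an illustrative name):

```python
from pyspark.sql.functions import col, from_json

# Cast the binary body to a string, parse it as JSON with the schema,
# and flatten the resulting struct into top-level columns.
parsed_df = (df
             .select(from_json(col("body").cast("string"), schema).alias("data"))
             .select("data.*"))
```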
On the resulting DataFrame you can apply functions, e.g. call the custom function u_make_hash on the column 'name':
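Here u_make_hash stands for your custom function wrapped as a UDF; the hash body below is a hypothetical stand-in. Note that the query still has to be started with writeStream.start() -- that is what the exception in the question is about:

```python
import hashlib

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Hypothetical implementation of the custom function.
def make_hash(value):
    if value is None:
        return None
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

u_make_hash = udf(make_hash, StringType())

result_df = parsed_df.withColumn("name_hash", u_make_hash(col("name")))

# Streaming DataFrames cannot be consumed with .rdd/.map/.flatMap;
# the query must be started through writeStream, e.g. to the console:
query = (result_df.writeStream
         .outputMode("append")
         .format("console")
         .start())
```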