I'm reading through this blog post:
It discusses using Spark Streaming and Apache Kafka to do some near-real-time processing. I understand the article completely; it shows how I could use Spark Streaming to read messages from a topic. Is there a Spark Streaming API that I can use to write messages into a Kafka topic?
My use case is pretty simple. I have a set of data that I can read from a given source at a constant interval (say, every second); I do this using reactive streams. I would like to do some analytics on this data using Spark, and I want fault tolerance, which is where Kafka comes into play. So what I would essentially do is the following (please correct me if I'm wrong):
- Using reactive streams, get the data from the external source at constant intervals
- Pipe the result into a Kafka topic
- Using Spark Streaming, create the streaming context for the consumer
- Perform analytics on the consumed data (see the sketch after this list)
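For steps 3 and 4, here's roughly what I have in mind, using the Kafka direct stream API in Spark 1.5. The broker address, the topic name "metrics", and the per-batch count standing in for real analytics are all placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ConsumerSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-analytics-sketch")
    // 1-second batches, matching the 1-second ingestion interval above
    val ssc = new StreamingContext(conf, Seconds(1))

    // Assumed broker address and topic name
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = Set("metrics")

    // Direct stream (no receivers): Spark tracks the Kafka offsets itself
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Placeholder analytics: count the records in each micro-batch
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```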
One other question, though: is the Streaming API in Spark an implementation of the Reactive Streams specification? Does it have back-pressure handling (Spark Streaming v1.5)?
Spark Streaming is not an implementation of the Reactive Streams specification, but Spark Streaming 1.5 does have internal back-pressure-based dynamic throttling. There's some work underway to extend that beyond throttling in the pipeline. This throttling is compatible with the Kafka direct stream API.
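Enabling it is a one-line configuration change. A minimal sketch; the flag is the real one, while the app name, batch interval, and rate cap are arbitrary examples:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("backpressure-example")
  // Turns on the back-pressure rate estimator (off by default in 1.5)
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional cap applied per Kafka partition by the direct stream,
  // useful for the first batches before the estimator has feedback
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val ssc = new StreamingContext(conf, Seconds(1))
```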
You can write to Kafka in a Spark Streaming application; here's one example.
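Since there is no built-in Kafka output operation, the usual pattern is foreachRDD plus foreachPartition, creating the producer on the executors so it never has to be serialized from the driver. A sketch, assuming string keys and values; the helper name and connection settings are placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper: write every record of each micro-batch to Kafka
def writeToKafka(results: DStream[String], topic: String, brokers: String): Unit = {
  results.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // The producer is created here, on the executor, because
      // KafkaProducer is not serializable
      val props = new Properties()
      props.put("bootstrap.servers", brokers)
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      partition.foreach(msg => producer.send(new ProducerRecord[String, String](topic, msg)))
      producer.close()
    }
  }
}
```

Opening a producer per partition per batch is wasteful; in a real job you'd typically keep a lazily initialized producer per JVM and reuse it across batches.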
(Full disclosure: I'm one of the implementers of some of the back-pressure work)
You can look at the link here for a number of examples of how to do it.