Hi all,
I have a requirement wherein I need to re-ingest some of my older data. We have a multi-stage pipeline whose source is a Kafka topic. Once a record is fed into that topic, it runs through a series of steps (about 10). Each step massages the original JSON object pushed to the source topic and pushes the result to a destination topic.
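For context, each step is essentially its own small Kafka Streams app. Stripped down, a single step looks roughly like this (the topic names, the `massage` function, and the config values are just illustrative stand-ins, not our real setup):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class StepThree {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("step-3-input");
        input.mapValues(StepThree::massage)   // transform the JSON payload
             .to("step-3-output");            // hand off to the next step's topic

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "step-3");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }

    // Stand-in for the step's real work (calls to external services etc.).
    private static String massage(String json) {
        return json;
    }
}
```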
Now, sometimes we need to re-ingest older data and apply only a subset of the steps described above. We intend to push these re-ingest records to a different topic so as not to block the "live" data coming through. This might mean I need to apply just 1 of the 10 steps. Running the data through the entire pipeline is wasteful, as each step is fairly resource-intensive and calls multiple external services. Also, I might need to re-ingest millions of entries at a time, so I might choke those external services. Lastly, these re-ingestion activities are not that frequent and may happen just once a week at times.
Let's say I am able to figure out which of the steps above I need to execute; that can be done via a basic rules engine. Once that's done, I need to be able to dynamically create a topology and deploy it so that it starts processing from the newly created topic. Again, the reason I want to deploy at runtime is that these activities, albeit business critical, don't happen that frequently, and the steps I need to execute might change every time, so we can't keep the entire pipeline running permanently. A sketch of what I'm imagining is below.
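To make the question concrete, this is roughly the shape of it, though I'm not sure it's the right approach. Everything here is hypothetical: the `STEPS` map, the topic names, and the assumption that each step can be expressed as a value transformation:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.function.UnaryOperator;

public class ReingestRunner {

    // Each named step as a JSON-in/JSON-out function (placeholders here;
    // the real ones would wrap the external service calls).
    private static final Map<String, UnaryOperator<String>> STEPS = Map.of(
            "step1", json -> json,
            "step4", json -> json,
            "step7", json -> json);

    public static KafkaStreams buildAndStart(List<String> selectedSteps) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> stream = builder.stream("reingest-input");

        // Chain only the steps the rules engine picked, in order.
        for (String name : selectedSteps) {
            UnaryOperator<String> step = STEPS.get(name);
            stream = stream.mapValues(step::apply);
        }
        stream.to("reingest-output");

        Properties props = new Properties();
        // Unique application.id per re-ingest job so runs don't collide.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG,
                  "reingest-" + String.join("-", selectedSteps));
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();  // tear down with streams.close() once the backlog is drained
        return streams;
    }
}
```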
Is there a way to achieve this? Or am I even thinking in the right direction, i.e., is the approach I outlined above correct? Any pointers would be helpful.