Hi all,
I have a requirement wherein I need to re-ingest some of my older data. We have a multi-stage pipeline whose source is a Kafka topic. Once a record is fed into that topic, it runs through a series of steps (about 10). Each step massages the original JSON object pushed to the source topic and pushes the result to a destination topic.
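For context, each step is essentially its own small Kafka Streams app. Stripped down, a single step looks roughly like this (the topic names, the `massage` function, and the config values are just illustrative stand-ins, not our real setup):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class StepThree {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("step-3-input");
        input.mapValues(StepThree::massage)   // transform the JSON payload
             .to("step-3-output");            // hand off to the next step's topic

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "step-3");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }

    // Stand-in for the step's real work (calls to external services etc.).
    private static String massage(String json) {
        return json;
    }
}
```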
Now, sometimes we need to re-ingest older data and apply only a subset of the steps described above. We intend to push these re-ingest records to a different topic so as not to block the "live" data coming through. This might mean I need to apply just 1 of the 10 steps. Running the data through the entire pipeline is wasteful, as each step is fairly resource-intensive and calls multiple external services. Also, I might need to re-ingest millions of entries at a time, so I might choke those external services. Lastly, these re-ingestion activities are not that frequent and may happen just once a week at times.
Let's say I am able to figure out which of the steps above I need to execute; that can be done via a basic rules engine. Once that's done, I need to be able to dynamically create a topology and deploy it so that it starts processing from the newly created topic. Again, the reason I want to deploy at runtime is that these activities, albeit business critical, don't happen that frequently, and the steps I need to execute might change every time, so we can't keep the entire pipeline running permanently. A sketch of what I'm imagining is below.
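To make the question concrete, this is roughly the shape of it, though I'm not sure it's the right approach. Everything here is hypothetical: the `STEPS` map, the topic names, and the assumption that each step can be expressed as a value transformation:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.function.UnaryOperator;

public class ReingestRunner {

    // Each named step as a JSON-in/JSON-out function (placeholders here;
    // the real ones would wrap the external service calls).
    private static final Map<String, UnaryOperator<String>> STEPS = Map.of(
            "step1", json -> json,
            "step4", json -> json,
            "step7", json -> json);

    public static KafkaStreams buildAndStart(List<String> selectedSteps) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> stream = builder.stream("reingest-input");

        // Chain only the steps the rules engine picked, in order.
        for (String name : selectedSteps) {
            UnaryOperator<String> step = STEPS.get(name);
            stream = stream.mapValues(step::apply);
        }
        stream.to("reingest-output");

        Properties props = new Properties();
        // Unique application.id per re-ingest job so runs don't collide.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG,
                  "reingest-" + String.join("-", selectedSteps));
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();  // tear down with streams.close() once the backlog is drained
        return streams;
    }
}
```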
Is there a way to achieve this? Or am I even thinking in the right direction, i.e., is the approach I outlined above correct? Any pointers would be helpful.