My goal is to benchmark the latency and throughput of Apache Beam on a streaming data use case with different window queries.
I want to create my own data with an on-the-fly data generator so I can control the generation rate manually, and consume that data directly in a pipeline without a pub/sub mechanism, i.e. I don't want to read the data from a broker, to avoid bottlenecks. Is there a way to achieve something like this, or is there any source code for such a use case with the Beam SDKs? So far I couldn't find a starting point: the existing code samples use a pub/sub mechanism and assume the data comes from somewhere.
Thanks in advance for any suggestions.
With regards to on-the-fly data, one option would be to make use of GenerateSequence, for example:
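A minimal sketch (the class name and the rate of 100 elements per second are my own choices, not from the original answer) of an unbounded source built from GenerateSequence:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class OnTheFlySource {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Unbounded stream of increasing Longs, emitted at a controlled rate:
    // here, 100 elements per second. Adjust withRate(...) to vary the load.
    PCollection<Long> longs =
        p.apply(GenerateSequence.from(0).withRate(100, Duration.standardSeconds(1)));

    p.run().waitUntilFinish();
  }
}
```

Dropping `withRate(...)` and adding `.to(n)` instead gives a bounded sequence, which is handy for smoke-testing the same pipeline.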
To create other types of objects, you can add a ParDo afterwards that consumes each Long and turns it into something else:
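For instance, a DoFn along these lines (the `"Event-" + i` formatting is purely illustrative, not from the original answer) could map each generated Long to a string event:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class LongToEvent {
  // Turns each generated Long into a simple event string.
  static class FormatFn extends DoFn<Long, String> {
    @ProcessElement
    public void processElement(@Element Long i, OutputReceiver<String> out) {
      out.output("Event-" + i);
    }
  }

  // Continuing from the PCollection<Long> produced by GenerateSequence:
  static PCollection<String> toEvents(PCollection<Long> longs) {
    return longs.apply(ParDo.of(new FormatFn()));
  }
}
```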
That should generate values like:
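Assuming a formatting ParDo like the one described (mapping each Long `n` to `"Event-" + n`, which is my own illustrative format), the stream would look something like:

```
Event-0
Event-1
Event-2
...
```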
Some other things to keep in mind for your pipeline's performance testing:
The Direct runner is designed for unit testing; it does cool things like simulating failures, which helps catch issues that would otherwise only show up in a production pipeline. It is not designed to help with performance testing, however. I would recommend always using one of the main runners for those types of integration tests.
Please be aware of the fusion optimization (Link to Docs): when using an artificial data source like GenerateSequence, you may need to do a GBK (GroupByKey) as the next step to allow the work to be parallelized. For the Dataflow runner, more info can be found here: Link to Docs
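One way to break fusion after GenerateSequence is to key, group, and flatten back out; a sketch (the 10-key space is an arbitrary choice of mine):

```java
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BreakFusion {
  // Redistributes elements across workers by grouping on an
  // arbitrary key, then discarding the grouping again.
  static PCollection<Long> breakFusion(PCollection<Long> longs) {
    return longs
        .apply(WithKeys.of((Long i) -> i % 10)        // arbitrary 10-key space
            .withKeyType(TypeDescriptors.longs()))
        .apply(GroupByKey.create())                   // forces a shuffle boundary
        .apply(Values.create())                       // PCollection<Iterable<Long>>
        .apply(Flatten.iterables());                  // back to PCollection<Long>
  }
}
```

Note that on an unbounded source you will need a windowing strategy before the GroupByKey; newer SDK versions also package essentially this pattern as `Reshuffle.viaRandomKey()`.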
In general for performance testing, I would recommend testing the whole end-to-end pipeline. There are interactions with sources and sinks (for example, watermarks) which will not be exercised in a standalone pipeline.
Hope that helps.