I have a directory on GCS or another supported filesystem to which new files are being written by an external process.
I would like to write an Apache Beam streaming pipeline that continuously watches this directory for new files and reads and processes each new file as it arrives. Is this possible?
This is possible starting with Apache Beam 2.2.0. Several APIs support this use case:
If you're using TextIO or AvroIO, they support this explicitly via TextIO.read().watchForNewFiles(), and the same on readAll().
If you're using a different file format, you may use FileIO.match().continuously() and FileIO.matchAll().continuously(), which support the same API, in combination with FileIO.readMatches().
The APIs support specifying how often to check for new files, and when to stop checking (supported conditions are e.g. "if no new output appears within a given time", "after observing N outputs", "after a given time since starting to check" and their combinations).
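
As a rough sketch of both approaches, something like the following should work. The class name, the bucket path, the 30-second polling interval and the one-hour idle timeout are all placeholders I made up for illustration; only the TextIO, FileIO and Watch.Growth calls come from the Beam Java SDK.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class WatchNewFiles {
  // Hypothetical polling interval: how often to check for new files.
  static final Duration POLL_INTERVAL = Duration.standardSeconds(30);

  /** TextIO variant: emits the lines of every matching file, including files that appear later. */
  static PCollection<String> readNewLines(Pipeline p, String filepattern) {
    return p.apply(
        "ReadNewTextFiles",
        TextIO.read()
            .from(filepattern)
            // Never stop checking for new files.
            .watchForNewFiles(POLL_INTERVAL, Watch.Growth.<String>never()));
  }

  /** FileIO variant for other formats: emits the name of each newly matched file. */
  static PCollection<String> matchNewFiles(Pipeline p, String filepattern) {
    return p.apply(
            "MatchNewFiles",
            FileIO.match()
                .filepattern(filepattern)
                // Stop checking if no new file shows up for one hour (assumed value).
                .continuously(
                    POLL_INTERVAL,
                    Watch.Growth.<String>afterTimeSinceNewOutput(Duration.standardHours(1))))
        .apply("OpenNewFiles", FileIO.readMatches())
        // Replace this step with your own parsing of the ReadableFile contents.
        .apply(
            "FileNames",
            MapElements.into(TypeDescriptors.strings())
                .via((FileIO.ReadableFile f) -> f.getMetadata().resourceId().toString()));
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // Placeholder path; point this at the directory the external process writes to.
    readNewLines(p, "gs://my-bucket/incoming/*.txt");
    p.run().waitUntilFinish();
  }
}
```

With Watch.Growth.never() the pipeline keeps polling until it is cancelled, so waitUntilFinish() will not return on its own. The FileIO variant hands you a ReadableFile per new file, which is where a custom format parser would go.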
Note that this feature currently works only in the Direct runner and the Dataflow runner, and only in the Java SDK. In general, it will work in any runner that supports Splittable DoFn (see capability matrix).