I want to read a rolling window of the past 30 days into my pipeline. For example, on Jan 15, 2017, I want to read:
> gs://bucket/20170115/*
> gs://bucket/20170114/*
>.
>.
>.
> gs://bucket/20161216/*
This says that the glob patterns "*", "?", and "[..]" are supported.
There is a similar question, but it has no good example.
I am trying to avoid running 30 TextIO.Read steps and then flattening all the PCollections into one, because this causes hot shards in the pipeline.
There is one glob pattern generation function here.
When reading files from GCS, TextIO supports the same wildcard patterns as GCS, described here: Wildcard Names.
In the answer for the question you linked, bullet #2 suggests forming a small number of globs to represent your full range:
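As a rough illustration of that idea, here is a minimal sketch that collapses a date range into one glob per "yyyyMMd" prefix, using a "[lo-hi]" character range for the final digit. The class name `DateGlobs`, the method `compactGlobs`, and the `gs://bucket` root are hypothetical names chosen for this example. The sketch also assumes the directory layout from the question and that character ranges in "[..]" behave as they do in GCS wildcards.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Beam: builds a small set of globs for a date range.
class DateGlobs {
    // BASIC_ISO_DATE formats a LocalDate as yyyyMMdd, e.g. 20170115.
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

    /** Returns one glob per distinct "yyyyMMd" prefix, covering oldest..newest inclusive. */
    static List<String> compactGlobs(String root, LocalDate oldest, LocalDate newest) {
        // Maps each 7-character prefix to {lowest, highest} final digit seen.
        Map<String, char[]> ranges = new LinkedHashMap<>();
        for (LocalDate d = oldest; !d.isAfter(newest); d = d.plusDays(1)) {
            String s = d.format(DAY);
            String prefix = s.substring(0, 7);
            char last = s.charAt(7);
            char[] r = ranges.get(prefix);
            if (r == null) {
                ranges.put(prefix, new char[] {last, last});
            } else {
                r[1] = last; // days are visited in ascending order
            }
        }
        List<String> globs = new ArrayList<>();
        for (Map.Entry<String, char[]> e : ranges.entrySet()) {
            char lo = e.getValue()[0];
            char hi = e.getValue()[1];
            // A single day needs no character range.
            String digit = (lo == hi) ? String.valueOf(lo) : "[" + lo + "-" + hi + "]";
            globs.add(root + "/" + e.getKey() + digit + "/*");
        }
        return globs;
    }

    public static void main(String[] args) {
        // The range listed in the question: 20161216 through 20170115.
        for (String g : compactGlobs("gs://bucket",
                LocalDate.of(2016, 12, 16), LocalDate.of(2017, 1, 15))) {
            System.out.println(g);
        }
    }
}
```

For the range in the question this produces five globs instead of one pattern per day, so only five reads would need to be flattened.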
TextIO also has a new API, readAll(), which allows you to specify input files dynamically as data. This allows you to pass in the exact set of filenames you need.

The new TextIO.readAll() API has not yet been released, but you can build from master by specifying the Beam artifact version 2.2.0-SNAPSHOT. The 2.2.0 release is in progress and should be available sometime in September.
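Since readAll() takes the file patterns as data, the exact list for the window can be computed up front. The following is a minimal sketch using only the JDK. The class name `DailyPatterns`, the method `lastDays`, and the `gs://bucket` root are hypothetical, and the directory layout is assumed to match the one in the question.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Beam: lists one file pattern per day.
class DailyPatterns {
    /** Returns patterns for `days` consecutive days, newest first, ending at `newest`. */
    static List<String> lastDays(String root, LocalDate newest, int days) {
        List<String> patterns = new ArrayList<>();
        for (int i = 0; i < days; i++) {
            // BASIC_ISO_DATE formats a LocalDate as yyyyMMdd.
            String day = newest.minusDays(i).format(DateTimeFormatter.BASIC_ISO_DATE);
            patterns.add(root + "/" + day + "/*");
        }
        return patterns;
    }

    public static void main(String[] args) {
        for (String p : lastDays("gs://bucket", LocalDate.of(2017, 1, 15), 30)) {
            System.out.println(p);
        }
    }
}
```

With 30 days the oldest pattern is gs://bucket/20161217/*, so passing 31 would be needed to reach the 20161216 directory listed in the question. With Beam 2.2.0 or later on the classpath, a list like this could be turned into a PCollection of strings with Create.of(patterns) and then passed to TextIO.readAll(). That step is left out here so that the sketch runs without the Beam SDK.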