Our requirement is to process the last 24 hours of adserving logs that Google DFP writes directly to our GCS bucket.
We currently do this with a Flatten, passing in all the file names for the last 24 hours. The file names are in yyyyMMdd_hh format.
But we've found that DFP sometimes fails to write the file for one or more of those hours. We've raised that issue with the DFP team.
In the meantime, is there a way to configure our Dataflow job to ignore any missing GCS files and not fail in that case? It currently fails if one or more of the files don't exist.
Maybe not the best answer, but you can always use the GCS API to grab the list of files which actually exist before building the pipeline. Then you can create the Flatten accordingly.
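For illustration, here is a rough sketch of that idea, assuming the google-cloud-storage client library for listing objects and the Dataflow Java SDK for building one source per existing file; the bucket name, object prefix, and date are made up:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Flatten;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class ReadExistingDfpFiles {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Ask GCS which of the hourly files actually exist under the day's prefix.
    Storage storage = StorageOptions.getDefaultInstance().getService();
    PCollectionList<String> pieces = PCollectionList.empty(p);
    for (Blob blob : storage.list("my-dfp-bucket",
        Storage.BlobListOption.prefix("logs/20160624_")).iterateAll()) {
      // One TextIO source per file that was found; missing hours are simply skipped.
      pieces = pieces.and(
          p.apply(TextIO.Read.named("Read " + blob.getName())
                             .from("gs://my-dfp-bucket/" + blob.getName())));
    }

    // Flatten the per-file PCollections into a single input PCollection,
    // then apply your processing to allLines as usual.
    PCollection<String> allLines = pieces.apply(Flatten.<String>pCollections());

    p.run();
  }
}
```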
Waiting for more professional answers.
Using Dataflow APIs like TextIO.Read or AvroIO.Read to read from a non-existent file will, of course, throw an error and cause the pipeline to fail. This is working as intended, and I cannot think of a workaround at that level.

Now, reading from a filepattern like yyyyMMdd_* may solve your problem, at least partially. Dataflow will expand the filepattern into a set of files and process them. As long as at least one file matching the pattern exists, the pipeline should proceed (a sketch of this follows below).

The approach of having one source per file is often an anti-pattern: it is less efficient and less elegant, but functionally the same. Nevertheless, you can still fix it by using the Google Cloud Storage API before constructing your Dataflow pipeline to confirm the presence of each file. If an input file is not present, you can simply skip generating that source.
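For illustration, a minimal sketch of the filepattern approach with the Dataflow Java SDK; the bucket name, object prefix, and date are placeholders, not anything from your setup:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class ReadByFilepattern {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // One wildcard pattern per day; Dataflow expands it to whatever files exist,
    // so a missing hour contributes nothing instead of failing the job
    // (as long as at least one file matches the pattern).
    PCollection<String> lines =
        p.apply(TextIO.Read.named("ReadDfpLogs")
                           .from("gs://my-dfp-bucket/logs/20160624_*"));

    p.run();
  }
}
```

If your 24-hour window spans two calendar days, you would read two such patterns, one per day, and Flatten the results.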
Either way, please keep in mind the eventual consistency guarantee provided by the GCS list API. This means that expanding a filepattern may not immediately include all files that would otherwise be readable. The anti-pattern may be a good workaround for this case, however.