Our requirement is to process the last 24 hours of adserving logs that Google DFP writes directly to our GCS bucket.
We currently do this with a Flatten, passing in all the file names for the last 24 hours. The file names are in yyyyMMdd_hh format.
But we've found that DFP sometimes fails to write the file for one or more of those hours. We've raised that issue with the DFP team.
In the meantime, is there a way to configure our Dataflow job to ignore any missing GCS files and not fail in that case? It currently fails if one or more of the files don't exist.
Maybe not the best answer, but you can always use the GCS API to grab the list of files which actually exist before building the pipeline. Then you can create the Flatten accordingly.
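For illustration, here is a rough sketch of that idea, assuming the google-cloud-storage client library for listing objects and the Dataflow Java SDK for building one source per existing file; the bucket name, object prefix, and date are made up:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Flatten;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class ReadExistingDfpFiles {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Ask GCS which of the hourly files actually exist under the day's prefix.
    Storage storage = StorageOptions.getDefaultInstance().getService();
    PCollectionList<String> pieces = PCollectionList.empty(p);
    for (Blob blob : storage.list("my-dfp-bucket",
        Storage.BlobListOption.prefix("logs/20160624_")).iterateAll()) {
      // One TextIO source per file that was found; missing hours are simply skipped.
      pieces = pieces.and(
          p.apply(TextIO.Read.named("Read " + blob.getName())
                             .from("gs://my-dfp-bucket/" + blob.getName())));
    }

    // Flatten the per-file PCollections into a single input PCollection,
    // then apply your processing to allLines as usual.
    PCollection<String> allLines = pieces.apply(Flatten.<String>pCollections());

    p.run();
  }
}
```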
Waiting for more professional answers.
Using Dataflow APIs like TextIO.Read or AvroIO.Read to read from a non-existent file will, of course, throw an error and cause the pipeline to fail. This is working as intended, and I cannot think of a workaround at that level.

Now, reading from a filepattern like yyyyMMdd_* may solve your problem, at least partially. Dataflow will expand the filepattern into a set of files and process them. As long as at least one file matching the pattern exists, the pipeline should proceed (a sketch of this follows below).

The approach of having one source per file is often an anti-pattern: it is less efficient and less elegant, but functionally the same. Nevertheless, you can still fix it by using the Google Cloud Storage API before constructing your Dataflow pipeline to confirm the presence of each file. If an input file is not present, you can simply skip generating that source.
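For illustration, a minimal sketch of the filepattern approach with the Dataflow Java SDK; the bucket name, object prefix, and date are placeholders, not anything from your setup:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class ReadByFilepattern {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // One wildcard pattern per day; Dataflow expands it to whatever files exist,
    // so a missing hour contributes nothing instead of failing the job
    // (as long as at least one file matches the pattern).
    PCollection<String> lines =
        p.apply(TextIO.Read.named("ReadDfpLogs")
                           .from("gs://my-dfp-bucket/logs/20160624_*"));

    p.run();
  }
}
```

If your 24-hour window spans two calendar days, you would read two such patterns, one per day, and Flatten the results.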
Either way, please keep in mind the eventual consistency guarantee provided by the GCS list API. This means that expanding a filepattern may not immediately include all files that would otherwise be readable. The anti-pattern may be a good workaround for this case, however.