“gsutil ls” shows a different list every time

Posted 2019-07-27 06:21

Question:

We are using GCS as the data sink of a Dataflow pipeline, and for some reason the output directory shows a different list of files every time I run "gsutil ls" on it. The number of files should be exactly 4,000, since the pipeline was configured to shard its output into 4,000 files. Instead, the list I see contains some of those 4,000 files ($prefix-?????-of-04000) and some of the temp files ($prefix-temp-*). It has been more than 10 hours since the Dataflow job (2016-12-18_19_30_32-7274262445792076535) completed, and I am still seeing different file lists. The list is not just growing: sometimes it shrinks, meaning some files disappear and then appear again. This is affecting other Dataflow pipelines we run that read from this directory.

Is this a Dataflow issue or a GCS issue, and how can we resolve it? I have seen this behavior from GCS before, but usually only for the first few minutes after a Dataflow pipeline completed. This time it has been going on for much longer.

Answer 1:

GCS's list operation is eventually consistent. This means that, for a period of time after objects are written or deleted, listing a bucket may return only part of its contents.

If you look at a specific file from the 4000, is it consistently there?
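One way to make that check systematic is to build the 4,000 expected shard names from the $prefix-?????-of-04000 pattern and compare them against whatever a listing returned. The following is only a minimal sketch: the function names are hypothetical, the prefix "out" is a placeholder, and the listing is assumed to have been reduced to bare object names (for example, the output of "gsutil ls" with the gs://bucket/dir/ part stripped).

```python
def expected_shards(prefix, num_shards):
    """Build the shard names the pipeline should have written.

    Assumes the zero-padded, five-digit naming shown in the question:
    <prefix>-00000-of-04000 ... <prefix>-03999-of-04000.
    """
    return [f"{prefix}-{i:05d}-of-{num_shards:05d}" for i in range(num_shards)]


def compare_listing(listed, prefix, num_shards):
    """Compare one listing against the expected shard names.

    Returns three sorted lists: expected shards that are absent from the
    listing, leftover temp files, and any other unexpected names.
    """
    expected = set(expected_shards(prefix, num_shards))
    listed = set(listed)
    missing = sorted(expected - listed)
    temp = sorted(n for n in listed if n.startswith(f"{prefix}-temp-"))
    unexpected = sorted(listed - expected - set(temp))
    return missing, temp, unexpected


if __name__ == "__main__":
    # Simulated listing: two shards missing and one temp file still present.
    listing = expected_shards("out", 4000)[:3998] + ["out-temp-abc"]
    missing, temp, unexpected = compare_listing(listing, "out", 4000)
    print(len(missing), "missing;", len(temp), "temp;", len(unexpected), "other")
```

Running this against several successive listings shows whether the set of missing names changes between calls. Any name reported as missing can then be looked up directly (for example with "gsutil stat"), which goes through a per-object read rather than the list operation.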

Update: There was a temporary issue with GCS that caused inconsistent results for bucket listings: https://status.cloud.google.com/incident/storage/16036