TextIO: Read multiple files from GCS using a pattern

Posted 2019-04-12 19:38

Question:

I tried the following:

TextIO.Read.from("gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv")

That pattern didn't work; I get:

java.lang.IllegalStateException: Unable to find any files matching StaticValueProvider{value=gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv}

Both files do exist. I also tried a similar expression with a local file:

TextIO.Read.from("somefolder/xxx_{2017-06-06,2017-06-06}.csv")

That worked just fine.

I would have expected all kinds of globs to be supported for files in GCS, but apparently not. Why is that? Is there a way to accomplish what I'm looking for?

Answer 1:

This may be another option, in addition to Scott's suggestion and your comment on his answer:

You can define a list with the paths you want to read and then iterate over it, creating a number of PCollections in the usual way:

PCollection<String> events1 = p.apply(TextIO.Read.from(path1));
PCollection<String> events2 = p.apply(TextIO.Read.from(path2));

Then create a PCollectionList:

PCollectionList<String> eventsList = PCollectionList.of(events1).and(events2);

And then flatten this list into your PCollection for your main input:

PCollection<String> events = eventsList.apply(Flatten.pCollections());
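The "define a list with the paths" step can be sketched in plain Java. The helper below is hypothetical (it is not part of Beam or the Dataflow SDK) and only expands a single {a,b,...} group into one concrete path per alternative. The two dates are illustrative values chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class BracePaths {
    // Hypothetical helper: expands the first {a,b,...} group in a pattern
    // into one concrete path per alternative. A pattern without a complete
    // brace group is returned unchanged as a single-element list.
    public static List<String> expand(String pattern) {
        int open = pattern.indexOf('{');
        int close = pattern.indexOf('}', open + 1);
        List<String> paths = new ArrayList<>();
        if (open < 0 || close < 0) {
            paths.add(pattern);
            return paths;
        }
        String prefix = pattern.substring(0, open);
        String suffix = pattern.substring(close + 1);
        for (String alternative : pattern.substring(open + 1, close).split(",")) {
            paths.add(prefix + alternative + suffix);
        }
        return paths;
    }

    public static void main(String[] args) {
        // Illustrative dates; the bucket name is the one from the question.
        for (String path : expand("gs://xyz.abc/xxx_{2017-06-06,2017-06-07}.csv")) {
            System.out.println(path);
        }
    }
}
```

Each resulting path could then be read with TextIO.Read.from(path), added to the PCollectionList with and(...), and flattened as described in this answer.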



Answer 2:

Glob patterns work slightly differently in Google Cloud Storage than on the local filesystem. Apache Beam's TextIO.Read transform defers to the underlying filesystem to interpret the glob.

GCS glob wildcard patterns are documented under "Wildcard Names" in the gsutil documentation.

In the case above, you could use:

TextIO.Read.from("gs://xyz.abc/xxx_2017-06-*.csv")

Note, however, that this will also include any other files that match the pattern.
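To see how broadly such a wildcard matches, the pattern can be checked against a few file names. This is only a sketch: it uses Java's local glob matcher (java.nio.file.PathMatcher) rather than GCS itself, on the assumption that "*" behaves the same way for this simple case, and the file names are made up for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class WildcardCheck {
    // Returns true if fileName matches the glob according to the
    // default (local) filesystem's glob rules.
    public static boolean matches(String glob, String fileName) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        String glob = "xxx_2017-06-*.csv";
        // The file that was wanted.
        System.out.println(matches(glob, "xxx_2017-06-06.csv")); // true
        // Another file from the same month is picked up as well.
        System.out.println(matches(glob, "xxx_2017-06-30.csv")); // true
        // A file from a different month is not.
        System.out.println(matches(glob, "xxx_2017-07-01.csv")); // false
    }
}
```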



Answer 3:

Did you try the from function of Apache Beam's TextIO.Read? Its documentation says that this is possible with GCS as well:

public TextIO.Read from(java.lang.String filepattern)

Reads text files that reads from the file(s) with the given filename or filename pattern. This can be a local path (if running locally), or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>" (if running locally or using remote execution service).

Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.
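Since "[..]" is listed as supported, a character class may be a way to select specific dates without braces. The sketch below checks a pattern of that form using Java's local glob matcher. It does not contact GCS, and whether GCS interprets the pattern identically is an assumption based on the documentation quoted above. The file names are made up for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class CharClassCheck {
    // Returns true if fileName matches the glob according to the
    // default (local) filesystem's glob rules.
    public static boolean matches(String glob, String fileName) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        // [67] matches exactly one character: either 6 or 7.
        String glob = "xxx_2017-06-0[67].csv";
        System.out.println(matches(glob, "xxx_2017-06-06.csv")); // true
        System.out.println(matches(glob, "xxx_2017-06-07.csv")); // true
        System.out.println(matches(glob, "xxx_2017-06-08.csv")); // false
    }
}
```

A character class only covers alternatives that differ in a single character, so it is narrower than a brace group.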