Reading from compressed files in Dataflow

Posted 2019-03-02 23:02

Is there a way (or any kind of hack) to read input data from compressed files? My input consists of a few hundred files, which are produced gzip-compressed, and uncompressing them first is somewhat tedious.

Thanks, Genady

4 Answers
我只想做你的唯一
#2 · 2019-03-02 23:28

I just noticed that specifying the compression type is now available in the latest version of the SDK (v0.3.150210). I've tested it and was able to load my GZ files directly from GCS to BQ without any problems.
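A minimal sketch of that path, using the SDK class names of that era; the bucket path, the destination table, tableSchema, and ParseLineFn are hypothetical placeholders, not part of the original answer:

p.apply(TextIO.Read.from("gs://my-bucket/input/*.gz")
        .withCompressionType(TextIO.CompressionType.GZIP))   // read gzip-compressed text
 .apply(ParDo.of(new ParseLineFn()))                         // hypothetical DoFn: String -> TableRow
 .apply(BigQueryIO.Write
        .to("my-project:my_dataset.my_table")                // hypothetical destination table
        .withSchema(tableSchema));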

Explosion°爆炸
#3 · 2019-03-02 23:29

I also found that for files residing in Google Cloud Storage, setting the content type and content encoding appears to "just work", without the need for a workaround.

Specifically, I run:

gsutil -m setmeta -h "Content-Encoding:gzip" -h "Content-Type:text/plain" <path>
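With that metadata in place, Cloud Storage can apply decompressive transcoding, i.e. serve the object already decompressed to the reader, which is presumably why no pipeline changes are needed. A minimal sketch of the read that then "just works" (the path is a placeholder):

p.apply(TextIO.Read.from("gs://my-bucket/input/part-00000"));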
Melony?
#4 · 2019-03-02 23:31

Reading from compressed text sources is now supported in Dataflow (as of this commit). Specifically, files compressed with gzip and bzip2 can be read by specifying the compression type:

TextIO.Read.from(myFileName).withCompressionType(TextIO.CompressionType.GZIP)

However, if the file has a .gz or .bz2 extension, you don't have to do anything: the default compression type is AUTO, which examines file extensions to determine the correct compression type for each file. This even works with globs, where the files matched by the glob may be a mix of .gz, .bz2, and uncompressed files.
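For completeness, here is a runnable end-to-end sketch using the SDK package names of that era; the paths are placeholders, and newer Apache Beam releases spell the read TextIO.read().from(...) instead:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class ReadCompressedFiles {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // AUTO is the default, so a glob matching a mix of .gz, .bz2, and
    // uncompressed files needs no explicit compression type.
    PCollection<String> lines =
        p.apply(TextIO.Read.from("gs://my-bucket/input/part-*"));

    lines.apply(TextIO.Write.to("gs://my-bucket/output/copy"));
    p.run();
  }
}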

一夜七次
#5 · 2019-03-02 23:34

The slower performance with my workaround was most likely because Dataflow was putting most of the files in the same split, so they weren't being processed in parallel. You can try the following to speed things up.

  • Create a PCollection for each file by applying the Create transform multiple times (each time to a single file).
  • Use the Flatten transform to merge the per-file PCollections into a single PCollection containing all the files.
  • Apply the rest of your pipeline to this PCollection, as in the sketch below.
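A sketch of that fan-out, again in the SDK style of the time; ReadGzipFileFn is a hypothetical DoFn that opens the single file it receives and emits its decompressed lines, and the paths are placeholders:

List<String> inputFiles = Arrays.asList(
    "gs://my-bucket/input/part-00000.gz",      // placeholder paths
    "gs://my-bucket/input/part-00001.gz");

PCollectionList<String> perFile = PCollectionList.empty(p);
for (String file : inputFiles) {
  perFile = perFile.and(
      p.apply(Create.of(file))                   // one-element PCollection holding the file name
       .apply(ParDo.of(new ReadGzipFileFn())));  // hypothetical: wraps the file in a GZIPInputStream
}
PCollection<String> allLines = perFile.apply(Flatten.<String>pCollections());
// ...apply the rest of the pipeline to allLines...

Because each file gets its own Create, the runner is free to schedule the reads in parallel instead of packing them into one split.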