Troubleshooting Apache Beam pipeline import errors

Posted 2019-03-02 07:46

I have a large number of text files (~1M) stored on Google Cloud Storage. When I read these files into a Google Cloud Dataflow pipeline for processing, I always get the following error:

Total size of the BoundedSource objects returned by BoundedSource.split() operation is larger than the allowable limit

The troubleshooting page says:

You might encounter this error if you're reading from a very large number of files via TextIO, AvroIO or some other file-based source. The particular limit depends on the details of your source (e.g. embedding schema in AvroIO.Read will allow fewer files), but it is on the order of tens of thousands of files in one pipeline.

Does that mean I have to split my files into smaller batches, rather than import them all at once?

I'm using the Dataflow Python SDK to develop my pipelines.

Tags: python google-cloud-storage google-cloud-dataflow dataflow apache-beam

1 answer

唯我独甜

Reply #2 · 2019-03-02 07:51

Splitting your files into batches is a reasonable workaround, e.g. reading them with multiple ReadFromText transforms or with multiple pipelines. At the level of 1M files, though, I think the first approach will not work. It's better to use a new feature:

The best way to read a very large number of files is using ReadAllFromText. It does not have scalability limitations (though it will perform worse if your number of files is very small).

It will be available in Beam 2.2.0, but it is already available at HEAD if you're willing to use a snapshot build.
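
For reference, here is a minimal sketch of what the ReadAllFromText approach might look like in the Python SDK. It assumes Beam 2.2.0 or later; the bucket name, the prefixes, the `.txt` suffix and the helper names `gcs_patterns` and `read_all_text` are hypothetical and only there for illustration. The idea is that the file patterns become the elements of a PCollection, so they are expanded while the pipeline runs rather than when it is constructed.

```python
def gcs_patterns(bucket, prefixes):
    """Build one glob pattern per prefix (hypothetical bucket layout)."""
    return ["gs://{}/{}/*.txt".format(bucket, p.strip("/")) for p in prefixes]


def read_all_text(pipeline, patterns):
    """Return a PCollection of lines from every file matching the patterns."""
    # Imported here so the pattern helper above stays usable without Beam.
    import apache_beam as beam
    from apache_beam.io.textio import ReadAllFromText

    return (
        pipeline
        # The patterns are ordinary elements of a PCollection ...
        | "FilePatterns" >> beam.Create(patterns)
        # ... and ReadAllFromText reads each matching file line by line.
        | "ReadAllFiles" >> ReadAllFromText()
    )


# Possible usage (assumed names, not run here):
#
#   import apache_beam as beam
#   with beam.Pipeline() as p:
#       lines = read_all_text(p, gcs_patterns("my-bucket", ["input"]))
```

A single pattern such as `gs://my-bucket/input/*.txt` is usually enough; several prefixes are shown only to illustrate that any number of patterns can be passed in.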
