I have a large number of text files (~1M) stored on Google Cloud Storage. When I read these files into a Google Cloud Dataflow pipeline for processing, I always get the following error:
Total size of the BoundedSource objects returned by BoundedSource.split() operation is larger than the allowable limit
The troubleshooting page says:
You might encounter this error if you're reading from a very large number of files via TextIO, AvroIO or some other file-based source. The particular limit depends on the details of your source (e.g. embedding schema in AvroIO.Read will allow fewer files), but it is on the order of tens of thousands of files in one pipeline.
Does that mean I have to split my files into smaller batches and import them separately, rather than importing them all at once?
I'm using the Dataflow Python SDK to develop my pipelines.
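For reference, the read is essentially a single ReadFromText over a glob covering all the files; a minimal sketch is below (the bucket path, project name, and step names are placeholders, not my actual values):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder pipeline options; the real job runs on the Dataflow runner.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    temp_location='gs://my-bucket/tmp',
)

with beam.Pipeline(options=options) as p:
    (p
     # Single glob that matches all ~1M text files in the bucket.
     | 'ReadFiles' >> beam.io.ReadFromText('gs://my-bucket/texts/*.txt')
     # Placeholder processing step.
     | 'Process' >> beam.Map(lambda line: line.strip()))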