Read file in order in Google Cloud Dataflow

Posted 2019-09-05 07:54

I'm using Spotify Scio to read logs that are exported from Stackdriver to Google Cloud Storage. They are JSON files where every line is a single entry. Looking at the worker logs it seems like the file is split into chunks, which are then read in any order. I've already limited my job to exactly 1 worker in this case. Is there a way to force these chunks to be read and processed in order?

As an example (textFile is basically a TextIO.Read):

val sc = ScioContext(myOptions)
sc.textFile(myFile).map(line => logger.info(line))

Would produce output similar to this based on the worker logs:

line 5
line 6
line 7
line 8
<Some other work>
line 1
line 2
line 3
line 4
<Some other work>
line 9
line 10
line 11
line 12

What I want to know is if there's a way to force it to read lines 1-12 in order. I've found that gzipping the file and reading it with the CompressionType specified is a workaround but I'm wondering if there are any ways to do this that don't involve zipping or changing the original file.

Tags: google-cloud-platform google-cloud-dataflow spotify-scio

1 answer

骚年 ilove

#2 · 2019-09-05 08:17

Google Cloud Dataflow / Apache Beam currently do not support sorting or preserving order in processing pipelines. The drawback of allowing sorted output is that producing it for a large dataset eventually bottlenecks on a single machine, which does not scale.
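
Since the runner gives no ordering guarantee, one option is to make the order part of the data and restore it downstream. The sketch below is plain Scala collections, not Scio or Beam API, and only illustrates the idea. It assumes every record carries a sortable key, for example the `timestamp` field of a Stackdriver log entry or a line number; `Indexed`, `position`, and `inFileOrder` are hypothetical names introduced for illustration.

```scala
// Hypothetical record type: a line paired with a sortable key
// (assumed here to be the position the line had in the source file).
final case class Indexed(position: Long, line: String)

object RestoreOrder {
  // Put records back into source order using the key carried with each record.
  def inFileOrder(records: Seq[Indexed]): Seq[String] =
    records.sortBy(_.position).map(_.line)

  def main(args: Array[String]): Unit = {
    // Records arrive chunk by chunk in the arbitrary order seen in the worker logs.
    val arrived = Seq(5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 9L, 10L, 11L, 12L)
      .map(n => Indexed(n, s"line $n"))

    // Sorting on the carried key restores the original sequence.
    inFileOrder(arrived).foreach(println)
  }
}
```

In a real pipeline the equivalent sort would have to run over all records that belong together, collected in one place, which is exactly the single-machine bottleneck described above. It is therefore only practical when each ordered group is small enough to fit on one worker.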
