Create Large CSV data using Google Cloud Dataflow

Published 2019-08-03 12:20

Question:

I need to create a large CSV file of ~2 billion records, with a header. It takes a long time to create with a standalone script; however, since the records are not related to one another, I understand Cloud Dataflow can distribute the work by spinning up multiple worker GCE machines of my choice. Does Cloud Dataflow always need to have an input? Here I am trying to programmatically generate data in the following format:

ItemId,   ItemQuantity, ItemPrice, SaleValue, SaleDate
item0001, 25,           100,       2500,      2017-03-18
item0002, 50,           200,       10000,     2017-03-25

Note

- ItemId can be postfixed with any random number between 0001 and 9999
- ItemQuantity can be a random value between 1 and 1000
- ItemPrice can be a random value between 1 and 100
- SaleValue = ItemQuantity * ItemPrice
- SaleDate is between 2015-01-01 and 2017-12-31

Any language is fine.

Continuing from the question Generating large file using Google Cloud Dataflow

Answer 1:

Currently, there is not a very elegant way of doing this. In Python you would do something like the following (the same approach works in Java; only the syntax changes):

import apache_beam as beam

def generate_keys(unused_element):
  # Called once (for the single element from Create); emit 2000 key-value
  # pairs so the GroupByKey below can shuffle them to different workers.
  for i in range(2000):
    yield (i, 0)

def generate_random_elements(unused_key_and_values):
  # Called once per key after the GroupByKey:
  # 2000 keys * 1,000,000 rows each = 2 billion records.
  for _ in range(1000000):
    yield random_element()

p = beam.Pipeline(options=my_options)  # my_options: a PipelineOptions instance
(p
 | beam.Create(['any'])
 | beam.FlatMap(generate_keys)
 | beam.GroupByKey()
 | beam.FlatMap(generate_random_elements)
 | beam.io.WriteToText('gs://bucket-name/file-prefix'))
p.run()
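
Here random_element() is left as a placeholder; a minimal sketch of what it could look like for the format in the question (the exact column formatting and date handling are my assumptions) is:

import datetime
import random

def random_element():
  # One CSV row per the question's spec:
  # ItemId item0001..item9999, quantity 1-1000, price 1-100,
  # SaleValue = quantity * price, SaleDate between 2015-01-01 and 2017-12-31.
  item_id = 'item%04d' % random.randint(1, 9999)
  quantity = random.randint(1, 1000)
  price = random.randint(1, 100)
  sale_value = quantity * price
  start = datetime.date(2015, 1, 1)
  end = datetime.date(2017, 12, 31)
  sale_date = start + datetime.timedelta(days=random.randint(0, (end - start).days))
  return '%s,%d,%d,%d,%s' % (item_id, quantity, price, sale_value, sale_date.isoformat())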

In generate_keys() we generate 2000 different keys, and then run a GroupByKey so that they are shuffled to different workers; with 1,000,000 elements emitted per key, that gives the 2 billion records. We need this workaround because a DoFn cannot currently be split across several workers. (Once SplittableDoFn is implemented, this will be much easier.)

As a note, when Dataflow writes results out to a sink, it commonly shards them into multiple files (e.g. gs://bucket-name/file-prefix-00000-of-00020, and so on), so you'll need to concatenate the files yourself afterwards.
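
If you also need the header row from the question, recent Beam releases of WriteToText accept a header keyword argument; a sketch of the sink line above with that option (the .csv suffix is my choice):

 | beam.io.WriteToText(
     'gs://bucket-name/file-prefix',
     file_name_suffix='.csv',
     header='ItemId,ItemQuantity,ItemPrice,SaleValue,SaleDate')

Note that the header is written at the top of every shard, and while num_shards=1 would force a single output file, it funnels all the data through one writer, so for 2 billion rows it is usually better to keep many shards and merge them after the job (for example with gsutil compose, which joins up to 32 objects per call).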

Also, you can pass --num_workers 10 (or however many workers you want Dataflow to spawn), or use autoscaling.
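
For reference, the my_options object used when constructing the pipeline would typically be a PipelineOptions instance; a minimal sketch, with the project, region, and bucket names as placeholders:

from apache_beam.options.pipeline_options import PipelineOptions

my_options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project-id',              # placeholder
    region='us-central1',                 # placeholder
    temp_location='gs://bucket-name/tmp',
    num_workers=10)

Setting num_workers here is equivalent to passing --num_workers on the command line; omitting it and relying on autoscaling, as noted above, also works.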