I'm using the Python Beam SDK 0.6.0, and I would like to write my output as JSON files in Google Cloud Storage. What is the best way to do this?
I guess I can use WriteToText from the text IO sink, but then I have to format each row separately, right? How do I save my result into valid JSON files that contain lists of objects?
Making each file contain a single list of elements is difficult, because you would need to group a set of elements and then write them all to the same file together. Let me advise you to use a different format instead.
You may consider the JSON Lines format, where each line in a file represents a single JSON element.
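For instance, a JSON Lines file with three records looks like this (illustrative data):

```
{"id": 1, "name": "alice"}
{"id": 2, "name": "bob"}
{"id": 3, "name": "carol"}
```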
Transforming your data to JSON Lines should be pretty easy. The following transform should do the trick:
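A minimal sketch of that transform (the bucket path, sample data, and step labels are placeholders): json.dumps turns each element into a one-line JSON string, and WriteToText then writes one element per line, which is exactly JSON Lines.

```python
import json

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Create' >> beam.Create([{'id': 1, 'name': 'alice'},
                                {'id': 2, 'name': 'bob'}])
     # Each element becomes a single-line JSON string.
     | 'ToJson' >> beam.Map(json.dumps)
     # WriteToText emits one element per line -> JSON Lines output.
     | 'WriteJsonLines' >> beam.io.WriteToText(
         'gs://my-bucket/output/records', file_name_suffix='.jsonl'))
```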
Finally, if you later want to read your JSON Lines files back, you can write your own JsonLinesSource or use the one in beam_utils.
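If you would rather not maintain a custom source, a simple read path is ReadFromText plus json.loads (a sketch; the file pattern is a placeholder):

```python
import json

import apache_beam as beam

with beam.Pipeline() as p:
    records = (p
               # Each line of the matched files is one JSON document.
               | 'ReadLines' >> beam.io.ReadFromText(
                   'gs://my-bucket/output/records*.jsonl')
               | 'ParseJson' >> beam.Map(json.loads))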
OK, for reference, I solved this by writing my own sink building on the _TextSink used by WriteToText in the Beam SDK. Not sure if this is the best way to do it, but it works well so far. Hope it might help someone else.
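Roughly, the sink looks like the sketch below. It is written against the newer filebasedsink.FileBasedSink base class (in SDK 0.6.0 the equivalent base lived under apache_beam.io.fileio), so the module paths may need adjusting for your version, and the class and parameter names are mine. Each shard is opened with a "[", records are separated with commas, and the shard is closed with a "]", so every output file is one valid JSON list.

```python
import apache_beam as beam
from apache_beam import coders
from apache_beam.io import filebasedsink


class _JsonSink(filebasedsink.FileBasedSink):
    """Writes each output shard as a single JSON list.

    Elements are expected to already be JSON-formatted strings
    (e.g. produced by json.dumps).
    """

    def __init__(self, file_path_prefix, file_name_suffix='.json',
                 num_shards=0, shard_name_template=None):
        super().__init__(
            file_path_prefix,
            coder=coders.ToStringCoder(),
            file_name_suffix=file_name_suffix,
            num_shards=num_shards,
            shard_name_template=shard_name_template,
            mime_type='application/json')
        # Per-shard state kept on the sink; this assumes one open shard
        # per sink instance at a time, which holds for simple pipelines.
        self._first_record_written = False

    def open(self, temp_path):
        # Start every shard with the opening bracket of a JSON list.
        file_handle = super().open(temp_path)
        file_handle.write(b'[')
        self._first_record_written = False
        return file_handle

    def write_record(self, file_handle, value):
        # Prepend a comma before every record except the first one,
        # so the file stays valid JSON with no trailing comma.
        if self._first_record_written:
            file_handle.write(b',')
        file_handle.write(self.coder.encode(value))
        self._first_record_written = True

    def close(self, file_handle):
        # Close the JSON list before closing the file.
        if file_handle is not None:
            file_handle.write(b']')
            file_handle.close()


class WriteToJson(beam.PTransform):
    """PTransform wrapper so the sink composes with the | operator."""

    def __init__(self, file_path_prefix, **kwargs):
        super().__init__()
        self._sink = _JsonSink(file_path_prefix, **kwargs)

    def expand(self, pcoll):
        return pcoll | beam.io.Write(self._sink)
```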
Using the sink is similar to how you use the text sink:
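Hypothetical usage of the WriteToJson wrapper from the sketch above (the bucket path and sample data are placeholders); elements are converted to JSON strings before they reach the sink:

```python
import json

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Create' >> beam.Create([{'id': 1, 'name': 'alice'},
                                {'id': 2, 'name': 'bob'}])
     | 'ToJson' >> beam.Map(json.dumps)
     | 'WriteJson' >> WriteToJson('gs://my-bucket/output/result',
                                  num_shards=1))
```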
Although this is a year late, I'd like to add another way to write results to JSON files in GCS. For Apache Beam 2.x pipelines, this transform works:
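A sketch of one such transform (the class name and step labels are mine): it formats each element with json.dumps and delegates the file handling to WriteToText, producing newline-delimited JSON in GCS.

```python
import json

import apache_beam as beam


class WriteToJsonLines(beam.PTransform):
    """Formats each element as JSON and writes one object per line."""

    def __init__(self, file_path_prefix, file_name_suffix='.json'):
        super().__init__()
        self._file_path_prefix = file_path_prefix
        self._file_name_suffix = file_name_suffix

    def expand(self, pcoll):
        return (pcoll
                | 'ToJson' >> beam.Map(json.dumps)
                | 'WriteToText' >> beam.io.WriteToText(
                    self._file_path_prefix,
                    file_name_suffix=self._file_name_suffix))
```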
For example:
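Continuing the sketch above, a hypothetical pipeline that uses it (the bucket path and input data are placeholders):

```python
with beam.Pipeline() as p:
    (p
     | 'Create' >> beam.Create([{'user': 'alice', 'score': 10},
                                {'user': 'bob', 'score': 7}])
     | 'WriteJson' >> WriteToJsonLines('gs://my-bucket/results/output'))
```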