It is possible to read unnested JSON files on Cloud Storage with Dataflow via:
p.apply("read logfiles", TextIO.Read.from("gs://bucket/*").withCoder(TableRowJsonCoder.of()));
If I just want to write those logs to BigQuery with minimal filtering, I can use a DoFn like this one:
private static class Formatter extends DoFn<TableRow, TableRow> {
  @Override
  public void processElement(ProcessContext c) throws Exception {
    // .clone() since input is immutable
    TableRow output = c.element().clone();
    // remove misleading timestamp field
    output.remove("@timestamp");
    // set timestamp field by using the element's timestamp
    output.set("timestamp", c.timestamp().toString());
    c.output(output);
  }
}
However, I don't know how to access nested fields in the JSON file this way.
- If the TableRow contains a RECORD named r, is it possible to access its keys/values without further serialization/deserialization?
- If I need to serialize/deserialize myself with the Jackson library, does it make more sense to use the standard Coder of TextIO.Read instead of TableRowJsonCoder, to gain back some of the performance that I lose this way?
EDIT
The files are newline-delimited and look something like this:
{"@timestamp":"2015-x", "message":"bla", "r":{"analyzed":"blub", "query": {"where":"9999"}}}
{"@timestamp":"2015-x", "message":"blub", "r":{"analyzed":"bla", "query": {"where":"1111"}}}
Your best bet is probably to do what you described in #2 and use Jackson directly. It makes the most sense to let the TextIO read do what it is built for (reading lines from a file with the string coder) and then use a DoFn to actually parse the elements. Note that you could also do this using multiple ParDos.