Filtering nested JSON in AWS Glue

Posted 2019-06-14 14:25

Question:

We would like to use an AWS Glue job to filter JSON messages in an S3 bucket.

Here is some example JSON:

{ "property": {"subproperty1": "A", "subproperty2": "B" }}
{ "property": {"subproperty1": "C", "subproperty2": "D" }}

We want to keep only the records where subproperty1 is in ["A", "B"]. This is what we tried:

applyFilter1 = Filter.apply(
  frame = datasource0, 
  f = lambda x: x["property.subproperty1"] in ["A", "B"]
)

The output is then written to a new S3 bucket as follows:

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = applyFilter1, 
    connection_type = "s3", 
    connection_options = {"path": "s3://<my-s3-location>"}, 
    format = "json", 
    transformation_ctx = "datasink2"
)

Unfortunately, the resulting file is empty. Any ideas? Is filtering on nested fields like this supported in AWS Glue?
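
One thing that may be worth checking is how the key is looked up. The following is a minimal sketch that uses plain Python dictionaries as stand-ins for the records. That is an assumption, since a real Glue record may behave differently. The helper names dotted_lookup and nested_lookup are hypothetical. On plain dictionaries, "property.subproperty1" is treated as one literal key rather than a path, so nothing matches. The sketch uses .get so that a missing key yields None instead of raising KeyError.

```python
# Plain dicts stand in for Glue records here (an assumption; a real
# Glue record may behave differently).
records = [
    {"property": {"subproperty1": "A", "subproperty2": "B"}},
    {"property": {"subproperty1": "C", "subproperty2": "D"}},
]

allowed = ["A", "B"]


def dotted_lookup(record):
    # Hypothetical helper: treats "property.subproperty1" as a single
    # literal key, which none of the sample records contain.
    return record.get("property.subproperty1") in allowed


def nested_lookup(record):
    # Hypothetical helper: walks one level at a time, i.e. the
    # equivalent of record["property"]["subproperty1"].
    return record.get("property", {}).get("subproperty1") in allowed


dotted_matches = [r for r in records if dotted_lookup(r)]
nested_matches = [r for r in records if nested_lookup(r)]

print(len(dotted_matches), len(nested_matches))
```

If Glue records behave the same way, the filter lambda would presumably need to be written as lambda x: x["property"]["subproperty1"] in ["A", "B"]. This has not been verified against Glue here.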