AWS Glue predicate push down condition has no effe

2019-05-22 23:32发布

问题:

I have a MySQL source from which I am creating a Glue Dynamic Frame with predicate push down condition as follows

datasource = glueContext.create_dynamic_frame_from_catalog(
    database = source_catalog_db, 
    table_name = source_catalog_tbl, 
    push_down_predicate = "id > 1531812324", 
    transformation_ctx = "datasource")

I am always getting all the records in 'datasource' whatever the condition I put in 'push_down_predicate'. What am I missing?

回答1:

Pushdown predicate works for partitioning columns only. In other words, your data files should be placed in hierarchically structured folders. For example, if data is located in s3://bucket/dataset/ and partitioned by year, month and day then the structure should be following:

s3://bucket/dataset/year=2018/month=7/day=18/<data-files-here>

In such case pushdown predicate would work for columns year, month and day only:

datasource = glueContext.create_dynamic_frame_from_catalog(
    database = source_catalog_db, 
    table_name = source_catalog_tbl, 
    push_down_predicate = "year = 2017 and month > 6 and day between 3 and 10", 
    transformation_ctx = "datasource")

Besides that you have to keep in mind that pushdown predicates work with s3 data sources only.

Here is a nice blog post written by AWS Glue devs about data partitioning.