Athena can't resolve CSV files from AWS DMS

Posted 2019-06-01 17:44

Question:

I have DMS configured to continuously replicate data from MySQL RDS to S3. This creates two types of CSV files: a full load and change data capture (CDC). In my tests, I end up with the following files:

testdb/addresses/LOAD001.csv.gz
testdb/addresses/20180405_205807186_csv.gz

Once DMS is running properly, I trigger an AWS Glue crawler to build the Data Catalog for the S3 bucket that contains the MySQL replication files, so that Athena users can run queries against our S3-based data lake.

Unfortunately, the crawler does not build the correct table schema for the tables stored in S3. For the example above, it creates two tables in Athena:

addresses
20180405_205807186_csv_gz

The file 20180405_205807186_csv.gz contains a one-line update, but the crawler is not able to merge the two pieces of information (taking the initial load from LOAD001.csv.gz and applying the update described in 20180405_205807186_csv.gz).
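To make the expected merge concrete, here is a minimal Python sketch of the result I am after. It rests on assumptions that are not taken from my actual files: the sample rows and the two-column layout (id, street) are hypothetical, the first data column is treated as the primary key, and each CDC row is assumed to start with an operation flag (I, U, or D) while full-load rows have no such flag. The function name apply_cdc is made up for illustration.

```python
import csv
import io


def apply_cdc(full_load_rows, cdc_rows, key_index=0):
    """Replay CDC rows on top of a full-load snapshot.

    Assumption: each CDC row is [op, col1, col2, ...] where op is
    "I" (insert), "U" (update) or "D" (delete), and the full-load
    rows are [col1, col2, ...] without an operation flag.
    """
    # Index the snapshot by primary key so changes can be applied in place.
    state = {row[key_index]: list(row) for row in full_load_rows}
    for op, *values in cdc_rows:
        key = values[key_index]
        if op in ("I", "U"):
            state[key] = values
        elif op == "D":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return list(state.values())


# Hypothetical sample data standing in for LOAD001.csv.gz and the CDC file.
full_load = list(csv.reader(io.StringIO("1,Main St\n2,Oak Ave\n")))
cdc = list(csv.reader(io.StringIO("U,2,Elm Rd\n")))

merged = apply_cdc(full_load, cdc)
for row in merged:
    print(row)
```

In other words, I expected a query on the addresses table to return the merged rows, not the snapshot and the change record as two unrelated tables.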

I also tried creating the table in the Athena console, as described in this blog post: https://aws.amazon.com/pt/blogs/database/using-aws-database-migration-service-and-amazon-athena-to-replicate-and-run-ad-hoc-queries-on-a-sql-server-database/. That does not yield the desired output either.

From the blog post:

When you query data using Amazon Athena (later in this post), you simply point the folder location to Athena, and the query results include existing and new data inserts by combining data from both files.

Am I missing something?