Presently, we have a DMS task that will take the contents of a MySQL DB and dump files to S3 in parquet format.
The format for the timestamps in parquet ends up being TIMESTAMP_MICROS.
This is a problem as Presto (the underlying implementation of Athena) does not support timestamps in microsecond precision and makes the assumption that all timestamps are in millisecond precision.
This does not cause any errors directly but it makes the times display as some extreme future date as it is interpreting the number of microseconds as number of milliseconds.
We are currently working around this by creating Athena views on top of the Athena tables that take these these timestamps, convert them to a unix timestamp (i.e. number), divide by 1000 to convert to millisecond resolution and then convert back to a timestamp.
This was fine for a proof of concept and identifying the problem but this won't work for us as we have lots of tables with lots of timestamp columns and don't want to introduce this fragile view layer on top.
Support tickets were created for both the Athena and DMS teams and both confirmed the problem and created feature requests on their end to create a native solution for this incompatible but neither support engineer could offer a built-in workaround. DMS does not seem to support changing the column type to be millisecond precision or a varchar string. For Athena, this is simply a documented Presto limitation.
Has anyone run into this before and found a solution?
Changing the precision in the MySQL database is not really an option as it is used by many clients and we don't want to risk breaking the interface.
I am considering creating a job that will run anytime a new parquet is created to look for TIMESTAMP_MICRO fields and column them to TIMESTAMP_MILLIS. Not exactly sure the best way to go about this (Glue/Spark job?)
DMS was great and has made the dump to parquet super simple, there may be other ways to extract the data from MySQL. Any other tools that can do this without much custom development?