I have a general question about AWS Glue and its crawlers. I have some data streams that land in S3 buckets, and I use AWS Athena to access them as external tables in Redshift.
The tables are partitioned by hour, and Glue crawlers update the partitions and the table structure every hour.
The problem is that the crawlers take longer and longer, and at some point they will no longer finish in under an hour.
Is there a setting that speeds up this process, or a proper alternative to the crawlers in AWS Glue?
Unfortunately, there are no configuration options for tuning the performance of Glue crawlers. However, as far as I know, the AWS Glue team is expected to release a feature that significantly improves crawler performance (I don't know the date, though).
In general, there are a few ways to register new partitions in the Data Catalog:
- Run a Glue Crawler
- Run an MSCK REPAIR TABLE <table> query in Athena
- Add partition via Athena
- Add partition via Glue API
The most efficient way is to add the partition manually (the last two options: via Athena or via the Glue API). So if you know when new partitions appear and which ones need to be registered, you can set up a Lambda function that calls Athena or the Glue API. The Lambda itself can be triggered by SNS or by a CloudWatch event.
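As a rough illustration of the manual approach, here is a minimal sketch of such a Lambda in Python with boto3. It rests on several assumptions that are not from the question: the database and table names (`my_database`, `my_table`) are hypothetical, the partition keys are taken to be `year`, `month`, `day`, `hour` in that order, and the S3 layout is assumed to be Hive-style (`.../year=2019/month=03/...`). The helper also builds the equivalent `ALTER TABLE ... ADD PARTITION` statement in case you prefer to go through Athena.

```python
import copy
from datetime import datetime, timezone

# Hypothetical names, replace with your own catalog database and table.
DATABASE = "my_database"
TABLE = "my_table"


def hourly_partition_values(ts):
    """Return [year, month, day, hour] strings for a timestamp (assumed key order)."""
    return [ts.strftime("%Y"), ts.strftime("%m"), ts.strftime("%d"), ts.strftime("%H")]


def partition_location(table_location, keys, values):
    """Build a Hive-style S3 prefix such as s3://bucket/data/year=2019/.../hour=05/."""
    suffix = "/".join(f"{k}={v}" for k, v in zip(keys, values))
    return f"{table_location.rstrip('/')}/{suffix}/"


def build_partition_input(table_sd, keys, values):
    """Copy the table's storage descriptor and point it at the new partition."""
    sd = copy.deepcopy(table_sd)  # do not mutate the descriptor returned by Glue
    sd["Location"] = partition_location(table_sd["Location"], keys, values)
    return {"Values": list(values), "StorageDescriptor": sd}


def build_add_partition_sql(table, keys, values, location):
    """Equivalent Athena DDL, if you would rather submit a query than call Glue."""
    spec = ", ".join(f"{k}='{v}'" for k, v in zip(keys, values))
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


def lambda_handler(event, context):
    """Register the partition for the current UTC hour via the Glue API."""
    import boto3  # imported here so the helpers above work without the AWS SDK

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
    keys = [k["Name"] for k in table["PartitionKeys"]]
    values = hourly_partition_values(datetime.now(timezone.utc))
    partition = build_partition_input(table["StorageDescriptor"], keys, values)
    try:
        glue.create_partition(
            DatabaseName=DATABASE, TableName=TABLE, PartitionInput=partition
        )
    except glue.exceptions.AlreadyExistsException:
        pass  # partition was already registered, nothing to do
    return {"registered": values}
```

Scheduling this once per hour avoids listing the whole bucket, which is what makes crawlers and MSCK REPAIR TABLE slower as the data grows. Note that this only registers partitions; if the table schema changes, it still has to be updated separately (for example by an occasional crawler run).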