AWS Glue Crawlers and large tables stored in S3

Published 2019-08-14 07:03

Question:

I have a general question about AWS Glue and its crawlers. I have some data streaming into S3 buckets, and I use AWS Athena to access it as external tables in Redshift. The tables are partitioned by hour, and Glue crawlers update the partitions and the table structure every hour.

The problem is that the crawlers take longer and longer, and someday they will not finish in less than an hour. Is there a setting to speed up this process, or a proper alternative to the crawlers in AWS Glue?

Answer 1:

Unfortunately, there are no configuration options for tuning the performance of Glue Crawlers. However, as far as I know, the AWS Glue team is expected to release a feature that significantly improves crawler performance (I don't know the date, though).

In general, there are a few ways to register new partitions in the Data Catalog:

1. Run a Glue Crawler
2. Run an MSCK REPAIR TABLE <table> Athena query
3. Add the partition via Athena
4. Add the partition via the Glue API
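
To make the two Athena-based options concrete, here is a minimal sketch of the statements involved and how they might be submitted with boto3. The table name, the S3 layout (`dt=YYYY-MM-DD/hour=HH`), and the partition keys `dt` and `hour` are assumptions for illustration; the question only says the tables are partitioned by hour. `start_query_execution` is the boto3 Athena call for submitting a query, and the client is passed in by the caller.

```python
from datetime import datetime, timezone


def repair_table_sql(table: str) -> str:
    """Athena statement that scans the table location for Hive-style partitions."""
    return f"MSCK REPAIR TABLE {table}"


def add_partition_sql(table: str, location_root: str, when: datetime) -> str:
    """Athena statement that registers exactly one hourly partition.

    Assumes (hypothetically) partition keys `dt` and `hour` and a matching
    S3 prefix layout under `location_root`.
    """
    dt, hour = when.strftime("%Y-%m-%d"), when.strftime("%H")
    location = f"{location_root.rstrip('/')}/dt={dt}/hour={hour}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{dt}', hour='{hour}') LOCATION '{location}'"
    )


def run_in_athena(athena_client, sql: str, database: str, output_location: str) -> str:
    """Submit a statement through a boto3 Athena client and return the query execution id."""
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

The ADD PARTITION statement touches only the one partition it names, whereas MSCK REPAIR TABLE has to list the whole table location, so the latter is likely to slow down as the number of partitions grows.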

The most efficient way is to add partitions manually (3 or 4). So if you know when and which new partitions should be registered, you can set up a Lambda function that calls Athena or the Glue API. The Lambda itself can be triggered by SNS or a CloudWatch event.
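
As a rough illustration of the Glue API route, the following sketch shows a Lambda handler that registers the partition for the hour in which a scheduled CloudWatch event fires. The database name, table name, and the `dt`/`hour` partition layout are hypothetical; the handler also assumes the scheduled-event format, where `event["time"]` is a UTC timestamp such as `2019-08-14T07:00:00Z`. It copies the table's storage descriptor so the new partition inherits the table's format and columns, and relies on the boto3 Glue calls `get_table` and `create_partition`.

```python
import copy
from datetime import datetime, timezone

DATABASE = "my_database"  # hypothetical Data Catalog database name
TABLE = "events"          # hypothetical table name


def partition_input(table: dict, when: datetime) -> dict:
    """Build a Glue PartitionInput for one hourly partition.

    Assumes partition keys (dt, hour) and an S3 layout of dt=.../hour=.../
    under the table location.
    """
    dt, hour = when.strftime("%Y-%m-%d"), when.strftime("%H")
    sd = copy.deepcopy(table["StorageDescriptor"])
    sd["Location"] = f"{sd['Location'].rstrip('/')}/dt={dt}/hour={hour}/"
    return {"Values": [dt, hour], "StorageDescriptor": sd}


def register_partition(glue, when: datetime) -> bool:
    """Create the partition; return False if it was already registered."""
    table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
    try:
        glue.create_partition(
            DatabaseName=DATABASE,
            TableName=TABLE,
            PartitionInput=partition_input(table, when),
        )
    except glue.exceptions.AlreadyExistsException:
        return False
    return True


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without AWS access.
    import boto3

    # Scheduled CloudWatch events carry a UTC timestamp in event["time"].
    when = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    created = register_partition(boto3.client("glue"), when)
    return {"partition": when.strftime("%Y-%m-%d/%H"), "created": created}
```

Because each invocation registers a single known partition, its cost does not depend on how many partitions the table already has. Note that, unlike a crawler, this approach does not pick up changes to the table structure.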