AWS Glue Truncate Redshift Table

Posted 2019-07-15 02:00

I have created a Glue job that copies data from S3 (csv file) to Redshift. It works and populates the desired table.

However, I need to purge the table during this process as I am left with duplicate records after the process completes.

I'm looking for a way to add this purge to the Glue process. Any advice would be appreciated.

Thanks.

3 Answers
祖国的老花朵
#2 · 2019-07-15 02:04

You can use the Databricks spark-redshift library from Spark/PySpark to truncate the table and then append to it (this performs better than an overwrite):

preactions = "TRUNCATE TABLE <schema.table>"
df.write\
  .format("com.databricks.spark.redshift")\
  .option("url", redshift_url)\
  .option("dbtable", redshift_table)\
  .option("user", user)\
  .option("password", redshift_password)\
  .option("aws_iam_role", redshift_copy_role)\
  .option("tempdir", args["TempDir"])\
  .option("preactions", preactions)\
  .mode("append")\
  .save()

You can take a look at the Databricks documentation here.

男人必须洒脱
#3 · 2019-07-15 02:06

Have you had a look at Job Bookmarks in Glue? It's a feature for keeping track of the high-water mark, and it works with S3 sources only. I'm not 100% sure, but it may require partitioning to be in place.
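As a minimal sketch of enabling bookmarks (the job name "my-glue-job" is a placeholder; a real run needs AWS credentials), bookmarks are switched on through the `--job-bookmark-option` job argument:

```python
def bookmark_args(option="job-bookmark-enable"):
    # Valid values: job-bookmark-enable, job-bookmark-disable, job-bookmark-pause.
    return {"--job-bookmark-option": option}

def run_with_bookmarks(job_name):
    # Hypothetical job name; requires AWS credentials, so not executed here.
    import boto3
    glue = boto3.client("glue")
    return glue.start_job_run(JobName=job_name, Arguments=bookmark_args())
```

Note that for bookmarks to actually advance, the ETL script itself also has to pass a `transformation_ctx` to its sources and call `job.commit()` at the end.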

骚年 ilove
#4 · 2019-07-15 02:24

You need to modify the auto-generated code provided by Glue. Connect to Redshift using a Spark JDBC connection and execute the purge query.

Specify the connection in the Glue job so that the Glue containers are spun up in the Redshift VPC and can reach the Redshift cluster.
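A minimal sketch of that purge step, assuming the job already has the Redshift JDBC URL and credentials available (the helper names here are illustrative, not a Glue API): PySpark can reach the driver JVM's `java.sql.DriverManager` via Py4J and run the statement before the load.

```python
def truncate_stmt(schema, table):
    # TRUNCATE is faster than DELETE for a full purge and reclaims space immediately.
    return "TRUNCATE TABLE {}.{}".format(schema, table)

def purge_table(sc, jdbc_url, user, password, schema, table):
    # sc is the Glue job's SparkContext; open a plain JDBC connection
    # on the driver JVM and execute the purge before writing new rows.
    conn = sc._jvm.java.sql.DriverManager.getConnection(jdbc_url, user, password)
    try:
        stmt = conn.createStatement()
        stmt.execute(truncate_stmt(schema, table))
        stmt.close()
    finally:
        conn.close()
```

You would call `purge_table(...)` in the generated script just before the `glueContext.write_dynamic_frame` / DataFrame write to Redshift.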

Hope this helps.
