I have created a Glue job that copies data from S3 (a CSV file) to Redshift. It works and populates the desired table.
However, I need to purge the table during this process, as I am left with duplicate records after the process completes.
I'm looking for a way to add this purge to the Glue process. Any advice would be appreciated.
Thanks.
You can use the Databricks spark-redshift library (Spark/PySpark) to truncate the table and then append to it, which performs better than an overwrite:
You can take a look at the Databricks spark-redshift documentation for details.
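For example, the connector's `preactions` option lets you run the truncate immediately before the load, so the purge and the append happen in one write. A minimal sketch, assuming a DataFrame `df` already holds the CSV data; the JDBC URL, IAM role, bucket, and table names are placeholders:

```python
# Truncate-then-append via the Databricks spark-redshift connector.
# All connection details below are hypothetical; substitute your own.
(
    df.write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/mydb?user=my_user&password=my_pass")  # hypothetical endpoint
      .option("dbtable", "my_schema.my_table")                                        # hypothetical target table
      .option("tempdir", "s3://my-temp-bucket/redshift-staging/")                     # hypothetical staging bucket
      .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyRole")      # hypothetical IAM role
      .option("preactions", "TRUNCATE TABLE my_schema.my_table;")                     # purge runs before the COPY
      .mode("append")
      .save()
)
```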
Have you looked at Job Bookmarks in Glue? It's a feature for keeping track of the high-water mark, and it works with S3 only. I'm not 100% sure, but it may require partitioning to be in place.
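A minimal sketch of what this looks like in the generated script, assuming the job also has `--job-bookmark-option job-bookmark-enable` set; the catalog database and table names are placeholders:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# The transformation_ctx is what Glue keys the bookmark (high-water mark) on,
# so only new files under the S3 path are read on each run.
datasource = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",     # hypothetical catalog database
    table_name="my_csv_table",  # hypothetical catalog table over the S3 path
    transformation_ctx="datasource",
)

# ... transform and write to Redshift as before ...

job.commit()  # persists the bookmark state for the next run
```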
You need to modify the auto-generated code provided by Glue: connect to Redshift over a Spark JDBC connection and execute the purge query there (a sketch follows below).
To have the Glue containers spin up inside the Redshift VPC and gain access to the cluster, specify the Redshift connection in the Glue job.
Hope this helps.
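A minimal sketch of running the purge over JDBC from within the Glue script, assuming the Redshift JDBC driver is on the classpath (it is in Glue jobs) and that the URL, credentials, and table name below are placeholders:

```python
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical cluster endpoint and credentials; substitute your own.
jdbc_url = "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/mydb"

# Reach into the JVM via py4j to open a plain JDBC connection and run the
# purge before the Glue write step repopulates the table.
conn = sc._jvm.java.sql.DriverManager.getConnection(jdbc_url, "my_user", "my_password")
stmt = conn.createStatement()
try:
    stmt.executeUpdate("TRUNCATE TABLE my_schema.my_table;")  # or a targeted DELETE
finally:
    stmt.close()
    conn.close()
```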