I have created a Glue job that copies data from S3 (a CSV file) to Redshift. It works and populates the desired table.
However, I need to purge the table during this process, as I am left with duplicate records after the process completes.
I'm looking for a way to add this purge to the Glue process. Any advice would be appreciated.
Thanks.
You can use the Databricks spark-redshift library (Spark/PySpark) to truncate the table and then append to it, which performs better than an overwrite:
You can take a look at the Databricks spark-redshift documentation for details.
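For example, the connector's `preactions` option lets you run the truncate immediately before the load, so the purge and the append happen in one write. A minimal sketch, assuming a DataFrame `df` already holds the CSV data; the JDBC URL, IAM role, bucket, and table names are placeholders:

```python
# Truncate-then-append via the Databricks spark-redshift connector.
# All connection details below are hypothetical; substitute your own.
(
    df.write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/mydb?user=my_user&password=my_pass")  # hypothetical endpoint
      .option("dbtable", "my_schema.my_table")                                        # hypothetical target table
      .option("tempdir", "s3://my-temp-bucket/redshift-staging/")                     # hypothetical staging bucket
      .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyRole")      # hypothetical IAM role
      .option("preactions", "TRUNCATE TABLE my_schema.my_table;")                     # purge runs before the COPY
      .mode("append")
      .save()
)
```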
Have you looked at Job Bookmarks in Glue? It's a feature for keeping track of the high-water mark, and it works with S3 only. I'm not 100% sure, but it may require partitioning to be in place.
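A minimal sketch of what this looks like in the generated script, assuming the job also has `--job-bookmark-option job-bookmark-enable` set; the catalog database and table names are placeholders:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# The transformation_ctx is what Glue keys the bookmark (high-water mark) on,
# so only new files under the S3 path are read on each run.
datasource = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",     # hypothetical catalog database
    table_name="my_csv_table",  # hypothetical catalog table over the S3 path
    transformation_ctx="datasource",
)

# ... transform and write to Redshift as before ...

job.commit()  # persists the bookmark state for the next run
```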
You need to modify the auto-generated code provided by Glue: connect to Redshift over a Spark JDBC connection and execute the purge query there (a sketch follows below).
To have the Glue containers spin up inside the Redshift VPC and gain access to the cluster, specify the Redshift connection in the Glue job.
Hope this helps.
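A minimal sketch of running the purge over JDBC from within the Glue script, assuming the Redshift JDBC driver is on the classpath (it is in Glue jobs) and that the URL, credentials, and table name below are placeholders:

```python
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical cluster endpoint and credentials; substitute your own.
jdbc_url = "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/mydb"

# Reach into the JVM via py4j to open a plain JDBC connection and run the
# purge before the Glue write step repopulates the table.
conn = sc._jvm.java.sql.DriverManager.getConnection(jdbc_url, "my_user", "my_password")
stmt = conn.createStatement()
try:
    stmt.executeUpdate("TRUNCATE TABLE my_schema.my_table;")  # or a targeted DELETE
finally:
    stmt.close()
    conn.close()
```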