I understand that there is no UPSERT query one can run directly from Glue against Redshift. Is it possible to implement the staging-table approach within the Glue script itself?
My expectation is to create the staging table, merge it with the destination table, and finally delete it. Can this be achieved within the Glue script?
It is possible to implement an upsert into Redshift using a staging table in Glue by passing the 'postactions' option to the JDBC sink:
val destinationTable = "upsert_test"
val destination = s"dev_sandbox.${destinationTable}"
val staging = s"dev_sandbox.${destinationTable}_staging"
val fields = datasetDf.toDF().columns.mkString(",")

// Delete matching rows from the destination, copy the staged rows in, then drop the staging table
val postActions =
  s"""
     DELETE FROM $destination USING $staging AS S
       WHERE $destinationTable.id = S.id
         AND $destinationTable.date = S.date;
     INSERT INTO $destination ($fields) SELECT $fields FROM $staging;
     DROP TABLE IF EXISTS $staging
  """

// Write data to the staging table in Redshift; postactions runs once the load completes
glueContext.getJDBCSink(
  catalogConnection = "redshift-glue-connections-test",
  options = JsonOptions(Map(
    "database" -> "conndb",
    "dbtable" -> staging,
    "overwrite" -> "true",
    "postactions" -> postActions
  )),
  redshiftTmpDir = s"$tempDir/redshift",
  transformationContext = "redshift-output"
).writeDynamicFrame(datasetDf)
Make sure the user writing to Redshift has sufficient permissions to create and drop tables in the staging schema.
Yes, it is achievable. All you need is to import the pg8000 module into your Glue job. pg8000 is a Python library used to connect to Amazon Redshift and execute SQL queries through a cursor.
Python module reference: https://github.com/mfenniak/pg8000
Then connect to your target cluster with pg8000.connect(user='user', database='dbname', host='hosturl', port=5439, password='urpasswrd'), use Glue's datasink option to load into the staging table, and then run the upsert SQL query using the pg8000 cursor:
>>> import pg8000
>>> conn = pg8000.connect(user='user', database='dbname', host='hosturl', port=5439, password='urpasswrd')
>>> cursor = conn.cursor()
>>> # stage_table is the staging table that the Glue datasink loaded
>>> cursor.execute("DELETE FROM final_target USING stage_table WHERE final_target.id = stage_table.id")
>>> cursor.execute("INSERT INTO final_target SELECT * FROM stage_table")
>>> conn.commit()
You would need to zip the pg8000 package, put it in an S3 bucket, and reference it in the Python library path under Advanced options / Job parameters in the Glue job settings.
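Putting it together, a rough sketch of that flow inside the Glue script could look like the following (assuming the usual glueContext and args from the Glue job boilerplate; the connection name, table names, and id key column are placeholders, and datasource0 stands for the DynamicFrame to be loaded):

import pg8000

# 1. Load the incoming DynamicFrame into the staging table via the Glue datasink
#    (Glue creates the staging table if it does not exist and appends to it)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=datasource0,
    catalog_connection="my-redshift-connection",   # placeholder connection name
    connection_options={"dbtable": "my_schema.stage_table", "database": "dbname"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="load_staging")

# 2. Merge the staging table into the target and clean up using a pg8000 cursor
conn = pg8000.connect(user='user', database='dbname', host='hosturl', port=5439, password='urpasswrd')
cursor = conn.cursor()
cursor.execute("DELETE FROM my_schema.final_target USING my_schema.stage_table "
               "WHERE final_target.id = stage_table.id")
cursor.execute("INSERT INTO my_schema.final_target SELECT * FROM my_schema.stage_table")
cursor.execute("DROP TABLE my_schema.stage_table")
conn.commit()
conn.close()

The merge statements run in a single pg8000 transaction, so the target table only changes once conn.commit() succeeds.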
Apparently the connection_options dictionary parameter of the glueContext.write_dynamic_frame.from_jdbc_conf function has two interesting keys: preactions and postactions.
target_table = "my_schema.my_table"
stage_table = "my_schema.#my_table_stage_table"

# Create an empty staging table with the same schema as the target
pre_query = """
    drop table if exists {stage_table};
    create table {stage_table} as select * from {target_table} LIMIT 0;""".format(stage_table=stage_table, target_table=target_table)

# Merge the staging table into the target, then drop it, in one transaction
post_query = """
    begin;
    delete from {target_table} using {stage_table} where {stage_table}.id = {target_table}.id;
    insert into {target_table} select * from {stage_table};
    drop table {stage_table};
    end;""".format(stage_table=stage_table, target_table=target_table)

datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=datasource0,
    catalog_connection="test_red",
    connection_options={"preactions": pre_query, "postactions": post_query,
                        "dbtable": stage_table, "database": "redshiftdb"},
    redshift_tmp_dir='s3://s3path',
    transformation_ctx="datasink4")
Based on https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
AWS Glue supports Spark and Databricks libraries, so you can use the Databricks spark-redshift library from Spark/PySpark to overwrite the table:
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", redshift_url) \
    .option("dbtable", redshift_table) \
    .option("user", user) \
    .option("password", redshift_password) \
    .option("aws_iam_role", redshift_copy_role) \
    .option("tempdir", args["TempDir"]) \
    .mode("overwrite") \
    .save()
Per Databricks/Spark documentation:
Overwriting an existing table: By default, this library uses
transactions to perform overwrites, which are implemented by deleting
the destination table, creating a new empty table and appending rows
to it.
You can take a look at the Databricks documentation here.
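Note that mode("overwrite") replaces the whole table rather than merging rows. If you need an actual upsert with this writer, the spark-redshift data source also accepts preactions/postactions options, so one possible sketch is to write to a staging table and merge it in a postaction. This reuses the variables from the snippet above; my_schema, the staging/target table names, and the id key column are placeholders:

# Write to a staging table, then merge into the target via the writer's postactions option
staging_table = "my_schema.my_table_staging"
target_table = "my_schema.my_table"

merge_sql = (
    "DELETE FROM {t} USING {s} WHERE {t}.id = {s}.id; "
    "INSERT INTO {t} SELECT * FROM {s}; "
    "DROP TABLE {s};"
).format(t=target_table, s=staging_table)

df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", redshift_url) \
    .option("dbtable", staging_table) \
    .option("user", user) \
    .option("password", redshift_password) \
    .option("aws_iam_role", redshift_copy_role) \
    .option("tempdir", args["TempDir"]) \
    .option("postactions", merge_sql) \
    .mode("overwrite") \
    .save()

Since the staging table is written with overwrite, it is recreated on every run before the merge statements execute.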