I am copying data from Amazon S3 to Redshift. During this process, I need to avoid the same files being loaded again. I don't have any unique constraints on my Redshift table. Is there a way to implement this using the copy command?
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
I tried adding a unique constraint and setting a column as the primary key, with no luck. Redshift does not seem to enforce unique/primary key constraints.
Currently there is no direct way to remove duplicates in Redshift. Redshift doesn't support primary key/unique key constraints, and removing duplicates by row number (deleting every row whose row number is greater than 1) is not an option either, since the DELETE operation on Redshift doesn't allow complex statements (and the concept of a row number is not present in Redshift either).
The best way to remove duplicates is to write a cron/Quartz job that selects all the distinct rows into a separate table and then renames that table to your original table:
-- temp_originalTable must already exist with the same schema as originalTable
INSERT INTO temp_originalTable (SELECT DISTINCT * FROM originalTable);
DROP TABLE originalTable;
ALTER TABLE temp_originalTable RENAME TO originalTable;
My solution is to run a 'delete' command on the table before 'copy'. In my use case, each load copies the records of a daily snapshot into the Redshift table, so I can run a 'delete' command like the one sketched below to remove any previously loaded records for that day, and then run the 'copy' command.
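A minimal sketch of that delete-then-copy sequence, assuming a table named t_data with a snapshot_day column (both names are illustrative); the S3 path and IAM role are placeholders:

-- remove any rows already loaded for this snapshot day, then reload it
DELETE FROM t_data WHERE snapshot_day = '2017-01-01';

COPY t_data
FROM 's3://your-bucket/daily-snapshot/2017-01-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
DELIMITER ',';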
As user1045047 mentioned, Amazon Redshift doesn't support unique constraints, so I had been looking for a way to delete duplicate records from a table with a DELETE statement. Finally, I found a reasonable way.
Amazon Redshift supports creating an IDENTITY column that stores an auto-generated unique number. http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html
The following SQL is the PostgreSQL pattern for deleting duplicated records using OID as the unique column; you can use the same SQL on Redshift by replacing OID with the identity column.
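A sketch of that PostgreSQL pattern, assuming a table named duplicated_table with columns id and name (all three names are illustrative):

-- keep only the row with the smallest OID for each (id, name) pair
-- OID is available on tables created WITH OIDS (older PostgreSQL versions)
DELETE FROM duplicated_table
WHERE OID > (
  SELECT MIN(OID) FROM duplicated_table d2
  WHERE duplicated_table.id = d2.id
    AND duplicated_table.name = d2.name
);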
Here is an example that I tested on my Amazon Redshift cluster.
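A minimal sketch of such a test, using the auto_id_table name from the CSV below; the id and name columns and the sample values are illustrative:

CREATE TABLE auto_id_table (auto_id INT IDENTITY(0,1), id INT, name VARCHAR(20));

INSERT INTO auto_id_table (id, name) VALUES (1, 'a');
INSERT INTO auto_id_table (id, name) VALUES (1, 'a');  -- duplicate
INSERT INTO auto_id_table (id, name) VALUES (2, 'b');

-- keep only the row with the smallest auto_id for each (id, name) pair
DELETE FROM auto_id_table
WHERE auto_id > (
  SELECT MIN(auto_id) FROM auto_id_table d2
  WHERE auto_id_table.id = d2.id
    AND auto_id_table.name = d2.name
);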
It also works with the COPY command, like this.
auto_id_table.csv
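An illustrative sketch of what such an input file could contain, matching the id and name columns assumed above:

1,a
2,b
3,c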
copy sql
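A sketch of the corresponding COPY statement; the S3 path and IAM role are placeholders. Listing only the non-identity columns lets Redshift populate auto_id automatically:

COPY auto_id_table (id, name)
FROM 's3://your-bucket/auto_id_table.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
DELIMITER ',';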
The advantage of this approach is that you don't need to run DDL statements to deduplicate. However, it doesn't work with existing tables that don't have an identity column, because an identity column cannot be added to an existing table. In that case the only way to delete duplicated records is to migrate all records into a new table, as in user1045047's answer above.
Mmm..
What about just never loading data into your master table directly?
Steps to avoid duplication: stage the new data in a temporary table, remove matching rows from the master table, then insert from the staging table into the master table (see the sketch below).
This is also reasonably fast, and it is the approach recommended by the Redshift docs.
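A sketch of that staging-table pattern, following the merge approach described in the Redshift documentation; master_table, staging, and the id join column are placeholder names, as are the S3 path and IAM role:

BEGIN;

-- stage the new files in a temporary table shaped like the master table
CREATE TEMP TABLE staging (LIKE master_table);

COPY staging
FROM 's3://your-bucket/new-data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
DELIMITER ',';

-- drop any rows that were already loaded, then insert the fresh copies
DELETE FROM master_table
USING staging
WHERE master_table.id = staging.id;

INSERT INTO master_table SELECT * FROM staging;

END;

-- the temp table would also be dropped automatically at the end of the session
DROP TABLE staging;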