Eliminate duplicates and Insert Unique records hav

2019-09-08 07:38发布

enter image description here

I have an excel file which gets updated on a daily basis i.e the data is always different every time.

I am pulling the data from the Excel sheet into the table using Talend. I have a primary key Company_ID defined in the table.

The error I am facing is that the Excel sheet has few duplicate Company_ID values. It will also pick up more duplicate values in the future as the Excel file will be updated daily.

I want to choose the first record where the Company ID field is 1 and the record doesn't have null in the rest of the columns. Also, for a Company_ID of 3 there is a null value for one column which is ok since it is a unique record for that company_id.

How do I choose a unique row which has maximum no. of column values present ie for eg in the case of Company ID of 1 in Talend ?

1条回答
聊天终结者
2楼-- · 2019-09-08 08:16

tUniqRow is usually the easiest way to handle duplicates. If you are worried that the first row coming to tUniqRow may not be the first row that you want there, you can sort your rows, so they enter tUniqRow in your preferred order:

DIAGRAM (used components: tFileInputExcel, tJavaRow, tSortRow, tUniqRow, tFilterColumns)

In your particular case, the tJava could look like this:

// Code generated according to input schema and output schema
output_row.company_id = input_row.company_id;
output_row.name       = input_row.name;
output_row.et_cetera  = input_row.et_cetera;
// End of pre-generated code

int i = 0;
if (input_row.company_id == null) { i++; }
if (input_row.name       == null) { i++; }
if (input_row.et_cetera  == null) { i++; }
output_row.priority = i;
查看更多
登录 后发表回答