I am using Data factory to create our staging area, the problem is whenever source data changes, we add a new row to staging tables.
For instance, assume we have the following data:
ID Fields created edited
100 ---------- '2017-07-01' '2017-07-05'
this will be stored in our staging tables like this:
ID Fields created edited
100 ---------- '2017-07-01' null
100 ---------- '2017-07-01' '2017-07-05'
Selecting the most recent row is expensive and We don't want that. How do you think we can avoid storing duplicate IDs in staging?
I assume on creating the pipelines, there should be a way to update the data if the ID already exists in staging.
the query format in data factory is like this:
$$Text.Format('select * from <<table>> where <<column>> >= \'{0:yyyy-MM-dd HH:mm}\' AND <<column>> < \'{1:yyyy-MM-dd HH:mm}\'', WindowStart, WindowEnd)