I have 2 CSV files that i need to compare and get the difference to a newly formatted file. The samples are given below.
OLD file
DTL,11111111,1111111111111111,11111111111,Y,N,xx,xx
DTL,22222222,2222222222222222,22222222222,Y,Y,cc,cc
DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd
DTL,44444444,4444444444444444,44444444444,Y,Y,ss,ss
DTL,55555555,5555555555555555,55555555555,Y,Y,qq,qq
NEW file
DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx
DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc
DTL,44444444,4444444444444444,44444444444,Y,Y,ss,ss
DTL,55555555,5555555555555555,55555555555,Y,Y,qq,qq
DTL,77777777,7777777777777777,77777777777,N,N,ee,ee
Output file
I want to compare the old and new CSV files and to find the changes that has effected in the new file and UPDATE a FLAG to denote these changes
U - if the new file record is UPDATED D - if a record existing in the old file is deleted in the new file N - if a record existing in the new file is not available in the old file
the sample output file is this.
DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx U
DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc U
DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd D
DTL,77777777,7777777777777777,77777777777,N,N,ee,ee N
I used diff command but it will repeat the UPDATED record too which is not I want.
DTL,11111111,1111111111111111,11111111111,Y,N,xx,xx
DTL,22222222,2222222222222222,22222222222,Y,Y,cc,cc
DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd
---
DTL,11111111,1111111111111111,11111111111,Y,Y,xx,xx
DTL,22222222,2222222222222222,22222222222,Y,N,cc,cc
5a5
DTL,77777777,7777777777777777,77777777777,N,N,ee,ee
I used an AWK single line command to filter out my records as well
awk 'NR==FNR{A[$1];next}!($1 in A)' FS=: old.csv new.csv
the problem with this is is doesnt get me the records only belonging to the OLD file. which is
DTL,33333333,3333333333333333,33333333333,Y,Y,dd,dd
I initiated an driven bash script as well to ahieve this but didnt find much help with a good example.
myscript.awk
BEGIN {
FS = "," # input field seperator
OFS = "," # output field seperator
}
NR > 1 {
#flag
# N - new record D- Deleted U - Updated
id = $1
name = $2
flag = 'N'
# This prints the columns in the new order. The commas tell Awk to use the character set in OFS
print id,name,flag
}
>> awk -f myscript.awk old.csv new.csv > formatted.csv
This might work for you:
an awk solution along the same lines:
A good starting point would probably be:
This outputs:
Meaning that it Added a record on line 5 (5a) and changed the records on lines 1 and 3 (1,3c).
If you can't use this format as-is (which would be good to use a standard) then you would need to write a script which converts it to the format that you describe.