I would appreciate any help with this scripting task.
I need to remove every block with a non-unique ID except for one that has the newest date. If the dates are equal, than the last entry within the file should win and be kept undeleted.
The original sorting order of the input has to be preserved.
Input:
<DATA>
<TABLES>
<BLOCK>
<ID V="333"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>
<BLOCK>
<TEXT/>
<TEXT/>
<ID V="4444"/>
<DATE V="20140101 00:00:00"/>
<TEXT/>
<TEXT/>
</BLOCK>
<BLOCK>
<ID V="333"/>
<DATE V="20100101 00:00:00"/>
<TEXT/>
</BLOCK>
<BLOCK>
<TEXT/>
<ID V="4444"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>
<BLOCK>
<TEXT/>
<ID V="7777777"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>
<BLOCK>
<ID V="333"/>
<DATE V="20120101 00:00:00"/>
<TEXT/>
</BLOCK>
<BLOCK>
<TEXT/>
<TEXT/>
<ID V="22"/>
<TEXT/>
<DATE V="20151231 00:00:00"/>
</BLOCK>
<BLOCK>
<TEXT/>
<ID V="7777777"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>
<BLOCK>
<TEXT/>
<ID V="22"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>
</TABLES>
</DATA>
Expected output:
<DATA>
<TABLES>
<BLOCK>
<ID V="333"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>
<BLOCK>
<TEXT/>
<ID V="4444"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>
<BLOCK>
<TEXT/>
<TEXT/>
<ID V="22"/>
<TEXT/>
<DATE V="20151231 00:00:00"/>
</BLOCK>
<BLOCK>
<TEXT/>
<ID V="7777777"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>
</TABLES>
</DATA>
It's not entirely clear from your question what output order you want as mentioned in my comment under your question, but this is one interpretation - it will loop through the records in the order they appeared in the input file and print each record only if it were the last one in the file that contained the max date for an id. It will work in any awk on any UNIX system.
$ cat tst.awk
BEGIN { RS=""; ORS="\n\n" }
{
id = date = $0
gsub(/.*\n<ID V="|".*/,"",id)
gsub(/.*\n<DATE V="|".*/,"",date)
}
date >= id2maxDate[id] {
delete maxDateRecNr2rec[id2maxDateRecNr[id]]
id2maxDateRecNr[id] = NR
maxDateRecNr2rec[NR] = $0
id2maxDate[id] = date
}
END {
for (recNr=1; recNr<=NR; recNr++) {
if ( recNr in maxDateRecNr2rec ) {
print maxDateRecNr2rec[recNr]
}
}
}
.
$ awk -f tst.awk file
<BLOCK>
<TEXT/>
<ID V="4444"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>
<BLOCK>
<ID V="333"/>
<DATE V="20120101 00:00:00"/>
<TEXT/>
</BLOCK>
<BLOCK>
<TEXT/>
<TEXT/>
<ID V="22"/>
<TEXT/>
<DATE V="20151231 00:00:00"/>
</BLOCK>
<BLOCK>
<TEXT/>
<ID V="7777777"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>
You say date
in your question but I'm assuming by that you really mean whatever is in the DATE
field of your input so, it doesn't matter for the example you posted since all the times are midnight, but the above uses the date+time, i.e. the entire contents of the DATE
field. If you want the time of day to be excluded from the calculations then just change:
gsub(/.*\n<DATE V="|".*/,"",date)
to
gsub(/.*\n<DATE V="| .*/,"",date)
awk solution tested with awk 3.1.8 - outputs newest record for each ID - preserves relative order of records in input
awk -F'\n' '
BEGIN{
ORS=RS="\n</BLOCK>"
}
{
id=date=""
for(i=1; i <=NF; ++i) {
if(id && date) break
if($i ~ /<ID V=\"[0-9]+\"\/>/)
id=$i
else if($i ~ /<DATE V=\"[0-9]+ [0-9][0-9]:[0-9][0-9]:[0-9][0-9]\"\/>/)
date=$i
}
if(!id) next
ids[NR]=id
if(id in recs && date < newest[id]) next
recs[id]=$0; newest[id]=date; order[id]=NR
}
END {
asort(order)
for(i=1; i<=length(order); ++i) print recs[ids[order[i]]]
printf("\n")
}
'
output for provided sample input
<BLOCK>
<TEXT/>
<ID V="4444"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>
<BLOCK>
<ID V="333"/>
<DATE V="20120101 00:00:00"/>
<TEXT/>
</BLOCK>
<BLOCK>
<TEXT/>
<TEXT/>
<ID V="22"/>
<TEXT/>
<DATE V="20151231 00:00:00"/>
</BLOCK>
<BLOCK>
<TEXT/>
<ID V="7777777"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>