bash, awk, sed remove XML blocks with duplicate ID

2019-09-11 05:22发布

问题:

I would appreciate any help with this scripting task.

I need to remove every block with a non-unique ID except for one that has the newest date. If the dates are equal, than the last entry within the file should win and be kept undeleted.

The original sorting order of the input has to be preserved.

Input:

<DATA>
<TABLES>

<BLOCK>
<ID V="333"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<TEXT/>
<ID V="4444"/>
<DATE V="20140101 00:00:00"/>
<TEXT/>
<TEXT/>
</BLOCK>

<BLOCK>
<ID V="333"/>
<DATE V="20100101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="4444"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="7777777"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<ID V="333"/>
<DATE V="20120101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<TEXT/>
<ID V="22"/>
<TEXT/>
<DATE V="20151231 00:00:00"/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="7777777"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="22"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>

</TABLES>
</DATA>

Expected output:

<DATA>
<TABLES>

<BLOCK>
<ID V="333"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="4444"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<TEXT/>
<ID V="22"/>
<TEXT/>
<DATE V="20151231 00:00:00"/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="7777777"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>

</TABLES>
</DATA>

回答1:

It's not entirely clear from your question what output order you want as mentioned in my comment under your question, but this is one interpretation - it will loop through the records in the order they appeared in the input file and print each record only if it were the last one in the file that contained the max date for an id. It will work in any awk on any UNIX system.

$ cat tst.awk
BEGIN { RS=""; ORS="\n\n" }
{
    id = date = $0
    gsub(/.*\n<ID V="|".*/,"",id)
    gsub(/.*\n<DATE V="|".*/,"",date)
}

date >= id2maxDate[id] {
    delete maxDateRecNr2rec[id2maxDateRecNr[id]]
    id2maxDateRecNr[id]  = NR
    maxDateRecNr2rec[NR] = $0
    id2maxDate[id]       = date
}

END {
    for (recNr=1; recNr<=NR; recNr++) {
        if ( recNr in maxDateRecNr2rec ) {
            print maxDateRecNr2rec[recNr]
        }
    }
}

.

$ awk -f tst.awk file
<BLOCK>
<TEXT/>
<ID V="4444"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<ID V="333"/>
<DATE V="20120101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<TEXT/>
<ID V="22"/>
<TEXT/>
<DATE V="20151231 00:00:00"/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="7777777"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>

You say date in your question but I'm assuming by that you really mean whatever is in the DATE field of your input so, it doesn't matter for the example you posted since all the times are midnight, but the above uses the date+time, i.e. the entire contents of the DATE field. If you want the time of day to be excluded from the calculations then just change:

    gsub(/.*\n<DATE V="|".*/,"",date)

to

    gsub(/.*\n<DATE V="| .*/,"",date)


回答2:

awk solution tested with awk 3.1.8 - outputs newest record for each ID - preserves relative order of records in input

awk -F'\n' '
BEGIN{                                          
  ORS=RS="\n</BLOCK>"                           
}
{
  id=date=""                                    
  for(i=1; i <=NF; ++i) {                       
    if(id && date) break
    if($i ~ /<ID V=\"[0-9]+\"\/>/)              
      id=$i
    else if($i ~ /<DATE V=\"[0-9]+ [0-9][0-9]:[0-9][0-9]:[0-9][0-9]\"\/>/)
      date=$i
    }
    if(!id) next
    ids[NR]=id
    if(id in recs && date < newest[id]) next
    recs[id]=$0; newest[id]=date; order[id]=NR
  }
END {
  asort(order)
  for(i=1; i<=length(order); ++i) print recs[ids[order[i]]]
  printf("\n")
}
'                                                                           

output for provided sample input

<BLOCK>
<TEXT/>
<ID V="4444"/>
<TEXT/>
<TEXT/>
<DATE V="20160101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<ID V="333"/>
<DATE V="20120101 00:00:00"/>
<TEXT/>
</BLOCK>

<BLOCK>
<TEXT/>
<TEXT/>
<ID V="22"/>
<TEXT/>
<DATE V="20151231 00:00:00"/>
</BLOCK>

<BLOCK>
<TEXT/>
<ID V="7777777"/>
<TEXT/>
<TEXT/>
<DATE V="20130101 00:00:00"/>
<TEXT/>
</BLOCK>                                                                    


标签: bash xslt awk sed