Compare consecutive rows in awk/(or python) and ra

Posted 2020-07-28 00:04

Question:

I would like to compare consecutive rows in a big file (~1GB) using awk or Python (since the files are big, I would prefer awk). Here is an example of the input and the desired output:

Input file

#x   y
1    11        # Remarks (not part of the input file)  
10   12        # (Remark *1)
10   17        #
4    14
20   15        # (Remark *2)
20   16        #
20   17        #
20   22        #
5    19
10   20

(Remark *1): since the x-value of this row and the x-value of the next row are the same, either this line or the next line (chosen at RANDOM) should be printed to the output file.

(Remark *2): since the x-value of this row and the x-values of the next 3 lines are the same, either this line or ONE of the next 3 lines (chosen at RANDOM) should be printed to the output file.

The output file I want looks like this:

#x   y
1    11
10   17
4    14
20   17
5    19
10   20

or like this (the selection is random whenever the same x-value appears in consecutive rows):

#x   y
1    11
10   12
4    14
20   16
5    19
10   20

Basically, I want to check whether the x-value of the current line is the same as the x-value of the lines that immediately follow it. If not, the current line should be printed. If so, only one randomly selected line from the run of consecutive lines with the same x-value should be printed (the y-values are not used in the comparison).

I hope somebody can help me!

Answer 1:

$ cat tst.awk
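# Print one randomly chosen line from the buffered run, then reset the buffer.
# idx is a local variable (declared as an extra, unused parameter).
function prtBuf(        idx) {
    if (cnt > 0) {
        idx = int((rand() * cnt) + 1)    # uniform integer in 1..cnt
        print buf[idx]
    }
    cnt = 0
}

BEGIN { srand() }                 # seed from the time of day so runs differ
$1 != prev { prtBuf() }           # x-value changed: flush the previous run
{ buf[++cnt]=$0; prev=$1 }        # buffer the current line and remember its x
END { prtBuf() }                  # flush the final run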

$ awk -f tst.awk file
1    11        # Remarks (not part of the input file)
10   17        #
4    14
20   17        #
5    19
10   20

$ awk -f tst.awk file
1    11        # Remarks (not part of the input file)
10   12        # (Remark *1)
4    14
20   22        #
5    19
10   20

I assumed the x and y column headers in your example weren't actually part of your input file, so I removed them. If they do exist and you want them in the output, just add an NR==1{print;next} line at the top of the script.
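
Since the question also allows Python, here is a rough equivalent of the same idea, offered as a minimal sketch rather than a tuned solution. The function name pick_one_per_run and the rng parameter are made up for illustration. It assumes the x-value is the first whitespace-separated field, and, like the awk script, it only buffers one run of equal x-values at a time, so it should cope with a large file as long as no single run is huge.

```python
import random
from itertools import groupby


def pick_one_per_run(lines, rng=random):
    """Yield one randomly chosen line from each run of consecutive lines
    whose first whitespace-separated field (the x-value) is the same.

    `rng` is a hypothetical hook for passing a seeded random.Random
    instance; by default the module-level generator is used.
    """
    def x_value(line):
        fields = line.split()
        return fields[0] if fields else ""  # blank lines share the key ""

    # groupby only groups *adjacent* items with equal keys, which is
    # exactly the "consecutive rows" behaviour asked for.
    for _, run in groupby(lines, key=x_value):
        yield rng.choice(list(run))  # buffers one run, not the whole file


if __name__ == "__main__":
    # Small in-memory demo using the data from the question.
    sample = [
        "#x   y\n", "1    11\n", "10   12\n", "10   17\n", "4    14\n",
        "20   15\n", "20   16\n", "20   17\n", "20   22\n", "5    19\n",
        "10   20\n",
    ]
    for line in pick_one_per_run(sample):
        print(line, end="")
```

To run it on a real file, pass an open file object (for example, the result of open("file")) instead of the sample list; file objects yield lines lazily, so the whole file is never held in memory. In this sketch a "#x   y" header line forms its own single-line run because its first field differs from the row below it, so it passes through unchanged.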