I would like to compare consecutive rows in a big file (~1 GB) using awk or Python. Because the files are large, I would prefer awk. Here is an example of the input and the desired output:
Input file:
#x y
1 11 # Remarks (not part of the input file)
10 12 # (Remark *1)
10 17 #
4 14
20 15 # (Remark *2)
20 16 #
20 17 #
20 22 #
5 19
10 20
(Remark *1): the x-value of this row is the same as the x-value of the following row, so either this line or the next line (chosen at random) should be printed to the output file.
(Remark *2): the x-value of this row is the same as the x-value of the next 3 lines, so either this line or ONE of the next 3 lines (chosen at random) should be printed to the output file.
The output file I want looks like this:
#x y
1 11
10 17
4 14
20 17
5 19
10 20
or, since the selection is random whenever the same x-value appears in consecutive rows:
#x y
1 11
10 12
4 14
20 16
5 19
10 20
Basically, I want to check whether the x-value of the current row is the same as the x-value of the following row(s). If not, the current line should be printed. If yes, exactly one line should be selected at random from the run of consecutive rows with the same x-value (the y-values do not matter for the comparison).
I hope somebody can help me!
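To make the rule above concrete, here is a minimal single-pass sketch in Python. It is only one possible approach, not a definitive solution. It assumes that fields are whitespace-separated, that the x-value is the first field, that x-values are compared as text (so `10` and `10.0` would count as different), and that lines starting with `#` are headers to be passed through unchanged. The function name `pick_one_per_run` is made up for illustration. It keeps just one candidate line per run in memory (reservoir sampling with a reservoir of size 1), so the file size should not matter.

```python
import random
from typing import Iterable, Iterator, Optional


def pick_one_per_run(lines: Iterable[str],
                     rng: Optional[random.Random] = None) -> Iterator[str]:
    """Yield one randomly chosen line from each run of consecutive lines
    that share the same first field (the x-value)."""
    rng = rng or random.Random()
    current_key = None   # x-value of the run being read
    chosen = None        # line currently selected for that run
    count = 0            # number of lines seen in that run so far

    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            # Assumption: header/comment/blank lines end the current run
            # and are copied through unchanged.
            if chosen is not None:
                yield chosen
                current_key, chosen, count = None, None, 0
            yield line
            continue

        key = stripped.split()[0]
        if chosen is not None and key == current_key:
            # Same x-value as the previous row: replace the kept line with
            # probability 1/count, so every line of the run is equally likely.
            count += 1
            if rng.randrange(count) == 0:
                chosen = line
        else:
            # The x-value changed: emit the line kept for the previous run.
            if chosen is not None:
                yield chosen
            current_key, chosen, count = key, line, 1

    # Emit the line kept for the final run.
    if chosen is not None:
        yield chosen
```

Because the function works on any iterable of lines, it could be fed an open file object and its result passed to `sys.stdout.writelines(...)`, which reads and writes the data as a stream instead of loading the whole file. Passing a seeded `random.Random` makes the selection reproducible.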