Sort and remove duplicates based on column

Question:

I have a text file:

$ cat text
542,8,1,418,1
542,9,1,418,1
301,34,1,689070,1
542,9,1,418,1
199,7,1,419,10

I'd like to sort the file on the first column and remove duplicate lines using sort, but things are not going as expected.

Approach 1

$ sort -t, -u -b -k1n text
542,8,1,418,1
542,9,1,418,1
199,7,1,419,10
301,34,1,689070,1

It is not sorting based on the first column.

Approach 2

$ sort -t, -u -b -k1n,1n text
199,7,1,419,10
301,34,1,689070,1
542,8,1,418,1

It removes the 542,9,1,418,1 line but I'd like to keep one copy.

It seems that the first approach removes duplicates but does not sort correctly, whereas the second one sorts correctly but removes more than I want. How do I get the correct result?
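
For reference, the output I am after is the file sorted on the first column with only the exact duplicate removed:

199,7,1,419,10
301,34,1,689070,1
542,8,1,418,1
542,9,1,418,1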

Answer 1:

The problem is that when you give sort a key, the -u option looks for unique occurrences of that key, not of the whole line. Since the line 542,8,1,418,1 has already been kept, sort treats the two remaining lines starting with 542 as duplicates of it and filters them out.
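
If you have GNU coreutils, one way to see this for yourself is sort's --debug option, which prints each output line followed by underscores marking the part of the line actually used for comparison:

sort --debug -t, -u -k1,1n text

With -k1,1n only the first field gets underlined, which makes it clear why -u considers every line starting with 542 a duplicate.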

Your best bet is either to sort on all the columns:

sort -t, -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 -u text
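
On the sample file this should keep one copy of each distinct line while still sorting numerically:

199,7,1,419,10
301,34,1,689070,1
542,8,1,418,1
542,9,1,418,1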

or

use awk to filter out duplicate lines and pipe the result to sort:

awk '!_[$0]++' text | sort -t, -nk1,1
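
Here _ is just a (terse) array name: !_[$0]++ is true only the first time a given line is seen, so awk prints each line exactly once. A more explicit spelling of the same filter, using seen as the array name:

awk '{ if (!seen[$0]++) print }' text | sort -t, -nk1,1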


Answer 2:

When you sort on a key, you should also specify where the key ends (-k1,1 rather than -k1); otherwise the key extends to the end of the line, and all the following fields take part in the comparison as well.

Bounding the key fixes the ordering, but combining it with -u brings back the problem from Approach 2: sort compares only the first field, so one of the 542 lines is dropped. Keep the bounded key for sorting and deduplicate on the whole line instead:

sort -t, -k1,1n text | uniq

Without -u, sort falls back to a whole-line comparison when keys are equal, so identical lines end up adjacent and uniq removes the exact duplicate.
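
On the sample file this should print:

199,7,1,419,10
301,34,1,689070,1
542,8,1,418,1
542,9,1,418,1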