Is there a way to 'uniq' by column?

Posted 2019-01-04 16:32

I have a .csv file like this:

stack2@example.com,2009-11-27 01:05:47.893000000,example.net,127.0.0.1
overflow@example.com,2009-11-27 00:58:29.793000000,example.net,255.255.255.0
overflow@example.com,2009-11-27 00:58:29.646465785,example.net,256.255.255.0
...

I have to remove lines with duplicate e-mails from the file, dropping the entire line (i.e. all but one of the lines containing overflow@example.com in the example above). How do I use uniq on field 1 only (fields are separated by commas)? According to man, uniq doesn't have options for columns.

I tried something with sort | uniq but it doesn't work.

8 Answers
成全新的幸福
Answer #2 · 2019-01-04 16:58
awk -F"," '!_[$1]++' file
  • -F sets the field separator.
  • $1 is the first field.
  • _[val] looks up val in the hash _ (a regular awk variable used here as an associative array).
  • ++ increments and returns the old value.
  • ! is logical NOT, so the expression is true only the first time a given $1 is seen.
  • A true pattern with no action triggers awk's implicit print (see the sample run below).
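Applied to the question's sample (assuming it is saved as file), this keeps the first line seen for each address and preserves input order:

$ awk -F"," '!_[$1]++' file
stack2@example.com,2009-11-27 01:05:47.893000000,example.net,127.0.0.1
overflow@example.com,2009-11-27 00:58:29.793000000,example.net,255.255.255.0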
我欲成王,谁敢阻挡
Answer #3 · 2019-01-04 16:58

If you sort the file with sort first, you can then apply uniq.

sort orders the file just fine:

$ cat test.csv
overflow@domain2.com,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
overflow@domain2.com,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0 
stack2@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1

$ sort test.csv
overflow@domain2.com,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0 
overflow@domain2.com,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1

$ sort test.csv | uniq
overflow@domain2.com,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0 
overflow@domain2.com,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1

Note that uniq only removes fully identical lines: both overflow@domain2.com lines survive above because their timestamps differ. To dedupe on the first field alone, you can do some AWK magic:

$ awk -F, '{ lines[$1] = $0 } END { for (l in lines) print lines[l] }' test.csv
stack2@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
overflow@domain2.com,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0 
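The for (l in lines) loop visits keys in an unspecified order, which is why the output above is shuffled. A minimal sketch that preserves input order while still keeping the last occurrence of each address:

awk -F, '{ if (!($1 in lines)) order[++n] = $1; lines[$1] = $0 }
         END { for (i = 1; i <= n; i++) print lines[order[i]] }' test.csv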
老娘就宠你
Answer #4 · 2019-01-04 16:59

Simpler than isolating the column with awk: if you need to remove every line with a certain value in a given field, why not just use grep -v?

e.g. to delete every line with the value "col2" in the second field of a line like col1,col2,col3,col4:

grep -v ',col2,' file > file_minus_offending_lines

If this isn't good enough, because some lines may get improperly stripped when the matching value shows up in a different column, you can do something like this:

Use awk to isolate the offending column, e.g.

awk -F, '{print $2 "|" $0}'

-F sets the field delimiter to ",", $2 is column 2, and $0 is the entire line, so each line is printed as its second column, a custom delimiter, then the full line. You can then filter by removing lines that begin with the offending value:

awk -F, '{print $2 "|" $0}' | grep -v ^BAD_VALUE

and then strip out the stuff before the delimiter:

awk -F, '{print $2 "|" $0}' | grep -v ^BAD_VALUE | sed 's/.*|//g'

(Note: the sed command is sloppy because it doesn't escape special characters, and the pattern should really be something like [^|]*, i.e. anything up to the first delimiter, so that a | later in the line isn't swallowed too. But hopefully this is clear enough.)
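With those fixes applied, the whole pipeline tightens to this sketch (BAD_VALUE stands in for the value to drop, and it assumes the data itself contains no | characters):

awk -F, '{print $2 "|" $0}' file | grep -v '^BAD_VALUE|' | sed 's/^[^|]*|//'

Anchoring grep with the trailing | keeps a value like BAD_VALUE2 from matching as well.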

我命由我不由天
Answer #5 · 2019-01-04 17:02

Or, if you want to use uniq:

<mycsv.csv tr -s ',' ' ' | awk '{print $3" "$2" "$1}' | uniq -c -f2

gives:

1 01:05:47.893000000 2009-11-27 stack2@domain.com
2 00:58:29.793000000 2009-11-27 overflow@domain2.com
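Here -c prefixes each group with its count, and -f2 makes uniq skip the first two whitespace-separated fields (time and date) when comparing, so only the e-mail is examined; since uniq only collapses adjacent duplicates, sort first if matching addresses aren't next to each other. The intermediate stream looks roughly like this (a sketch based on the sample data):

$ <mycsv.csv tr -s ',' ' ' | awk '{print $3" "$2" "$1}'
01:05:47.893000000 2009-11-27 stack2@domain.com
00:58:29.793000000 2009-11-27 overflow@domain2.com
00:58:29.646465785 2009-11-27 overflow@domain2.com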
男人必须洒脱
Answer #6 · 2019-01-04 17:03

To consider multiple columns:

Sort and keep one line per unique combination of column 1 and column 3:

sort -u -t : -k 1,1 -k 3,3 test.txt
  • -t : sets colon as the separator
  • -k 1,1 -k 3,3 restricts the keys to column 1 and column 3 (see the example below)
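For example, with a hypothetical colon-separated test.txt (GNU sort -u keeps the first input line of each run of equal keys):

$ cat test.txt
a:1:x:foo
a:2:x:bar
a:3:y:baz

$ sort -u -t : -k 1,1 -k 3,3 test.txt
a:1:x:foo
a:3:y:baz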
男人必须洒脱
Answer #7 · 2019-01-04 17:06
sort -u -t, -k1,1 file
  • -u for unique
  • -t, sets comma as the delimiter
  • -k1,1 sorts on key field 1 only

Test result:

overflow@domain2.com,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0 
stack2@domain.com,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1 