Print duplicate count without removing duplicates

2019-09-14 17:58发布

问题:

I am new to working with the Terminal on mac and have a large .tsv file that consists of a list of items, and two values next to it. I would like to be able to print the number of duplicates next to the first occurrence of the item Without removing additional data.

I am aware of cut -f 1 | sort | uniq -c but this removes a lot of valuable data I would like to keep for analysis. I'm reading about awk and grep but I think I could use a little help.

This is an example of the file I'm trying to process:

fruit   number  reference
apple   12  342
apple   13  345
apple   43  4772
banana  19  234
banana  73  3242
peach   131 53423
peach   234 3266
peach   242 324
peach   131 56758
peaches 29  2434

Ideally, the output would look something like this:

fruit   number  reference   fruit_count
apple   12  342 3
apple   13  345 
apple   43  4772    
banana  19  234 2
banana  73  3242    
peach   131 53423   4
peach   234 3266    
peach   242 324 
peach   131 56758   
peaches 29  2434    1

Is something like this even possible? I can get the desired output excel using formulas, but the file is too large and keeps crashing on me. Any help would be appreciated.

EDIT: Adding My current solution (that does not meet my requirements)

cut -f 1 fruitsample.txt | sort | uniq -c | sed -e 's/ *//' -e 's/ / /'

This gives me the intended counts, replacing the standard count + space output from uniq -c with a tab character, but it also sorts the header row and removes the 2nd and third columns.

On Excel, I can use the formula =IF(COUNTIF(A$2:A2,A2)=1,COUNTIF(A:A,A2),"") and fill it down. The file I'm working with is nearly 680K rows of data, and Excel chokes trying to calculate that many rows.

As I mentioned, I am a beginner looking for guidance. I'm just not as familiar with awk or grep. Thanks again!

回答1:

awk to the rescue!

awk 'NR==FNR {a[$1]++; next} 
     FNR==1  {print $0, "fruit_count"; next} 
     $1 in a {$(NF+1)=a[$1]; delete a[$1]}1' file{,} | 
column -t

fruit    number  reference  fruit_count
apple    12      342        3
apple    13      345
apple    43      4772
banana   19      234        2
banana   73      3242
peach    131     53423      4
peach    234     3266
peach    242     324
peach    131     56758
peaches  29      2434       1

for explanation of the main idea I'll use a simpler structure without header, and unsorted data

$ cat file
apple
banana
apple
apple
cherry
banana

$ awk 'NR==FNR {a[$1]++; next}            # in the first pass, save key counts
                $1 in a                   # if the key in map
                        {$(NF+1)=a[$1];   # add the count as a last column
                         delete a[$1]}    # remove key from map
                1                         # print
       ' file{,} |                        # bash shorthand for: file file
  column -t                               # pretty print columns 


apple   3
banana  2
apple
apple
cherry  1
banana

for the simplified example, using unix toolchain you can achieve the same with

join -a1 -11 -22 -o1.2,2.1 <(cat -n file) <(cat -n file | sort -k2 | uniq -c -f1)

adding header will require more juggling; it's where awk shines.



回答2:

Another using awk and double-tacs:

$ tac file | awk '
NR>1 {print q, (p==$1?"":++c)}                  # p previous first field, q previous record
     {c=(p==$1?c+1:0); p=$1; q=$0}              # c is the counter
END  {print q, "fruit_count"}
' | tac
fruit   number  reference fruit_count
apple   12  342 3
apple   13  345
apple   43  4772
banana  19  234 2
banana  73  3242
peach   131 53423 4
peach   234 3266
peach   242 324
peach   131 56758
peaches 29  2434 1


回答3:

This does what you want in a single pass of the input file and only stores the values for 1 fruit at a time in memory so it shouldn't have performance or memory issues despite your file being too big for MS-Excel:

$ cat tst.awk
NR==1 { print $0, "fruit_count"; next }
$1 != prev { prt() }
{ arr[++cnt] = $0; prev = $1 }
END { prt() }
function prt(    i) {
    if (cnt) {
        print arr[1], cnt
        for (i=2; i <= cnt; i++) {
            print arr[i]
        }
        delete arr
        cnt = 0
    }
}

$ awk -f tst.awk file | column -t
fruit    number  reference  fruit_count
apple    12      342        3
apple    13      345
apple    43      4772
banana   19      234        2
banana   73      3242
peach    131     53423      4
peach    234     3266
peach    242     324
peach    131     56758
peaches  29      2434       1