I am new to working with the Terminal on Mac and have a large .tsv file consisting of a list of items with two values next to each item. I would like to print the number of duplicates next to the first occurrence of each item, without removing any of the existing data.
I am aware of cut -f 1 | sort | uniq -c, but this removes a lot of valuable data I would like to keep for analysis. I have been reading about awk and grep, but I think I could use a little help.
This is an example of the file I'm trying to process:
fruit number reference
apple 12 342
apple 13 345
apple 43 4772
banana 19 234
banana 73 3242
peach 131 53423
peach 234 3266
peach 242 324
peach 131 56758
peaches 29 2434
Ideally, the output would look something like this:
fruit number reference fruit_count
apple 12 342 3
apple 13 345
apple 43 4772
banana 19 234 2
banana 73 3242
peach 131 53423 4
peach 234 3266
peach 242 324
peach 131 56758
peaches 29 2434 1
Is something like this even possible? I can get the desired output in Excel using formulas, but the file is too large and keeps crashing on me. Any help would be appreciated.
EDIT: Adding my current solution (which does not meet my requirements):
cut -f 1 fruitsample.txt | sort | uniq -c | sed -e 's/ *//' -e 's/ / /'
This gives me the intended counts, replacing the standard count + space output from uniq -c with a tab character, but it also sorts the header row and removes the second and third columns.
On Excel, I can use the formula =IF(COUNTIF(A$2:A2,A2)=1,COUNTIF(A:A,A2),"")
and fill it down. The file I'm working with is nearly 680K rows of data, and Excel chokes trying to calculate that many rows.
As I mentioned, I am a beginner looking for guidance. I'm just not as familiar with awk or grep. Thanks again!
awk to the rescue!
awk 'NR==FNR {a[$1]++; next}
FNR==1 {print $0, "fruit_count"; next}
$1 in a {$(NF+1)=a[$1]; delete a[$1]}1' file{,} |
column -t
fruit number reference fruit_count
apple 12 342 3
apple 13 345
apple 43 4772
banana 19 234 2
banana 73 3242
peach 131 53423 4
peach 234 3266
peach 242 324
peach 131 56758
peaches 29 2434 1
To explain the main idea, I'll use a simpler file with no header and unsorted data:
$ cat file
apple
banana
apple
apple
cherry
banana
$ awk 'NR==FNR {a[$1]++; next}    # in the first pass, save key counts
       $1 in a {$(NF+1)=a[$1];    # if the key is in the map, add the count as the last column
                delete a[$1]}     # remove key from map
       1                          # print
      ' file{,} |                 # bash shorthand for: file file
  column -t                       # pretty-print columns
apple 3
banana 2
apple
apple
cherry 1
banana
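The file{,} at the end is not awk syntax; it is bash (and zsh) brace expansion, which simply expands to the same filename twice so awk reads the file in two passes. You can see what the shell actually passes to awk with echo:

$ echo file{,}
file file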
For the simplified example, you can also achieve the same result with the standard Unix toolchain:
join -a1 -11 -22 -o1.2,2.1 <(cat -n file) <(cat -n file | sort -k2 | uniq -c -f1)
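If that one-liner looks dense, running its pieces separately helps; roughly (intermediate output shown for the simplified file, with whitespace compressed):

$ cat -n file                            # number each line: 1 apple, 2 banana, 3 apple, ...
$ cat -n file | sort -k2 | uniq -c -f1   # sort by fruit and count, keeping each fruit's first line number
      3      1  apple
      2      2  banana
      1      5  cherry

join then matches each original line number (field 1 of the first input) against those first-occurrence line numbers (field 2 of the second input) and prints the fruit (1.2) plus, where there is a match, the count (2.1); -a1 keeps the lines that have no match.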
Adding the header back would require more juggling; that's where awk shines.
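One practical note: the two-pass awk command above splits on any whitespace and prints space-separated output (column -t only aligns it for display). Since your real file is a .tsv, you may want the result to stay genuinely tab-separated; a minimal variant, assuming tab is the only delimiter in your data (the output filename is just a placeholder), is:

awk 'BEGIN {FS=OFS="\t"}
     NR==FNR {a[$1]++; next}
     FNR==1 {print $0, "fruit_count"; next}
     $1 in a {$(NF+1)=a[$1]; delete a[$1]} 1' file{,} > with_counts.tsv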
Another approach uses awk and a double tac: the first tac feeds the file to awk in reverse order, so the running count for each fruit is complete by the time its first (original) occurrence comes around, and the second tac restores the original order.
$ tac file | awk '
NR>1 {print q, (p==$1?"":++c)} # p previous first field, q previous record
{c=(p==$1?c+1:0); p=$1; q=$0} # c is the counter
END {print q, "fruit_count"}
' | tac
fruit number reference fruit_count
apple 12 342 3
apple 13 345
apple 43 4772
banana 19 234 2
banana 73 3242
peach 131 53423 4
peach 234 3266
peach 242 324
peach 131 56758
peaches 29 2434 1
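If tac is new to you, it is essentially cat in reverse, printing the last line first; the second tac just flips the result back:

$ printf 'a\nb\nc\n' | tac
c
b
a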
This does what you want in a single pass of the input file, and it only stores the values for one fruit at a time in memory, so it shouldn't have performance or memory issues despite your file being too big for MS Excel (it does assume rows for the same fruit are grouped together, as in your sample):
$ cat tst.awk
NR==1      { print $0, "fruit_count"; next }  # print the header with the new column name
$1 != prev { prt() }                          # first field changed: flush the previous group
           { arr[++cnt] = $0; prev = $1 }     # buffer the current row
END        { prt() }                          # flush the last group

# print a buffered group: the count goes on the first row only
function prt( i) {
    if (cnt) {
        print arr[1], cnt
        for (i=2; i <= cnt; i++) {
            print arr[i]
        }
        delete arr
        cnt = 0
    }
}
$ awk -f tst.awk file | column -t
fruit number reference fruit_count
apple 12 342 3
apple 13 345
apple 43 4772
banana 19 234 2
banana 73 3242
peach 131 53423 4
peach 234 3266
peach 242 324
peach 131 56758
peaches 29 2434 1
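A final note on output format: if you want the result as a real .tsv rather than the space-aligned view from column -t, you can tell awk to use a tab when it appends the new column; since tst.awk never modifies the existing fields, the original tabs in each row pass through unchanged (the output filename here is just a placeholder):

awk -v OFS='\t' -f tst.awk file > fruit_counts.tsv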