I have at hand a file looking like this (delimited by tabs, 2 fields):
denovo0 90.2
denovo1 97.7
denovo1 97.7
denovo1 96.9
denovo10 93.8
denovo10 92.2
denovo10 91.5
denovo100 95.3
denovo100 95.3
denovo100 94.6
And I would like to retain only unique strings in the first field that have the lowest value in the second column to have:
denovo0 90.2
denovo1 96.9
denovo10 91.5
denovo100 94.6
As is can be seen in the upper example, some rows in the file may be complete duplicates of other rows, I am not sure how that would influence solutions.
I have looked up similar solutions on StackOverflow, e.g: Uniq in awk; removing duplicate values in a column using awk , but was not able to adopt them.
I would be happy if someone could help.
I'd prefer using AWK but BASH would also be an option. I am working with MacOSX Yosemite.
I would be really happy if someone could help out.
Thank you and kind regards,
Paul
You can get the results you show with:
awk '{if (!($1 in a)) a[$1] = $2} END { for (key in a) print key, a[key] }'
Output:
denovo0 90.2
denovo1 97.7
denovo10 93.8
denovo100 95.3
For the results described (minimum value in column 2 for each key in column 1), you can use:
awk '{ if (!($1 in a)) a[$1] = $2; else if (a[$1] > $2) a[$1] = $2 }
END { for (key in a) print key, a[key] }'
Output:
denovo0 90.2
denovo1 96.9
denovo10 91.5
denovo100 94.6
You can also get the sample output in the question by looking for the maximum value for each key; it so happens that the maximum value is also the first for each key in the sample data.
awk '{ if (!($1 in a)) a[$1] = $2; else if (a[$1] < $2) a[$1] = $2 }
END { for (key in a) print key, a[key] }'
Output:
denovo0 90.2
denovo1 97.7
denovo10 93.8
denovo100 95.3
I assume that you intend to have the output show the lowest value associated with each string that appears in the first column. I also assume that the input is sorted in the first column (such as is shown in your example).
This approach keeps the output in the same order as it appeared in the input:
$ awk '$1==last{if ($2<min)min=$2;next} last{print last"\t"min} {last=$1;min=$2} END{print last"\t"min}' file
denovo0 90.2
denovo1 96.9
denovo10 91.5
denovo100 94.6
This will work with large data sets because it does not require all the output to be kept in memory at once. However, if your input file is not sorted on the first column, use Jonathan Leffler's approach.
This should be simple using awk like:
awk '{if(!($1 in a)){print; a[$1]++}}' myfile
if the first field ($1) is not in the a array, print the line and add the 1st field to a. Next time we see that field, it will be in the array and so will not be printed.
Even simpler solution using unix pipe.
sort -k1,1 -k2,2n inputFile | awk '!seen[$1]++' > outputFile