可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have at hand a file looking like this (delimited by tabs, 2 fields):

    denovo0  90.2
    denovo1  97.7
    denovo1  97.7
    denovo1  96.9
    denovo10     93.8
    denovo10     92.2
    denovo10     91.5
    denovo100    95.3
    denovo100    95.3
    denovo100    94.6

And I would like to retain only unique strings in the first field that have the lowest value in the second column to have:

    denovo0  90.2
    denovo1  96.9
    denovo10     91.5
    denovo100    94.6

As is can be seen in the upper example, some rows in the file may be complete duplicates of other rows, I am not sure how that would influence solutions.

I have looked up similar solutions on StackOverflow, e.g: Uniq in awk; removing duplicate values in a column using awk , but was not able to adopt them.

I would be happy if someone could help.

I'd prefer using AWK but BASH would also be an option. I am working with MacOSX Yosemite.

I would be really happy if someone could help out.

Thank you and kind regards,

Paul

回答1:

You can get the results you show with:

awk '{if (!($1 in a)) a[$1] = $2} END { for (key in a) print key, a[key] }'

Output:

denovo0 90.2
denovo1 97.7
denovo10 93.8
denovo100 95.3

For the results described (minimum value in column 2 for each key in column 1), you can use:

awk '{ if (!($1 in a)) a[$1] = $2; else if (a[$1] > $2) a[$1] = $2 }
     END { for (key in a) print key, a[key] }'

Output:

denovo0 90.2
denovo1 96.9
denovo10 91.5
denovo100 94.6

You can also get the sample output in the question by looking for the maximum value for each key; it so happens that the maximum value is also the first for each key in the sample data.

awk '{ if (!($1 in a)) a[$1] = $2; else if (a[$1] < $2) a[$1] = $2 }
     END { for (key in a) print key, a[key] }'

Output:

denovo0 90.2
denovo1 97.7
denovo10 93.8
denovo100 95.3

回答2:

I assume that you intend to have the output show the lowest value associated with each string that appears in the first column. I also assume that the input is sorted in the first column (such as is shown in your example).

This approach keeps the output in the same order as it appeared in the input:

$ awk '$1==last{if ($2<min)min=$2;next} last{print last"\t"min} {last=$1;min=$2} END{print last"\t"min}' file
denovo0 90.2
denovo1 96.9
denovo10        91.5
denovo100       94.6

This will work with large data sets because it does not require all the output to be kept in memory at once. However, if your input file is not sorted on the first column, use Jonathan Leffler's approach.

回答3:

This should be simple using awk like:

 awk '{if(!($1 in a)){print; a[$1]++}}' myfile

if the first field ($1) is not in the a array, print the line and add the 1st field to a. Next time we see that field, it will be in the array and so will not be printed.

回答4:

Even simpler solution using unix pipe.

sort -k1,1 -k2,2n inputFile | awk '!seen[$1]++' > outputFile

Deduplicate two column file based on minimum value

问题:

回答1:

回答2:

回答3:

回答4:

收藏的人(0)

Deduplicate two column file based on minimum value

问题:

回答1:

回答2:

回答3:

回答4:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮