awk/bash remove lines with an unique id and keep t

If we have the following input and would like to firstly, detect if the cpd_number ($2) is unique in the file, remove the whole row. In this case, the line with "cpd-6666666" should be removed.
Secondly, if there are multiple lines kept under the same "cpd_number", only prints out the two lines which has the max and min "log_ratio" ($17).

targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-1.176091259,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.602059991,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

The ideal output should be

targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

I was trying to use awk count function but it doesn't seem to work quite well. Could any guru kindly give some comments? Thanks!

标签： bash awk count max min

2条回答

唯我独甜

2楼-- · 2019-09-15 08:24

While not as concise as the perl answer, here's a executable awk file:

#!/usr/bin/awk -f

BEGIN { FS="," }

NR==1 {print; next}

{
  a[$2,$17]=$0

  h=high[$2]
  high[$2]=$17>h || h=="" ? $17 : h

  l=low[$2]
  low[$2]=$17<l || l=="" ? $17 : l
}

END {
  for(i in high) {
    if(low[i]!=high[i]) {
      print a[i,high[i]]
      print a[i,low[i]]
    }
  }
}

which:

Prints the header row
Stores whole lines in a, and highs and lows for each key
In the END, walks the high array and prints lines where the high and low are different by retrieving them from a

0人赞添加讨论(0) 举报

甜甜的少女心

3楼-- · 2019-09-15 08:31

It will be easier if you can give a try with perl:

perl -MList::Util=max,min -F, -lane '
    print if $.==1;
    $line{$F[1]}{$F[16]}=$_; 
    $count{$F[1]}++ 
}{
    for $key (keys %count) {
        next if $count{$key}==1; 
        print $line{$key}{ max keys %{$line{$key}} };
        print $line{$key}{ min keys %{$line{$key}} }
     }
' file

Output:

targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

Explanation:

Use a core module (comes with every perl binary) to find min and max values of $17.
Using -F split the lines on ,.
-a option splits the line on delimiter in to an @F array.
If it is our first line, we print it
We create a hash of hash with second column as key and 17th column as second tier key. The value is entire line
We create a counter hash to keep counts of second column
In the END block we iterate through our keys from counter hash. If the count is one we skip it.
If the counter is greater than one we print the min and max of 17th column.

0人赞添加讨论(0) 举报