awk/bash remove lines with an unique id and keep t

2019-09-15 08:18发布

If we have the following input and would like to firstly, detect if the cpd_number ($2) is unique in the file, remove the whole row. In this case, the line with "cpd-6666666" should be removed.
Secondly, if there are multiple lines kept under the same "cpd_number", only prints out the two lines which has the max and min "log_ratio" ($17).

targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-1.176091259,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.602059991,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

The ideal output should be

targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

I was trying to use awk count function but it doesn't seem to work quite well. Could any guru kindly give some comments? Thanks!

2条回答
唯我独甜
2楼-- · 2019-09-15 08:24

While not as concise as the perl answer, here's a executable awk file:

#!/usr/bin/awk -f

BEGIN { FS="," }

NR==1 {print; next}

{
  a[$2,$17]=$0

  h=high[$2]
  high[$2]=$17>h || h=="" ? $17 : h

  l=low[$2]
  low[$2]=$17<l || l=="" ? $17 : l
}

END {
  for(i in high) {
    if(low[i]!=high[i]) {
      print a[i,high[i]]
      print a[i,low[i]]
    }
  }
}

which:

  • Prints the header row
  • Stores whole lines in a, and highs and lows for each key
  • In the END, walks the high array and prints lines where the high and low are different by retrieving them from a
查看更多
甜甜的少女心
3楼-- · 2019-09-15 08:31

It will be easier if you can give a try with perl:

perl -MList::Util=max,min -F, -lane '
    print if $.==1;
    $line{$F[1]}{$F[16]}=$_; 
    $count{$F[1]}++ 
}{
    for $key (keys %count) {
        next if $count{$key}==1; 
        print $line{$key}{ max keys %{$line{$key}} };
        print $line{$key}{ min keys %{$line{$key}} }
     }
' file

Output:

targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-1.301029996,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

Explanation:

  • Use a core module (comes with every perl binary) to find min and max values of $17.
  • Using -F split the lines on ,.
  • -a option splits the line on delimiter in to an @F array.
  • If it is our first line, we print it
  • We create a hash of hash with second column as key and 17th column as second tier key. The value is entire line
  • We create a counter hash to keep counts of second column
  • In the END block we iterate through our keys from counter hash. If the count is one we skip it.
  • If the counter is greater than one we print the min and max of 17th column.
查看更多
登录 后发表回答