AWK work with a VCF (text) file

Posted 2019-09-13 02:42

Question:

I would like to write an awk script that modifies text like this:

  1. Make all columns tab-delimited
  2. Delete all lines that start with "##text"
  3. Keep the header line, which starts with "#header"

I have this code, but it does not work:

#!/bin/bash
for i
in *.vcf;
do
    awk 'BEGIN {print  "CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILT\tINFO\tFORMAT"}' |
    awk '{$1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7 "\t" $8 "\t" $9}' $i |
    awk '!/#/' > ${i%.vcf}.tsv;
done

INPUT:

> ##fileformat=VCFv4.1
> ##FORMAT=<ID=GQX,Number=1,Type=Integer,Description="Minimum of {Genotype quality assuming variant position,Genotype quality assuming non-variant position}">
> #CHROM    POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1
> chr1  10385471    rs17401966  A   G   100.00  PASS    DP=67;TI=NM_015074;GI=KIF1B;FC=Silent   GT:GQ:AD:VF:NL:SB:GQX   0/1:100:29,38:0.5672:20:-100.0000:100
> chr1  17380497    rs2746462   G   T   100.00  PASS    DP=107;TI=NM_003000;GI=SDHB;FC=Synonymous_A6A;EXON  GT:GQ:AD:VF:NL:SB:GQX   1/1:100:0,107:1.0000:20:-100.0000:100
> chr1  222045446   rs6691170   G   T   100.00  PASS    DP=99   GT:GQ:AD:VF:NL:SB:GQX   0/1:100:49,50:0.5051:20:-100.0000:100

OUTPUT: What I want

> CHROM POS   ID          REF  ALT  QUAL    FILTER  INFO             etc...
> chr1  10385471  rs17401966  A   G   100.00  PASS    DP=67;TI=NM_015074;GI=KIF1B;

Answer 1:

You want to put your whole program in a single awk call:

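for f in *.vcf; do
    awk '
        BEGIN {OFS = "\t"}          # join output fields with tabs
        /^##/ {next}                # skip "##" meta-information lines
        /^#/ {sub(/^#/,"",$1)}      # strip the leading "#" from the header line
        {$1=$1; print}              # rebuild the record with OFS, then print it
    ' "$f" > "${f/%vcf/tsv}"
done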

This program will skip any record that begins with ##, will remove the leading hash for lines that have it, and then print each line using tab as the field separator.
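To see that behaviour on a small input, here is a minimal sketch that runs the same awk program on a trimmed-down version of the sample from the question. The function name vcf_to_tsv and the temporary file are assumptions made for this illustration only. The sample is cut to five columns to keep it short.

```shell
# Hypothetical helper name; the awk program itself is the one from the answer.
vcf_to_tsv() {
    awk '
        BEGIN {OFS = "\t"}
        /^##/ {next}
        /^#/ {sub(/^#/,"",$1)}
        {$1=$1; print}
    ' "$1"
}

# Small sample file (temporary; first five columns of the data in the question).
sample=$(mktemp)
printf '%s\n' \
    '##fileformat=VCFv4.1' \
    '#CHROM  POS ID  REF ALT' \
    'chr1  10385471    rs17401966  A   G' > "$sample"

# The "##" line is dropped, the header loses its "#",
# and every remaining line comes out tab-separated.
result=$(vcf_to_tsv "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Piping the result through `cat -A` (GNU) or `od -c` is an easy way to confirm that the separators really are tabs.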

awk programs are a series of condition {action} pairs. For each record in the input, if the condition is true, the action block is performed, otherwise it is ignored. If the condition is omitted, the action block is performed unconditionally.
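As a rough illustration of those condition {action} pairs, the sketch below uses three rules on made-up input: one with a regular-expression condition, one with no condition, and one with the special END condition, which runs after the last record. The input lines and the variable names are invented for the example.

```shell
# Made-up input: one line starting with "#", two ordinary lines.
input=$(printf '%s\n' '# note' 'alpha' 'beta')

result=$(printf '%s\n' "$input" | awk '
    /^#/   {print "comment:", $0}   # runs only for records matching the condition
           {n++}                    # no condition: runs for every record
    END    {print "records:", n}    # special condition: runs after the last record
')
printf '%s\n' "$result"
```

Only the first record satisfies /^#/, so only one "comment:" line is printed, while the unconditional rule counts all three records.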

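One tricky bit in this example is $1=$1: whenever a field is assigned to, even with its own value, awk rebuilds the record by joining the fields with the output field separator (the OFS variable). Without that assignment, the data lines would be printed with their original spacing, because setting OFS alone does not change an unmodified record.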
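The effect of $1=$1 is easiest to see side by side. This is a small sketch on a made-up line with irregular spacing; the variable names are chosen for the illustration.

```shell
line='a   b    c'

# No field is assigned, so $0 is printed exactly as it was read.
unchanged=$(printf '%s\n' "$line" | awk 'BEGIN {OFS = "\t"} {print}')

# Assigning $1 to itself makes awk rebuild $0 from the fields, joined by OFS.
rebuilt=$(printf '%s\n' "$line" | awk 'BEGIN {OFS = "\t"} {$1 = $1; print}')

printf '%s\n' "$unchanged" "$rebuilt"
```

The first line keeps its original runs of spaces, and the second has a single tab between each pair of fields.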