How to use info on substring position from one fil

Posted 2019-09-16 18:54

Question:

I'm trying to write a script that loops over one file and extracts substrings from it, taking the information on where to cut from another file. I'm working in bash on MobaXterm. I have the file cut_positions.txt, which is tab-delimited and contains name, start point, end point, length and comment:

k141_20066  103484  104617  1133    phnW  
k141_20841  13200   14324   1124    phnW  
k141_23852  69  452 383 phnW  
k141_32328  1   180 179 phnW 

and the file string_file.txt, which holds the name (it would be no problem to add or remove the ">" in one of the files) and the string (the original strings are much longer, up to 1,000,000 characters):

>k141_10671 CCTTCCCCCACACGCCGCTCTTCCGCTCTTGCTGGCC  
>k141_10707 AGGCGGTATCAGACCTTGCCGCAACACTAAGCCCAGTAACGCTGTCGCCCTTATATCTGA  
>k141_11190 CTTTTGTGACAGTGCAGGGCAATGGTGGATTTATCAGTATCGGGCAGAA  
>k141_1479  AGCCGACAGCAGCGCCGAGGGCACATAATCCGATGACACGATGTCCAAAAGATCCGCCTCGGC

Now I want to use the input from cut_positions.txt: the first column to match the right line, the second column as the start point of the substring, and the fourth column as the length of the substring. This should be done for every line in cut_positions.txt, with the results written to a new file out.txt. To get closer, I tried (with my original data):

➤ grep ">k141_28027\b" test_out_one_line.txt | awk '{print substr($2,57251,69)}'
TCACTTGAGCGCAATTATTCGCTCTCCGGCGGCGTCAGCATCAGCCTGATCATGCGTCACCAAAAGTGT

which worked well as a manual approach. I also figured out how to access the individual elements in cut_positions.txt (here the second column of the first row):

awk -F '\t' 'NR==1{print $2}' cut_positions.txt

but I can't figure out how to turn this into a loop, because I don't know how to connect the redirections, pipes and other steps that I used for the small pieces. Any help is very much appreciated (and tell me if you need more sample data).

thanks crazysantaclaus

Answer 1:

The following script should work for you:

cut.awk

# We are reading two files: pos.txt and strings.txt
# NR is equal to FNR as long as we are reading the
# first file.
NR==FNR{
    pos[">"$1]=$2 # Store the start point in array pos, keyed by ">" plus the name in $1
    len[">"$1]=$4 # Store the length in array len, under the same key
    next # skip the block below for pos.txt
}

# This runs on every line of strings.txt whose name was stored from pos.txt
$1 in pos {
    # Extract a substring of $2 based on the position and length
    # stored above
    key=$1
    mod=substr($2,pos[key],len[key])
    $2=mod
    print # Print the modified line
}
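The NR==FNR test at the top of the script is what tells the two input files apart. As a minimal sketch (the file names first.txt and second.txt and their contents are made up for illustration), this prints both counters for every line:

```shell
# Hypothetical two-line files, created in a temporary directory.
workdir=$(mktemp -d)
printf 'a\nb\n' > "$workdir/first.txt"
printf 'c\nd\n' > "$workdir/second.txt"

# NR counts lines across all files; FNR restarts at 1 for each file.
result=$(awk '{ print NR, FNR, (NR == FNR ? "first file" : "second file") }' \
    "$workdir/first.txt" "$workdir/second.txt")
echo "$result"
# 1 1 first file
# 2 2 first file
# 3 1 second file
# 4 2 second file
```

One caveat: if the first file is empty, NR==FNR is also true while the second file is read, so the idiom assumes pos.txt has at least one line.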

Call the script like this:

awk -f cut.awk pos.txt strings.txt
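This single awk call takes the place of the explicit loop the question was reaching for. For comparison only, such a loop could be sketched roughly as follows. The sample names seqA and seqB and their sequences are hypothetical stand-ins for the real data, and the file names are the ones from the question:

```shell
# Hypothetical sample data in a temporary directory.
workdir=$(mktemp -d)
printf 'seqA\t3\t7\t4\tphnW\n'  > "$workdir/cut_positions.txt"
printf 'seqB\t1\t3\t2\tphnW\n' >> "$workdir/cut_positions.txt"
printf '>seqA ABCDEFGHIJ\n>seqB KLMNOPQRST\n' > "$workdir/string_file.txt"

# Read one row of cut positions at a time; "end" is read but not used.
while read -r name start end len comment; do
    # Pass the shell variables into awk with -v and cut the matching line.
    awk -v key=">$name" -v start="$start" -v len="$len" \
        '$1 == key { print $1, substr($2, start, len) }' \
        "$workdir/string_file.txt"
done < "$workdir/cut_positions.txt" > "$workdir/out.txt"

cat "$workdir/out.txt"
# >seqA CDEF
# >seqB KL
```

The loop rereads string_file.txt once for every row of cut_positions.txt, whereas the two-file awk script reads each file only once, which matters when the sequences are up to a million characters long.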

One important thing to mention: substr() treats strings as starting at index 1, unlike most programming languages, where strings start at index 0. If the positions in pos.txt are 0-based, the substr() call must become:

mod=substr($2,pos[key]+1,len[key])
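A small sketch of the difference, using a made-up ten-letter string: the same four characters are selected with a 1-based start of 3, or with a 0-based start of 2 plus the correction.

```shell
# Hypothetical test string; positions are easy to read off the alphabet.
seq_str=ABCDEFGHIJ

# 1-based start, as substr() expects: characters 3 to 6.
one_based=$(awk -v s="$seq_str" 'BEGIN { print substr(s, 3, 4) }')

# The same region given with a 0-based start of 2, corrected by +1.
zero_based=$(awk -v s="$seq_str" -v p=2 'BEGIN { print substr(s, p + 1, 4) }')

echo "$one_based $zero_based"
# CDEF CDEF
```

In the sample rows of the question the length column equals end minus start, so whether the end position itself is included depends on the coordinate convention of the tool that produced the file. That is worth checking on one known sequence before running the full data set.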

I recommend testing it with simplified, meaningful versions of:

pos.txt

foo  2  5  3    phnW  
bar  4  5  1    phnW
test 1  5  4    phnW

and strings.txt

>foo 123456  
>bar 123456
>non 123456

Output:

>foo 234
>bar 4
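To rerun this test in one go, the following sketch recreates both sample files in a temporary directory and writes the result to out.txt, as the question asks. It uses a condensed inline copy of the cut.awk logic rather than the file itself, and the temporary directory is an assumption made for convenience:

```shell
workdir=$(mktemp -d)

# The sample files from above.
printf 'foo  2  5  3    phnW\nbar  4  5  1    phnW\ntest 1  5  4    phnW\n' \
    > "$workdir/pos.txt"
printf '>foo 123456\n>bar 123456\n>non 123456\n' > "$workdir/strings.txt"

# Condensed inline version of cut.awk: the first block stores start and
# length per name, the second cuts every line whose name was stored.
awk '
NR == FNR { pos[">" $1] = $2; len[">" $1] = $4; next }
$1 in pos { $2 = substr($2, pos[$1], len[$1]); print }
' "$workdir/pos.txt" "$workdir/strings.txt" > "$workdir/out.txt"

cat "$workdir/out.txt"
# >foo 234
# >bar 4
```

The line ">non" is dropped because it has no entry in pos.txt, and the "test" entry produces nothing because strings.txt has no matching line.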