Mpileup regex command to remove indels

2019-09-02 04:06发布

问题:

I am trying to filter out insertions and deletions from an mpileup txt file. An example of an insertion or deletion would be +3ATG or -9AATCGTCTC.

In another post I found a solution using perl:

regular expression that reference a match from earlier part of expression

However, the script writes insertions and deletions to the special variable $&. I would like to replace all insertions and deletions with nothing in a new variable. So my solution is identical, but with substitution at the start and to be replaced with nothing, see below.

$row =~ s/(\d+)(??{"."*$1})//xg;

Does anyone have any idea why it won't work or an alternative solution?

I would also be happy to match anything that wasn't an insertion or deletion and make this a new variable.


Here is an example of the input:

$,...........................,,.................,,....,,g.,,,,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,...............,,,.....,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.....,,.....,,,,,,,,,,,......,,,,,,,,,,,,,,,,,,,,,,,,,,.,,.,,,.............................,,.,.........,.,.,,....,..........,,......................,,,,,,...........................,,,,,,,,.....,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,,,,,,,,,,,,,,,,,,,,.+12GATGCTGTGTTT..,,,,,,,,.,,,,,,,,,,,,,,,,,,,,,,,.,,.,,-8tgatgctg,,,...,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,..

Here is an example of the output I would like:

$,...........................,,.................,,....,,g.,,,,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,...............,,,.....,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.....,,.....,,,,,,,,,,,......,,,,,,,,,,,,,,,,,,,,,,,,,,.,,.,,,.............................,,.,.........,.,.,,....,..........,,......................,,,,,,...........................,,,,,,,,.....,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,,,,,,,,,,,,,,,,,,,,.+..,,,,,,,,.,,,,,,,,,,,,,,,,,,,,,,,.,,.,,-,,,...,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,..

Cheers,

Daniel

回答1:

Is this what you're after?

use feature qw(say);

my $DNA = ',...........,,....,,g.,,,,,,,,,,,.+12GATGCTGTGTTT..,,,,,.,,.,,-8tgatgctg,,,,,,,,..';

say $DNA;

$DNA =~ s/\d+[ATGCatgc]*//g;

say $DNA;

,...........,,....,,g.,,,,,,,,,,,.+12GATGCTGTGTTT..,,,,,.,,.,,-8tgatgctg,,,,,,,,..
,...........,,....,,g.,,,,,,,,,,,.+..,,,,,.,,.,,-,,,,,,,,..


回答2:

A slight variation on the pattern you already have should work:

$pileup = '$,...........................,,.................,,....,,g.,,,,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,...............,,,.....,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.....,,.....,,,,,,,,,,,......,,,,,,,,,,,,,,,,,,,,,,,,,,.,,.,,,.............................,,.,.........,.,.,,....,..........,,......................,,,,,,...........................,,,,,,,,.....,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,,,,,,,,,,,,,,,,,,,,.+12GATGCTGTGTTT..,,,,,,,,.,,,,,,,,,,,,,,,,,,,,,,,.,,.,,-8tgatgctg,,,...,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,..';

$pileup =~ s/[+-](\d+)(??{"[ACGTN]{$1}"})//gi;

print($pileup, "\n");

PRODUCES

$,...........................,,.................,,....,,g.,,,,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,...............,,,.....,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.....,,.....,,,,,,,,,,,......,,,,,,,,,,,,,,,,,,,,,,,,,,.,,.,,,.............................,,.,.........,.,.,,....,..........,,......................,,,,,,...........................,,,,,,,,.....,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,,,,,,,,,,,,,,,,,,,,...,,,,,,,,.,,,,,,,,,,,,,,,,,,,,,,,.,,.,,,,,...,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,..

Which you'll notice is a couple of characters shorter than your example output as you accidentally left in the signs [+-]