Remove lines which are substrings of other lines

Published 2019-05-28 18:43

Question:

How can I delete lines which are substrings of other lines in a file while keeping the longer strings which include them?

I have a file that contains peptide sequences as strings, one sequence per line. I want to keep only the longest strings (the ones that contain the others) and remove every line that is a substring of another line in the file.

Input:

GSAAQQYW
ATFYGGSDASGT
GSAAQQYWTPANATFYGGSDASGT
GSAAQQYWTPANATF
ATFYGGSDASGT
NYARTTCRRTG
IVPVNYARTTCRRTGGIRFTITGHDYFDN
RFTITGHDYFDN
IVPVNYARTTCRRTG
ARTTCRRTGGIRFTITG

Expected Output:

GSAAQQYWTPANATFYGGSDASGT
IVPVNYARTTCRRTGGIRFTITGHDYFDN

The output keeps only the longest strings and removes every line that is a substring of one of them. In the input above, lines 1, 2, 4 and 5 are substrings of line 3, so only line 3 is retained. Similarly, lines 6, 8, 9 and 10 are all substrings of line 7, so line 7 is retained and written to the output.

Answer 1:

This should do what you want:

$ cat tst.awk
{ arr[$0]; strs=strs $0 RS }
END {
    for (str in arr) {
        if ( split(strs,tmp,str) == 2 ) {
            print str
        }
    }
}

$ awk -f tst.awk file
IVPVNYARTTCRRTGGIRFTITGHDYFDN
GSAAQQYWTPANATFYGGSDASGT

It loops through every string in arr and uses each one as the separator for split(). If the string occurs exactly once in the concatenated file contents, split() cuts the text into two pieces and returns 2. If the string is a substring of some other string, it occurs more than once, so split() returns a number greater than 2.
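The return value of split() can be checked directly. A minimal demonstration (the file name peptides.txt and the two test strings are invented for illustration):

```shell
# Build a two-line file: "ABC" is a substring of "ABCD".
printf 'ABC\nABCD\n' > peptides.txt

# "ABC" occurs twice in the concatenated text (once on its own line and
# once inside "ABCD"), so split() returns 3; "ABCD" occurs exactly once,
# so split() returns 2.
awk '{ strs = strs $0 RS }
     END {
         print split(strs, tmp, "ABC")
         print split(strs, tmp, "ABCD")
     }' peptides.txt
```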

If a string can appear multiple times in the input and you want it printed multiple times in the output (see the question in the comment from @G.Cito below) then you'd modify the above to:

!cnt[$0]++ { strs=strs $0 RS }
END {
    for (str in cnt) {
        if ( split(strs,tmp,str) == 2 ) {
            for (i=1;i<=cnt[str];i++) {
                print str
            }
        }
    }
}
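A quick check of the duplicate-handling variant (the file dup.txt and its contents are invented for illustration): a string that appears twice and is not a substring of any longer line is printed twice.

```shell
# "ABC" appears twice and is not contained in any longer line, so it is
# printed twice; "AB" is a substring of "ABC", so it is dropped.
printf 'AB\nABC\nABC\n' > dup.txt

awk '!cnt[$0]++ { strs = strs $0 RS }
     END {
         for (str in cnt)
             if (split(strs, tmp, str) == 2)
                 for (i = 1; i <= cnt[str]; i++)
                     print str
     }' dup.txt
```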


Answer 2:

Maybe:

input=./input_file
while read -r str
do
[[ $(grep -c "$str" "$input") -eq 1 ]] && echo "$str"
done < "$input"

produces:

GSAAQQYWTPANATFYGGSDASGT
IVPVNYARTTCRRTGGIRFTITGHDYFDN

It is slow, but simple.
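A slightly hardened sketch of the same loop (input_file is a stand-in path): grep -F matches the sequence as a fixed string rather than a regular expression, and -- protects against patterns that begin with a dash.

```shell
# For each line, count how many lines of the file contain it as a fixed
# substring; a count of 1 means the line is contained only in itself.
input=./input_file
while read -r str
do
    [[ $(grep -cF -- "$str" "$input") -eq 1 ]] && echo "$str"
done < "$input"
```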



Answer 3:

As a perl "one liner" (this should work for cutting and pasting into a terminal):

perl -E 'chomp(@r=<>); 
        for $i (0..$#r){ 
           map { $uniq{$_}++ if ( index( $r[$i], $_ ) != -1 ) } @r; 
        }
        for (sort keys %uniq){ say if ( $uniq{$_} == 1 ); }' peptide_seq.txt
  • We read and chomp the file (peptide_seq.txt) from STDIN (<>) and save it in @r, an array in which each element is one line of the file.

  • Next we iterate through the array and, for each line $r[$i], map over the elements of @r, incrementing $uniq{$_} whenever index() shows that $r[$i] contains $_ as a substring (index() returns -1 for "not found"). So each key of %uniq is a line, and its value counts how many lines of the file contain it.

  • The "master" strings contain all the other strings as sub-strings of themselves and will only be incremented once, so we loop again to print the keys of the %uniq hash that have the value == 1. This second loop could be a map instead:

    map { say if ( $uniq{$_} == 1 ) } sort keys %uniq ;

As a self-contained script that could be:

#!/usr/bin/perl -l
chomp(@r=<DATA>); 

for $i (0..$#r) {
  map { $uniq{$_}++ if ( index( $r[$i], $_ ) != -1 ) } @r ;
}

map { print if ($uniq{$_} == 1) } sort keys %uniq ; 

__DATA__
GSAAQQYW
ATFYGGSDASGT
GSAAQQYWTPANATFYGGSDASGT
GSAAQQYWTPANATF
ATFYGGSDASGT
NYARTTCRRTG
IVPVNYARTTCRRTGGIRFTITGHDYFDN
RFTITGHDYFDN
IVPVNYARTTCRRTG
ARTTCRRTGGIRFTITG

Output:

GSAAQQYWTPANATFYGGSDASGT
IVPVNYARTTCRRTGGIRFTITGHDYFDN


Answer 4:

This prints the two longest lines, prefixed with their length and line number; for this input, that is exactly the two expected sequences:

awk '{ print length($0), NR, $0 | "sort -rn" }' sed_longer.txt | head -n 2
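Note that this relies on knowing in advance that exactly two "master" strings exist; it does not test for substring containment at all. A sketch of a variant that also strips the length and line-number columns so the output matches the expected sequences:

```shell
# Prefix each line with its length and line number, sort longest-first,
# keep the top two, then drop the two numeric columns added for sorting.
awk '{ print length($0), NR, $0 }' sed_longer.txt | sort -rn | head -n 2 | cut -d' ' -f3-
```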