How can I delete lines which are substrings of other lines in a file while keeping the longer strings which include them?
I have a file that contains peptide sequences as strings, one sequence string per line. I want to keep only the lines that contain other lines as substrings, and remove every line that is a substring of another line in the file.
Input:
GSAAQQYW
ATFYGGSDASGT
GSAAQQYWTPANATFYGGSDASGT
GSAAQQYWTPANATF
ATFYGGSDASGT
NYARTTCRRTG
IVPVNYARTTCRRTGGIRFTITGHDYFDN
RFTITGHDYFDN
IVPVNYARTTCRRTG
ARTTCRRTGGIRFTITG
Expected Output:
GSAAQQYWTPANATFYGGSDASGT
IVPVNYARTTCRRTGGIRFTITGHDYFDN
The output should keep only the longest strings and remove all lines that are substrings of them. So, in the input above, lines 1, 2, 4 and 5 are substrings of line 3, so line 3 is retained. Similarly, lines 6, 8, 9 and 10 are all substrings of line 7, so line 7 is retained and written to the output.
This should do what you want:
$ cat tst.awk
{ arr[$0]; strs = strs $0 RS }
END {
    for (str in arr) {
        if ( split(strs, tmp, str) == 2 ) {
            print str
        }
    }
}
$ awk -f tst.awk file
IVPVNYARTTCRRTGGIRFTITGHDYFDN
GSAAQQYWTPANATFYGGSDASGT
It loops through every string in arr and uses it as the separator value for split(). If the string occurs exactly once, the full file contents are split in half and split() returns 2; but if the string is a substring of some other string, the file contents are split into more segments and split() returns some number higher than 2.
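As a quick way to see those split() return values in action, here is a self-contained demo using two made-up strings (AB and XABY, not from the question's data):

```shell
# "XABY" occurs exactly once, so it cuts the buffer into 2 pieces;
# "AB" also occurs inside "XABY", so it cuts the buffer into 3 pieces.
printf 'AB\nXABY\n' | awk '
{ strs = strs $0 RS }
END {
    print split(strs, tmp, "XABY")
    print split(strs, tmp, "AB")
}'
```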
If a string can appear multiple times in the input and you want it printed multiple times in the output (see the question in the comment from @G.Cito below) then you'd modify the above to:
!cnt[$0]++ { strs = strs $0 RS }
END {
    for (str in cnt) {
        if ( split(strs, tmp, str) == 2 ) {
            for (i = 1; i <= cnt[str]; i++) {
                print str
            }
        }
    }
}
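For example, with a made-up input where the longest string XABY appears twice (not the question's data), the counted variant prints it twice while still dropping the substring AB:

```shell
printf 'AB\nXABY\nXABY\n' | awk '
# store each distinct line once in the buffer, but count its occurrences
!cnt[$0]++ { strs = strs $0 RS }
END {
    for (str in cnt)
        if (split(strs, tmp, str) == 2)         # str is not inside any other line
            for (i = 1; i <= cnt[str]; i++)     # print once per original occurrence
                print str
}'
```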
Maybe:
input=./input_file
while IFS= read -r str
do
    [[ $(grep -cF -- "$str" "$input") == 1 ]] && echo "$str"
done < "$input"
produces:
GSAAQQYWTPANATFYGGSDASGT
IVPVNYARTTCRRTGGIRFTITGHDYFDN
It is slow (one grep pass over the whole file per input line), but simple. Note the -F flag so each sequence is matched as a fixed string rather than a regex: a "master" line matches only itself, so grep -c returns 1; a substring matches its own line plus at least one longer line, so grep -c returns more than 1.
As a perl "one liner" (this should work for cutting and pasting into a terminal):
perl -E 'chomp(@r=<>);
for $i (0..$#r){
map { $uniq{$_}++ if ( index( $r[$i], $_ ) != -1 ) } @r;
}
for (sort keys %uniq){ say if ( $uniq{$_} == 1 ); }' peptide_seq.txt
We read and chomp the file (peptide_seq.txt) from STDIN (<>) and save it in @r, an array in which each element is the string from one line of the file.
Next we iterate through the array and map the elements of @r to a hash (%uniq) where each key is the content of a line, and each value is a number that is incremented whenever that line is found to be a substring of another line. Using index we can check whether a string contains a substring, incrementing the corresponding hash value whenever index() does not return the "not found" value (-1).
The "master" strings contain all the other strings as substrings of themselves and will only be incremented once (each string matches itself), so we loop again to print the keys of the %uniq hash that have the value == 1. This second loop could be a map instead:
map { say if ( $uniq{$_} == 1 ) } sort keys %uniq;
As a self-contained script, that could be:
#!perl -l
chomp( @r = <DATA> );
for $i (0..$#r) {
    map { $uniq{$_}++ if ( index( $r[$i], $_ ) != -1 ) } @r;
}
map { print if ( $uniq{$_} == 1 ) } sort keys %uniq;
__DATA__
GSAAQQYW
ATFYGGSDASGT
GSAAQQYWTPANATFYGGSDASGT
GSAAQQYWTPANATF
ATFYGGSDASGT
NYARTTCRRTG
IVPVNYARTTCRRTGGIRFTITGHDYFDN
RFTITGHDYFDN
IVPVNYARTTCRRTG
ARTTCRRTGGIRFTITG
Output:
GSAAQQYWTPANATFYGGSDASGT
IVPVNYARTTCRRTGGIRFTITGHDYFDN
If all you need is the two longest lines, this prints them (prefixed with their length and line number; note it does not actually test for substring containment):
awk '{ print length(), NR, $0 | "sort -rn" }' sed_longer.txt | head -n 2
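If you go that route, the length prefix can be stripped afterwards with cut. A self-contained sketch using a few of the question's sequences (and assuming the sequences never contain spaces):

```shell
printf 'GSAAQQYW\nGSAAQQYWTPANATFYGGSDASGT\nIVPVNYARTTCRRTGGIRFTITGHDYFDN\nRFTITGHDYFDN\n' |
awk '{ print length(), $0 }' |   # prefix each line with its length
sort -rn |                       # longest lines first
head -n 2 |                      # keep the two longest
cut -d' ' -f2-                   # drop the length prefix
```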