How can I delete lines which are substrings of other lines in a file while keeping the longer strings which include them?
I have a file that contains peptide sequences as strings - one sequence string per line. I want to keep the strings which contain all the sequences and remove all lines which are substrings of other lines in the file.
Input:
GSAAQQYW
ATFYGGSDASGT
GSAAQQYWTPANATFYGGSDASGT
GSAAQQYWTPANATF
ATFYGGSDASGT
NYARTTCRRTG
IVPVNYARTTCRRTGGIRFTITGHDYFDN
RFTITGHDYFDN
IVPVNYARTTCRRTG
ARTTCRRTGGIRFTITG
Expected Output:
GSAAQQYWTPANATFYGGSDASGT
IVPVNYARTTCRRTGGIRFTITGHDYFDN
The output should keep only the longest strings and remove all lines which are substrings of a longer string. In the input above, lines 1, 2, 4 and 5 are substrings of line 3, so the output retains line 3. Similarly, the strings on lines 6, 8, 9 and 10 are all substrings of line 7, so line 7 is retained and written to the output.
This should do what you want:
It loops through every string in arr and uses each one as the separator value for split(). If the string occurs only once, the full file contents are split into two parts and split() returns 2. If the string is a substring of some other string, the file contents are split into more segments and split() returns a number higher than 2.
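The counting idea described above can be sketched in Python. This is not the answer's awk code, only an illustration of the same split-and-count logic; the function name keep_unsplit and the choice to join the lines with newlines are assumptions.

```python
def keep_unsplit(lines):
    """Keep strings that split the joined input into exactly two pieces."""
    # Drop empty lines: str.split() rejects an empty separator.
    lines = [s for s in lines if s]
    contents = "\n".join(lines)
    unique = dict.fromkeys(lines)  # de-duplicate, preserving first-seen order
    # Two pieces means the string occurs exactly once in the whole input.
    return [s for s in unique if len(contents.split(s)) == 2]


peptides = """GSAAQQYW
ATFYGGSDASGT
GSAAQQYWTPANATFYGGSDASGT
GSAAQQYWTPANATF
ATFYGGSDASGT
NYARTTCRRTG
IVPVNYARTTCRRTGGIRFTITGHDYFDN
RFTITGHDYFDN
IVPVNYARTTCRRTG
ARTTCRRTGGIRFTITG""".split()

print("\n".join(keep_unsplit(peptides)))
# GSAAQQYWTPANATFYGGSDASGT
# IVPVNYARTTCRRTGGIRFTITGHDYFDN
```

Note that a line duplicated in the input yields more than two pieces, so this sketch drops it even if no longer line contains it.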
If a string can appear multiple times in the input and you want it printed multiple times in the output (see the question in the comment from @G.Cito below) then you'd modify the above to:
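A Python sketch of that duplicate-aware variant, under the assumption that a string seen n times as a whole line should be kept (and printed n times) when splitting on it yields exactly n + 1 pieces. The function name keep_with_duplicates is hypothetical and this is not the answer's awk code.

```python
from collections import Counter


def keep_with_duplicates(lines):
    """Keep maximal strings, repeating each as often as it appears in the input."""
    lines = [s for s in lines if s]  # an empty separator is not allowed in split()
    contents = "\n".join(lines)
    counts = Counter(lines)  # how many times each string is a whole line
    kept = []
    for s, n in counts.items():
        # n whole-line occurrences give n + 1 pieces; more pieces mean
        # the string also occurs inside some other line.
        if len(contents.split(s)) == n + 1:
            kept.extend([s] * n)
    return kept


print(keep_with_duplicates(["ABC", "AB", "ABC", "XYZ"]))
# ['ABC', 'ABC', 'XYZ']
```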
This will give you exactly what you need:
awk '{ print length(), NR, $0 | "sort -rn" }' sed_longer.txt | head -n 2
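The length-based idea can also be written without assuming how many lines should survive. The Python sketch below is a different technique from the awk, sort and head pipeline above: it visits lines longest first and keeps a line only if no already-kept line contains it. The function name keep_longest_first is hypothetical.

```python
def keep_longest_first(lines):
    """Visit lines longest first; keep those not contained in a kept line."""
    kept = []
    # sorted() is stable, so lines of equal length keep their input order.
    for s in sorted((s for s in lines if s), key=len, reverse=True):
        # Any line containing s is at least as long, so it was visited earlier
        # and is either in kept or is itself contained in a kept line.
        if not any(s in longer for longer in kept):
            kept.append(s)
    return kept


peptides = """GSAAQQYW
ATFYGGSDASGT
GSAAQQYWTPANATFYGGSDASGT
GSAAQQYWTPANATF
ATFYGGSDASGT
NYARTTCRRTG
IVPVNYARTTCRRTGGIRFTITGHDYFDN
RFTITGHDYFDN
IVPVNYARTTCRRTG
ARTTCRRTGGIRFTITG""".split()

print("\n".join(keep_longest_first(peptides)))
# IVPVNYARTTCRRTGGIRFTITGHDYFDN
# GSAAQQYWTPANATFYGGSDASGT
```

The output is ordered by length rather than by position in the file, and exact duplicates collapse to a single line.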
As a perl "one liner" (this should work for cutting and pasting into a terminal):
We read and chomp the file (peptide_seq.txt) from STDIN (<>) and save it in @r, which will be an array in which each element is a string from each line in the file.
Next we iterate through the array and map the elements of @r to a hash (%uniq) where each key is the content of a line, and each value is a number that is incremented whenever that line is found to be a substring of a line in the file (including itself). Using index we can check whether a string contains a sub-string, and we increment the corresponding hash value if index() does not return the value for "not found" (-1).
The "master" strings are sub-strings only of themselves, so their values will only be incremented once. We therefore loop again to print the keys of the %uniq hash that have the value == 1. This second loop could be a map instead: map { say if ( $uniq{$_} == 1 ) } sort keys %uniq ;
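The counting scheme described above can be sketched in Python, using str.find() in place of Perl's index() (both return -1 for "not found"). This is an illustration rather than the answer's Perl code; the function name keep_by_index_count is hypothetical, and de-duplicating the lines first is an assumption so that each line matches itself exactly once.

```python
def keep_by_index_count(lines):
    """Count, for each unique line, how many unique lines contain it."""
    unique = list(dict.fromkeys(s for s in lines if s))
    uniq = {s: 0 for s in unique}  # plays the role of the %uniq hash
    for s in unique:
        for other in unique:
            # find() returns -1 when s does not occur in other.
            if other.find(s) != -1:
                uniq[s] += 1
    # A count of 1 means the line is contained only in itself.
    return sorted(s for s, n in uniq.items() if n == 1)


peptides = """GSAAQQYW
ATFYGGSDASGT
GSAAQQYWTPANATFYGGSDASGT
GSAAQQYWTPANATF
ATFYGGSDASGT
NYARTTCRRTG
IVPVNYARTTCRRTGGIRFTITGHDYFDN
RFTITGHDYFDN
IVPVNYARTTCRRTG
ARTTCRRTGGIRFTITG""".split()

print("\n".join(keep_by_index_count(peptides)))
# GSAAQQYWTPANATFYGGSDASGT
# IVPVNYARTTCRRTGGIRFTITGHDYFDN
```

The nested loop compares every pair of unique lines, so the running time grows quadratically with the number of lines.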
As a self-contained script that could be:
Output:
Maybe:
produces:
It is slow, but simple.