I just started learning python and here I have a sorted list of protein sequences (total 59,000 sequences) and some of them overlap. I have made a toy list here for example:
ABCDE
ABCDEFG
ABCDEFGH
ABCDEFGHIJKLMNO
CEST
DBTSFDE
DBTSFDEO
EOEUDNBNUW
EOEUDNBNUWD
EAEUDNBNUW
FEOEUDNBNUW
FG
FGH
I would like to remove those shorter overlap and just keep the longest one so the desired output would look like this:
ABCDEFGHIJKLMNO
CEST
DBTSFDEO
EAEUDNBNUW
FEOEUDNBNUWD
FGH
How can I do it? My code looks like this:
with open('toy.txt' ,'r') as f:
pattern = f.read().splitlines()
print pattern
for i in range(0, len(pattern)):
if pattern[i] in pattern[i+1]:
pattern.remove(pattern[i])
print pattern
And I got the error message:
['ABCDE', 'ABCDEFG', 'ABCDEFGH', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGH', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']
Traceback (most recent call last):
File "test.py", line 8, in <module>
if pattern[i] in pattern[i+1]:
IndexError: list index out of range
Output:
You can use a binary tree whose insertion process attempts to find nodes that precede the value:
Output:
A simple way is to process the input file one line at a time, compare each line with the previous one and keep previous one if it is not contained in current one.
Code can be as simple as:
As stated in other answers, your error comes from calculating the length of your input at the start and then not updating it as you shorten the list.
Here's another take at a working solution:
Output