I just started learning python and here I have a sorted list of protein sequences (total 59,000 sequences) and some of them overlap. I have made a toy list here for example:
ABCDE
ABCDEFG
ABCDEFGH
ABCDEFGHIJKLMNO
CEST
DBTSFDE
DBTSFDEO
EOEUDNBNUW
EOEUDNBNUWD
EAEUDNBNUW
FEOEUDNBNUW
FG
FGH
I would like to remove those shorter overlap and just keep the longest one so the desired output would look like this:
ABCDEFGHIJKLMNO
CEST
DBTSFDEO
EAEUDNBNUW
FEOEUDNBNUWD
FGH
How can I do it? My code looks like this:
with open('toy.txt' ,'r') as f:
pattern = f.read().splitlines()
print pattern
for i in range(0, len(pattern)):
if pattern[i] in pattern[i+1]:
pattern.remove(pattern[i])
print pattern
And I got the error message:
['ABCDE', 'ABCDEFG', 'ABCDEFGH', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGH', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']
Traceback (most recent call last):
File "test.py", line 8, in <module>
if pattern[i] in pattern[i+1]:
IndexError: list index out of range
Code
Output
Details
We use
read_file
to yield each line of the file.find_longest_sequences
builds a defaultdict that groups similar sequences together. It iterates the data with two loops:A set of the values is made of the resulting dict, and the longest sequences are returned.
Note some discrepancies with your expected output:
FGH
overlaps withABCDEFGHIJKLMNO
and is thus not a valid output.FEOEUDNBNUWD
is not an original sequence. Post-processing is needed for overlapping sequences.Not an exact match with your expectations, but, given that you state it's sorted (and it's not, near
EOEUDNBNUWD EAEUDNBNUW
) and that I don't know why you're missingEOEUDNBNUWD
I am not sure if your expectations are correctly stated or if I've misread your question.(ah, yes, I see the notion of overlap throws a wrench into the
sort
andstartswith
approach).Might be nice for the OP to restate that particular aspect, I read @DSM comment without really understanding his concern. Now I do.
output:
This will get you where you want to be:
I've added
set
just in case of multiple occurrences of same text.You could use
groupby()
andmax()
to help here:This would display:
groupby()
works by returning a list of matching items based on a function, in this case consecutive lines with the same first 2 characters. Themax()
function then takes this list and returns the list item with the longest length.Kenny, You almost got it, but there are two problems which @scharette pointed out:
for
loop and removing of list item should not go together. The fix is to use thewhile
loop and explicitly increase the index. Thewhile
loop is less efficient because it callslen()
several times instead once, but that's what it take to get the correct result.IndexError
. This only happens at the very last line. My way to deal with this problem is to ignore the error.With that, I modified your code to:
There is other working answers, but none of them explain your actual problem. you were actually really close of a valid solution and what is, in my opinion, the most readable answer.
The error came from the fact that you were mutating the same list while checking for index using
range()
.Thus, while increasing the
i
variable you were removing item from the list which at one point causes theindex error
inevitably.Therefore, here is a working version of your initial code with some changes,
Note that this code will work if your list is previously sorted as you mentioned in comment section.
What is this code doing ?
Basically, it use the same logic of your initial answer where it iterates on the list and check if the next item contains the current item. But, using another list and iterating until the before last item, will fix your index problem. But now comes a question,
What should I do with the last item ?
Since the list is sorted, you can consider the last item as always being unique. This is why I'm using
which adds the last item of the initial list.
Important note
This answer was written in response to OP's initial question where he wanted to keep the longer overlap and I quote based on the next item in same list. As stated by @Chris_Rands if your concerns are related to a biological task and need to find any overlap, this solution is not suited for your needs.
Example where this code would fail to recognize a potential overlap,
where it would output the same result without removing the possible
"ACD"
overlap. Now, just as a clarification though, this would imply a much more complex algorithm and I initially thought it was out of the scope of the question's requirements. If ever this is your case, I may be completely wrong here, but I truly think a C++ implementation seems more appropriate. have a look at the CD-Hit algorithm suggested by @Chris_Rands in the comment section.