Remove item from list based on the next item in sa-第2页回答

I just started learning python and here I have a sorted list of protein sequences (total 59,000 sequences) and some of them overlap. I have made a toy list here for example:

ABCDE
ABCDEFG
ABCDEFGH
ABCDEFGHIJKLMNO
CEST
DBTSFDE
DBTSFDEO
EOEUDNBNUW
EOEUDNBNUWD
EAEUDNBNUW
FEOEUDNBNUW
FG
FGH

I would like to remove those shorter overlap and just keep the longest one so the desired output would look like this:

ABCDEFGHIJKLMNO
CEST
DBTSFDEO
EAEUDNBNUW
FEOEUDNBNUWD
FGH

How can I do it? My code looks like this:

with open('toy.txt' ,'r') as f:
    pattern = f.read().splitlines()
    print pattern

    for i in range(0, len(pattern)):
        if pattern[i] in pattern[i+1]:
            pattern.remove(pattern[i])
        print pattern

And I got the error message:

['ABCDE', 'ABCDEFG', 'ABCDEFGH', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGH', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    if pattern[i] in pattern[i+1]:
IndexError: list index out of range

标签： python list bioinformatics

11条回答

Ridiculous、

2楼-- · 2019-04-03 07:58

# assuming list is sorted:
pattern = ["ABCDE",
"ABCDEFG",
"ABCDEFGH",
"ABCDEFGHIJKLMNO",
"CEST",
"DBTSFDE",
"DBTSFDEO",
"EOEUDNBNUW",
"EAEUDNBNUW",
"FG",
"FGH"]

pattern = list(reversed(pattern))

def iterate_patterns():
    while pattern:
        i = pattern.pop()
        throw_it_away = False
        for p in pattern:
            if p.startswith(i):
                throw_it_away = True
                break
        if throw_it_away == False:
            yield i

print(list(iterate_patterns()))

Output:

['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']

0人赞添加讨论(0) 举报

可以哭但决不认输i

3楼-- · 2019-04-03 07:58

You can use a binary tree whose insertion process attempts to find nodes that precede the value:

class Tree:
  def __init__(self, val=None):
    self.left, self.value, self.right = None, val, None
  def insert_val(self, _val):
    if self.value is None or _val.startswith(self.value):
       self.value = _val
    else:
       if _val < self.value:
          getattr(self.left, 'insert_val', lambda x:setattr(self, 'left', Tree(x)))(_val)
       else:
          getattr(self.right, 'insert_val', lambda x:setattr(self, 'right', Tree(x)))(_val)
  def flatten(self):
     return [*getattr(self.left, 'flatten', lambda :[])(), self.value, *getattr(self.right, 'flatten', lambda :[])()]

t = Tree()
for i in open('filename.txt'):
  t.insert_val(i.strip('\n'))
print(t.flatten())

Output:

['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EAEUDNBNUW', 'EOEUDNBNUW', 'FGH']

0人赞添加讨论(0) 举报

小情绪 Triste *

4楼-- · 2019-04-03 08:00

A simple way is to process the input file one line at a time, compare each line with the previous one and keep previous one if it is not contained in current one.

Code can be as simple as:

with open('toy.txt' ,'r') as f:
    old = next(f).strip()               # keep first line after stripping EOL 

    for pattern in f:
        pattern = pattern.strip()       # strip end of line...
        if old not in pattern:
            print old                   # keep old if it is not contained in current line
        old = pattern                   # and store current line for next iteration
    print old                           # do not forget last line

0人赞添加讨论(0) 举报

Ridiculous、

5楼-- · 2019-04-03 08:00

As stated in other answers, your error comes from calculating the length of your input at the start and then not updating it as you shorten the list.

Here's another take at a working solution:

with open('toy.txt', 'r') as infile:
    input_lines = reversed(map(lambda s: s.strip(), infile.readlines()))

output = []
for pattern in input_lines:
    if len(output) == 0 or not output[-1].startswith(pattern):        
        output.append(pattern)

print('\n'.join(reversed(output)))

0人赞添加讨论(0) 举报

Fickle 薄情

6楼-- · 2019-04-03 08:05

with open('demo.txt') as f:
    lines = f.readlines()

l_lines = len(lines)

n_lst = []

for i, line in enumerate(lines):
    line = line.strip()
    if i == l_lines - 1:
        if lines[-2] not in line:
            n_lst.append(line)
        break
    if line not in lines[i + 1]:
        n_lst.append(line)

print(n_lst)

Output

['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']

0人赞添加讨论(0) 举报

上一页 1 2

Remove item from list based on the next item in sa

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间