Is there a way to find the phrase and capture next t

Posted 2019-07-31 07:09

So I have a file like this on the server:

COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T

COADREAD ATG10 Missense_Mutation NGXA-AB-A010 Q9H0Y0 H133N

My goal is to find the id (P17544), which is in column 5 of the file, and capture/store the number from the token that follows it in column 6 (I need to print that number later). For A436T that number is 436; it is supposed to sit between two letters. Is there a way I can do this? I have worked a little with lxml before, but I'm still not sure how to approach this. Thanks in advance.

Here is what I have

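file = open('text.txt', 'r')
lookup = {}
for line in file:
    # split the current line (not the file object) and keep the last two columns
    myid, token = line.strip().rsplit(' ', 2)[1:]
    # drop the first and last letter, e.g. A436T -> 436
    token = token[1:-1]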

2 Answers
神经病院院长
#2 · 2019-07-31 07:43

Sounds quite easy: split along the spaces, then extract the fifth field and all the digits from the sixth field. Or am I missing something?

>>> tokens = "COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T".split()
>>> print tokens[4]
P17544
>>> print ''.join([c for c in tokens[5] if c.isdigit()])
436
劳资没心,怎么记你
#3 · 2019-07-31 07:47

Simplest method using builtin str methods:

d = 'COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T'
myid, token = d.rsplit(' ', 2)[1:]  # raises ValueError if it can't be unpacked, so you know you've got exactly 2 elements
token = token[1:-1]

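You could use a regular expression instead if you want to require a number between two letters, e.g. re.match(r'[A-Z](\d{3})[A-Z]', token) applied to the original token (before the slice above), or similar. Use \d+ in place of \d{3} if the number of digits can vary.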
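
A minimal sketch of the regular-expression variant, assuming Python 3 and that every token has the shape letter, digits, letter. The helper name extract_position is hypothetical, and \d+ is used here so that positions of any length match:

```python
import re

# One uppercase letter, one or more digits (captured), one uppercase letter.
TOKEN_RE = re.compile(r'[A-Z](\d+)[A-Z]')

def extract_position(token):
    """Return the digits between the two letters, or None if the shape differs."""
    m = TOKEN_RE.fullmatch(token)
    return m.group(1) if m else None

print(extract_position('A436T'))
print(extract_position('H133N'))
```

Unlike the plain slice, this returns None for a token that does not look like letter-digits-letter, so malformed rows can be detected instead of silently producing a wrong value.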

Clarification:

d.rsplit(' ', 2) - starts splitting the string at ' 's from the end which returns ['COADREAD ATF7 Missense_Mutation NGXA-AZ-3984', 'P17544', 'A436T'] . Assuming we're only looking for the last 2 elements, we remove the first one with a slice, so we get d.rsplit(' ', 2)[1:] which gives ['P17544', 'A436T'].

Using unpacking, we name our variables and also guarantee that there are exactly two of them: myid, token = d.rsplit(' ', 2)[1:]. If the slice does not contain exactly two elements, the assignment raises a ValueError.
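
To see the unpacking guard in action, here is a small sketch, assuming Python 3. The helper name last_two_fields is hypothetical and only wraps the expression above:

```python
def last_two_fields(line):
    # Hypothetical helper: returns (id, token) or raises ValueError.
    myid, token = line.rsplit(' ', 2)[1:]
    return myid, token

print(last_two_fields('COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T'))

try:
    last_two_fields('P17544')  # only one field, so the slice is empty
except ValueError as exc:
    print('unpacking failed:', exc)
```

A line with three or more space-separated fields always unpacks cleanly, so the guard only catches lines that are too short.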

Now myid is the id you want, and you remove the first and last characters from token with a slice: token = token[1:-1].

Then:

print(myid, token)
# P17544 436

Comment about looking up:

For looking up after parsing the lines of the file:

lookup = {}
for line in file:
    # do steps above so you have myid, token
    lookup[myid] = token

Then lookup['P17544'] will return '436'
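
Putting the pieces together, a minimal sketch assuming Python 3 and space-separated columns. The function name build_lookup and the in-memory sample list are illustrative stand-ins for the real file:

```python
def build_lookup(lines):
    """Map the id in column 5 to the digits of the token in column 6."""
    lookup = {}
    for line in lines:
        line = line.strip()  # remove the trailing newline
        if not line:
            continue  # skip blank lines
        myid, token = line.rsplit(' ', 2)[1:]
        lookup[myid] = token[1:-1]  # drop the first and last letter
    return lookup

sample = [
    'COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T\n',
    '\n',
    'COADREAD ATG10 Missense_Mutation NGXA-AB-A010 Q9H0Y0 H133N\n',
]
lookup = build_lookup(sample)
print(lookup['P17544'])
```

With a real file, an open file object can be passed directly, for example inside with open('text.txt') as fh: lookup = build_lookup(fh). Note that if the same id appears on several lines, the last one wins.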

Hope that's clearer...
