Is there a way to find the phrase and capture next t

I have a file like this on the server:

COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T

COADREAD ATG10 Missense_Mutation NGXA-AB-A010 Q9H0Y0 H133N

My goal is to find the id (P17544), which is in column 5 of the file, and capture and store the number from the token after it (I need to print that number later). Here that number is 436, taken from A436T in column 6; it is always supposed to sit between two letters. Is there a way to do this? I have worked a little with lxml before, but I'm still not sure how to approach this. Thanks in advance.

Here is what I have

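file = open('text.txt', 'r')

lookup = {}

for line in file:
    line = line.strip()  # drop the trailing newline
    if not line:
        continue  # skip blank lines between records
    myid, token = line.rsplit(' ', 2)[1:]
    token = token[1:-1]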

Tags: python parsing

2 answers

神经病院院长

Reply #2 · 2019-07-31 07:43

Sounds quite easy: split on whitespace, then take the fifth field and all the digits from the sixth field. Or am I missing something?

>>> tokens = "COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T".split()
>>> print(tokens[4])
P17544
>>> print(''.join(c for c in tokens[5] if c.isdigit()))
436
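Applied line by line, the same split-and-filter idea might look like the sketch below. It assumes Python 3, whitespace-separated fields, and blank lines between records as in the question; `parse_line` is a hypothetical helper name, and the `sample` list stands in for the lines of the real file.

```python
def parse_line(line):
    """Return (id, digits) for one record, or None for a blank or short line."""
    fields = line.split()
    if len(fields) < 6:
        return None  # not a full record (e.g. a blank line)
    # Keep only the digit characters of the sixth field, e.g. 'A436T' -> '436'.
    digits = ''.join(c for c in fields[5] if c.isdigit())
    return fields[4], digits


# Stand-in for iterating over the real file.
sample = [
    "COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T\n",
    "\n",
    "COADREAD ATG10 Missense_Mutation NGXA-AB-A010 Q9H0Y0 H133N\n",
]
results = [r for r in (parse_line(l) for l in sample) if r is not None]
print(results)  # [('P17544', '436'), ('Q9H0Y0', '133')]
```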


劳资没心，怎么记你

Reply #3 · 2019-07-31 07:47

Simplest method using builtin str methods:

d = 'COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T'
myid, token = d.rsplit(' ', 2)[1:] # raises ValueError if it can't be unpacked, so you know you've got exactly 2 elements
token = token[1:-1]

You could use a regular expression instead if you want to require that the number sits between two letters. Apply it to the unsliced sixth field (that is, before token = token[1:-1]): re.match(r'[A-Z](\d{3})[A-Z]', token), or similar.
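A minimal sketch of the regular-expression variant, assuming the sixth field is always one uppercase letter, some digits, then one uppercase letter. Using `\d+` instead of `\d{3}` is an assumption made here so that numbers of other lengths also match; `extract_number` is a hypothetical helper name.

```python
import re

# One uppercase letter, one or more digits (captured), one uppercase letter.
PATTERN = re.compile(r'^[A-Z](\d+)[A-Z]$')


def extract_number(field):
    """Return the digits between the two letters, or None if the shape is wrong."""
    m = PATTERN.match(field)
    return m.group(1) if m else None


print(extract_number('A436T'))  # 436
print(extract_number('H133N'))  # 133
print(extract_number('A43'))    # None (no trailing letter)
```

Unlike plain slicing, this rejects fields that do not have the expected letter-digits-letter shape instead of silently returning whatever sits in the middle.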

Clarification:

d.rsplit(' ', 2) splits the string at spaces starting from the end, at most twice, which returns ['COADREAD ATF7 Missense_Mutation NGXA-AZ-3984', 'P17544', 'A436T']. Since we only want the last two elements, we drop the first one with a slice: d.rsplit(' ', 2)[1:] gives ['P17544', 'A436T'].

Unpacking names our variables and also guarantees the result has exactly two elements: myid, token = d.rsplit(' ', 2)[1:] raises a ValueError if it doesn't.

Now that myid holds the id you want, remove the first and last characters from token with slicing: token = token[1:-1].
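To illustrate the unpacking check: a line with too few fields raises ValueError, which can be caught to skip or report a malformed record. This is a sketch assuming Python 3; `split_id_token` is a hypothetical helper name.

```python
def split_id_token(line):
    """Return (myid, token) from the last two fields, or None if they are missing."""
    try:
        myid, token = line.strip().rsplit(' ', 2)[1:]
    except ValueError:
        # Fewer than three space-separated fields: nothing sensible to unpack.
        return None
    return myid, token


good = split_id_token('COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T')
bad = split_id_token('P17544')
print(good)  # ('P17544', 'A436T')
print(bad)   # None
```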

Then:

print(myid, token)
# P17544 436

Comment about looking up:

For looking up after parsing the lines of the file:

lookup = {}
for line in file:
    # do steps above so you have myid, token
    lookup[myid] = token

Then lookup['P17544'] will return '436'
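Putting the steps together, a self-contained sketch might look like this. It assumes Python 3 and blank lines between records as in the question; `build_lookup` is a hypothetical name, and `io.StringIO` stands in for the real text.txt so the example runs on its own.

```python
import io


def build_lookup(lines):
    """Map each id (column 5) to the number inside the token in column 6."""
    lookup = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        myid, token = line.rsplit(' ', 2)[1:]
        lookup[myid] = token[1:-1]  # drop the leading and trailing letter
    return lookup


# Stand-in for the real file.
fake_file = io.StringIO(
    "COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T\n"
    "\n"
    "COADREAD ATG10 Missense_Mutation NGXA-AB-A010 Q9H0Y0 H133N\n"
)
lookup = build_lookup(fake_file)
print(lookup['P17544'])  # 436
```

With the real file, the call would be wrapped in `with open('text.txt') as f: lookup = build_lookup(f)`. Note that if the same id appears on several lines, the last one wins.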

Hope that's clearer...
