So I have a file that looks like this:
COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T
COADREAD ATG10 Missense_Mutation NGXA-AB-A010 Q9H0Y0 H133N
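My goal is to find the ID in column 5 of the file (e.g. P17544) and capture/store the token that follows it, so I can print it later. The token is the number 436 from column 6 (A436T); the number is assumed to sit between two letters. Is there a way I can do this? I have worked with lxml a bit, but still don't know how to do this. Thanks in advance.
Here is what I have:
file = open('text.txt', 'r')
lookup = {}
for line in file:
    # strip the trailing newline first, otherwise token[1:-1] drops it instead of the last letter
    myid, token = line.strip().rsplit(' ', 2)[1:]
    token = token[1:-1]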
The simplest method uses the built-in str methods:
d = 'COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T'
myid, token = d.rsplit(' ', 2)[1:]  # raises ValueError if it can't be unpacked, so you know you have exactly 2 elements
token = token[1:-1]
You could use a regular expression instead if you want to require digits between two letters. Apply it to the original column 6 value, before the slice above:
re.match(r'[A-Z](\d{3})[A-Z]', 'A436T').group(1)  # or similar
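A minimal sketch of the regular-expression variant, written for Python 3. The helper name extract_position and the exact pattern are assumptions for illustration: the pattern expects one uppercase letter, one or more digits, then one uppercase letter, and it is applied to the full column 6 value rather than the sliced one.

```python
import re

# Hypothetical pattern: one uppercase letter, digits, one uppercase letter (e.g. 'A436T').
TOKEN_RE = re.compile(r'^[A-Z](\d+)[A-Z]$')


def extract_position(token):
    """Return the digits between the two letters, or None if the token does not match."""
    match = TOKEN_RE.match(token)
    return match.group(1) if match else None


print(extract_position('A436T'))  # 436
print(extract_position('H133N'))  # 133
print(extract_position('436'))    # None, because the surrounding letters are missing
```

Unlike plain slicing, this rejects values that do not have the letter-digits-letter shape, which may or may not be what you want for your data.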
Clarification:
d.rsplit(' ', 2) splits the string at ' ' starting from the end, at most twice, which returns ['COADREAD ATF7 Missense_Mutation NGXA-AZ-3984', 'P17544', 'A436T']. Since we only want the last 2 elements, we remove the first one with a slice, so d.rsplit(' ', 2)[1:] gives ['P17544', 'A436T'].
Using unpacking, we name our variables and also guarantee the result has a length of two with myid, token = d.rsplit(' ', 2)[1:]. If it doesn't have exactly two elements, the assignment fails.
Now that myid holds the ID you want, remove the first and last characters from token using slicing: token = token[1:-1].
Then:
print myid, token
# P17544 436
Comment about looking up:
For looking up after parsing the lines of the file:
lookup = {}
for line in file:
    # do steps above so you have myid, token
    lookup[myid] = token
Then lookup['P17544'] will return '436'
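Putting the steps together, a sketch in Python 3 might look like the following. The function name build_lookup and the in-memory sample list are assumptions for illustration; an open file object can be passed in place of the list, since both yield lines.

```python
def build_lookup(lines):
    """Map each ID (5th column) to the number inside the 6th column."""
    lookup = {}
    for line in lines:
        line = line.strip()  # drop the trailing newline before slicing
        if not line:
            continue  # skip blank lines
        # Unpacking raises ValueError if the line has fewer than three fields.
        myid, token = line.rsplit(' ', 2)[1:]
        lookup[myid] = token[1:-1]  # strip the leading and trailing letter
    return lookup


sample = [
    'COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T\n',
    'COADREAD ATG10 Missense_Mutation NGXA-AB-A010 Q9H0Y0 H133N\n',
]
lookup = build_lookup(sample)
print(lookup['P17544'])  # 436
print(lookup['Q9H0Y0'])  # 133
```

Note that if the same ID appears on several lines, a plain dict keeps only the last token seen for it.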
Hope that's clearer...
Sounds simple enough... split on whitespace, then take the fifth field, and all the digits from the sixth field. Or am I missing something?
>>> tokens = "COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T".split()
>>> print tokens[4]
P17544
>>> print ''.join([c for c in tokens[5] if c.isdigit()])
436
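The same split-and-filter idea can be wrapped in a small function. This is only a sketch in Python 3; the name parse_line and the six-field check are assumptions, not part of the session above.

```python
def parse_line(line):
    """Return (id, digits) taken from the 5th and 6th whitespace-separated fields."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError('expected at least 6 fields, got %d' % len(fields))
    myid = fields[4]
    # Keep every digit character of the 6th field, wherever it appears.
    digits = ''.join(c for c in fields[5] if c.isdigit())
    return myid, digits


print(parse_line('COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T'))
```

Because the filter keeps every digit in the field, it does not check that the digits actually sit between two letters; that is fine for values like A436T but worth knowing if the column can contain other shapes.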