So I have a file like this on the server:
COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T
COADREAD ATG10 Missense_Mutation NGXA-AB-A010 Q9H0Y0 H133N
My goal is to find the id (P17544), which is in column 5 of the file, and capture and store the number from the token after it, which is 436 from A436T in column 6 (the number is supposed to be between two letters). I need to print that number later. Is there a way I can do this? I have worked a little with lxml before, but I am still not sure how to do this. Thanks in advance.
Here is what I have:
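file = open('text.txt', 'r')
lookup = {}
for line in file:
    # strip the newline, then take the last two space-separated fields
    myid, token = line.strip().rsplit(' ', 2)[1:]
    # drop the leading and trailing letters, e.g. 'A436T' -> '436'
    token = token[1:-1]
    lookup[myid] = token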
Sounds quite easy ... split along the spaces, then extract the fifth field ... and all digits from the sixth field. Or am I missing something?
The simplest method uses builtin str methods (see the Clarification below).
You could use regular expressions, though, if you wanted to require a number between two letters:
re.match(r'[A-Z](\d{3})[A-Z]', token)  # or similar...
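A hedged sketch of the regular-expression route, assuming every token has the form letter, digits, letter. The helper name extract_position is made up for illustration, and \d+ is used rather than a fixed width so that positions of any length match:

```python
import re

# Hypothetical helper, not part of the original answer.
# Pattern: one uppercase letter, one or more digits (captured), one uppercase letter.
TOKEN_RE = re.compile(r'[A-Z](\d+)[A-Z]')


def extract_position(token):
    """Return the digits between the two letters, or None if the token does not fit."""
    match = TOKEN_RE.fullmatch(token)
    return match.group(1) if match else None


print(extract_position('A436T'))
print(extract_position('H133N'))
```

Using fullmatch rather than match means a token with trailing junk is rejected instead of being partially matched.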
Clarification:
d.rsplit(' ', 2) starts splitting the string at spaces from the end, which returns ['COADREAD ATF7 Missense_Mutation NGXA-AZ-3984', 'P17544', 'A436T']. Assuming we're only looking for the last 2 elements, we remove the first one with a slice, so we get d.rsplit(' ', 2)[1:], which gives ['P17544', 'A436T'].
Using unpacking, we name our variables and also guarantee that the result has a length of two by using myid, token = d.rsplit(' ', 2)[1:]. If it doesn't have exactly two elements, the assignment fails.
Now that myid holds the id you want, you remove the first and last characters from token using slicing, which is token = token[1:-1].
Then:
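As a minimal sketch of the steps above, assuming d holds one line of the file (the sample line is taken from the question, and the strip() call is an addition to remove the trailing newline that a line read from a file would carry):

```python
# d is assumed to hold one line read from the file, newline included
d = 'COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T\n'

# remove the trailing newline, then split at most twice starting from the right
parts = d.strip().rsplit(' ', 2)

# keep the last two fields; unpacking raises ValueError if there are not exactly two
myid, token = parts[1:]

# drop the first and last characters of the token, leaving only the digits
number = token[1:-1]

print(myid, number)
```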
Comment about looking up:
For looking up after parsing the lines of the file:
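One way this could look, as a sketch only: the function name build_lookup is hypothetical, and the sample lines stand in for the real file. If the same id appears on several lines, later lines overwrite earlier ones here.

```python
def build_lookup(lines):
    """Map each id (column 5) to the number inside its token (column 6)."""
    lookup = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        myid, token = line.rsplit(' ', 2)[1:]
        lookup[myid] = token[1:-1]  # 'A436T' -> '436'
    return lookup


# Sample lines from the question; with a real file, pass the open file object instead.
sample = [
    'COADREAD ATF7 Missense_Mutation NGXA-AZ-3984 P17544 A436T\n',
    'COADREAD ATG10 Missense_Mutation NGXA-AB-A010 Q9H0Y0 H133N\n',
]
lookup = build_lookup(sample)
print(lookup['P17544'])
```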
Then lookup['P17544'] will return '436'
Hope that's clearer...