I'm trying to process a file from the Protein Data Bank whose fields are separated by spaces (not tabs). I have a .txt file and I want to extract specific rows and, from those rows, only a few columns.
I need to do it in Python. I first tried it on the command line with awk, which worked without problems, but I don't know how to do the same in Python.
Here is an extract of my file:
[...]
SEQRES 6 B 80 ALA LEU SER ILE LYS LYS ALA GLN THR PRO GLN GLN TRP
SEQRES 7 B 80 LYS PRO
HELIX 1 1 THR A 68 SER A 81 1 14
HELIX 2 2 CYS A 97 LEU A 110 1 14
HELIX 3 3 ASN A 122 SER A 133 1 12
[...]
For example, I'd like to take only the 'HELIX' rows and, from those, the 4th, 6th, 7th and 9th columns. So far I read the file line by line with a for loop and pick out the rows starting with 'HELIX', but that's as far as I got.
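To make the goal concrete, here is a minimal sketch of the row and column selection described above. It assumes the fields are always separated by whitespace and never run together, and that columns are counted from 1 as in awk. The function name `extract_columns` and its parameters are hypothetical, chosen only for illustration.

```python
def extract_columns(lines, record="HELIX", columns=(4, 6, 7, 9)):
    """Return the chosen 1-based columns of each line whose first field is `record`.

    Assumption: fields are whitespace-separated; lines with too few
    fields are skipped rather than raising an IndexError.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == record and len(fields) >= max(columns):
            # Convert awk-style 1-based column numbers to 0-based indexes.
            rows.append([fields[c - 1] for c in columns])
    return rows


# Lines taken from the extract shown in the question.
sample = [
    "SEQRES 7 B 80 LYS PRO",
    "HELIX 1 1 THR A 68 SER A 81 1 14",
    "HELIX 2 2 CYS A 97 LEU A 110 1 14",
    "HELIX 3 3 ASN A 122 SER A 133 1 12",
]

for row in extract_columns(sample):
    print(" ".join(row))
```

For the first HELIX line this should give THR 68 SER 81. Any iterable of lines can be passed in, including an open file object.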
EDIT: This is the code I have right now, but the print doesn't work properly: it prints only the first line of each block (HELIX, SHEET and DBREF).
#!/usr/bin/python
import sys

for line in open(sys.argv[1]):
    if 'HELIX' in line:
        helix = line.split()
    elif 'SHEET' in line:
        sheet = line.split()
    elif 'DBREF' in line:
        dbref = line.split()
print (helix), (sheet), (dbref)
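One possible reason only a single line per block shows up is that `helix`, `sheet` and `dbref` are each reassigned on every matching line, so only one value survives until the print. The sketch below appends each matching line to a list instead. It is an illustration under assumptions, not a definitive fix: `group_records` is a hypothetical helper name, and the record type is assumed to be the first whitespace-separated field of the line.

```python
from collections import defaultdict


def group_records(lines, wanted=("HELIX", "SHEET", "DBREF")):
    """Collect the split fields of every line, grouped by record type.

    Assumption: the record type is the first whitespace-separated field.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if fields and fields[0] in wanted:
            # Append rather than assign, so earlier lines are not overwritten.
            groups[fields[0]].append(fields)
    return groups


# Lines taken from the extract in the question; it contains no SHEET or
# DBREF lines, so those groups stay empty here.
sample = [
    "SEQRES 7 B 80 LYS PRO",
    "HELIX 1 1 THR A 68 SER A 81 1 14",
    "HELIX 2 2 CYS A 97 LEU A 110 1 14",
    "HELIX 3 3 ASN A 122 SER A 133 1 12",
]

groups = group_records(sample)
for name in ("HELIX", "SHEET", "DBREF"):
    for fields in groups[name]:
        print(fields)
```

To run this on a real file, the open file can be passed in place of `sample`, for example inside `with open(sys.argv[1]) as handle:`.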