Use Python to manipulate txt file presentation of

Posted 2019-07-29 03:13

Question:

I am trying to use Python in order to manipulate a text file from Format A:

Key1  
Key1value1  
Key1value2  
Key1value3  
Key2  
Key2value1  
Key2value2  
Key2value3  
Key3... 

Into Format B:

Key1 Key1value1  
Key1 Key1value2  
Key1 Key1value3  
Key2 Key2value1  
Key2 Key2value2  
Key2 Key2value3  
Key3 Key3value1...

Specifically, here is a brief look at the file itself (only one key shown, thousands more in the full file):

chr22:16287243: PASS  
patientID1  G/G  
patientID2  G/G  
patientID3  G/G

And the desired output here:

chr22:16287243: PASS  patientID1    G/G  
chr22:16287243: PASS  patientID2    G/G  
chr22:16287243: PASS  patientID3    G/G

I've written the following code, which detects and displays the keys, but I am having trouble storing the values associated with each key and then printing the key-value pairs. Can anyone please help with this?

import re

records = []

with open('filepath', 'r') as infile:
    for line in infile:
        # all variants start with "chr" followed by a digit
        variant = re.search(r"\Achr\d", line, re.I)
        if variant:
            records.append(line.rstrip("\n"))
            # parse lines until a new variant is encountered

for r in records:
    print(r)

Answer 1:

Do it in one pass, without storing the lines:

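with open("input") as infile, open("output", "w") as outfile:
    for line in infile:
        if line.startswith("chr"):
            # a key line: remember it for the value lines that follow
            key = line.strip()
        else:
            # a value line: write it after the most recent key
            print(key, line.rstrip("\n"), file=outfile)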

This code assumes the first line of the file is a key. If a value line comes first, key is still undefined and the code raises a NameError.
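
If the input cannot be trusted to start with a key, one option is to initialise the key to None and skip value lines until the first key appears. The following is only a sketch of that idea: the function name flatten, the key_prefix parameter, the blank-line skipping, and the "orphan" sample line are assumptions for illustration, not part of the answer above.

```python
import io


def flatten(infile, outfile, key_prefix="chr"):
    """Write one 'key value' line per value; ignore values seen before any key."""
    key = None  # no key has been seen yet
    for line in infile:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(key_prefix):
            key = line  # remember the most recent key
        elif key is not None:
            print(key, line, file=outfile)


# In-memory text stands in for the real input and output files.
src = io.StringIO(
    "orphan\n"
    "chr22:16287243: PASS\n"
    "patientID1  G/G\n"
    "patientID2  G/G\n"
)
dst = io.StringIO()
flatten(src, dst)
result = dst.getvalue()
```

With real files, the same function could be called as flatten(infile, outfile) inside the with block shown above. Silently dropping orphan values is a design choice; raising an error instead may be preferable if such lines indicate a corrupt file.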



Answer 2:

First, to test whether a string starts with a fixed character sequence, you don't need regular expressions. str.startswith is much simpler and easier to read:

if line.startswith("chr"):

The next step is a very simple state machine that remembers the most recent key:

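current_key = ""

# "input" is a placeholder path for the text file
with open("input") as infile:
    for line in infile:
        if line.startswith("chr"):
            # a new key starts here; remember it
            current_key = line.strip()
        else:
            # a value line: print it next to the current key
            print(" ".join([current_key, line.strip()]))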
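
The question also asks how to store the values associated with each key. The same state machine can fill a dictionary instead of printing straight away. This is a rough sketch under some assumptions: the function name group_values and the key_prefix parameter are hypothetical, and the second variant in the sample data is made up for illustration.

```python
def group_values(lines, key_prefix="chr"):
    """Map each key line to the list of value lines that follow it."""
    groups = {}
    current_key = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(key_prefix):
            current_key = line
            groups.setdefault(current_key, [])  # keep keys that have no values
        elif current_key is not None:
            groups[current_key].append(line)
    return groups


# Sample lines; the second variant is invented for this example.
sample = [
    "chr22:16287243: PASS\n",
    "patientID1  G/G\n",
    "patientID2  G/G\n",
    "chr22:16287300: PASS\n",
    "patientID1  A/G\n",
]
groups = group_values(sample)

# Flatten the stored groups back into "key value" rows.
rows = [f"{key} {value}" for key, values in groups.items() for value in values]
```

Holding everything in a dictionary costs memory proportional to the file size, so for a file with thousands of keys the streaming version is likely the better default unless the grouped data is needed later.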


Answer 3:

If there are always the same number of values per key, islice is useful:

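from itertools import islice

with open('input.txt') as fin, open('output.txt','w') as fout:
    for k in fin:
        # k is a key line; the next 3 lines of the same file are its values
        for v in islice(fin,3):
            # v keeps its trailing newline, so each pair ends its own line
            fout.write(' '.join((k.strip(),v)))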
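
The number of values per key is hard-coded as 3 above. A possible variation makes it a parameter and strips each value before writing. It is a sketch only: flatten_fixed and values_per_key are hypothetical names, and the short sample text is invented for the example.

```python
import io
from itertools import islice


def flatten_fixed(fin, fout, values_per_key=3):
    """Pair each key line with the fixed number of value lines that follow it."""
    lines = iter(fin)  # one shared iterator, so islice consumes from the same stream
    for key in lines:
        key = key.strip()
        for value in islice(lines, values_per_key):
            fout.write(f"{key} {value.strip()}\n")


# In-memory text stands in for input.txt and output.txt.
src = io.StringIO("K1\na\nb\nK2\nc\nd\n")
dst = io.StringIO()
flatten_fixed(src, dst, values_per_key=2)
result = dst.getvalue()
```

This approach relies entirely on the count being right: a single missing or extra value line, or a stray blank line, shifts every later key, so the startswith-based versions are safer when the file layout is not guaranteed.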