How to read and organize text files divided by key

Posted 2019-08-15 01:24

I'm working on some Python code that reads a text file. The file contains the information needed to construct a certain geometry, and it is divided into sections by keywords. For example, the file:

*VERTICES
1 0 0 0
2 10 0 0
3 10 10 0
4 0 10 0
*EDGES
1 1 2
2 1 4
3 2 3
4 3 4

describes a square with vertices at (0,0), (0,10), (10,0), (10,10). The "*EDGES" section defines the connections between the vertices. The first number in each row is an ID number.

Here is my problem: the information in the text file is not necessarily in order. Sometimes the "Vertices" section appears first, and other times the "Edges" section comes first. I have other keywords as well, so I'm trying to avoid repeating if statements to test whether each line contains a new keyword.

What I have been doing is reading the text file multiple times, each time looking for a different keyword:

open file
read line by line
    if line == *VERTICES
        store all the following lines in a list until a new *command is encountered
close file
open file (again)
read line by line
    if line == *EDGES
        store all the following lines in a list until a new *command is encountered
close file
open file (again)
...

Can someone point out how I can identify these keywords without such a tedious procedure? Thanks.

6 answers
Deceive 欺骗
#2 · 2019-08-15 01:42

You can read the file once and store the contents in a dictionary. Since you have conveniently labeled the "command" lines with a *, you can use all lines beginning with a * as the dictionary key and all following lines as the values for that key. You can do this with a for loop:

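with open('geometry.txt') as f:
    x = {}
    key = None  # store the most recent "command" here
    for y in f:
        y = y.strip()  # drop the trailing newline and surrounding whitespace
        if not y:
            continue  # skip blank lines
        if y.startswith('*'):
            key = y[1:]  # your "command", without the leading '*'
            x[key] = []
        elif key is not None:
            x[key].append(y.split())  # add subsequent lines to the most recent key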

Or you can take advantage of Python's list and dictionary comprehensions to do the same thing in one line:

with open('geometry.txt') as f:
    x = {y.split('\n')[0]:[z.split() for z in y.strip().split('\n')[1:]] for y in f.read().split('*')[1:]}

I'll admit this is not very nice looking, but it gets the job done: it splits the entire file into chunks at the '*' characters, then uses newlines and spaces as delimiters to break each chunk into a dictionary key and a list of lists (the dictionary value).

Details about splitting, stripping, and slicing strings are covered in the Python documentation for str.split, str.strip, and sequence slicing.
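
For anyone who finds the one-liner hard to follow, the same split-based idea can be unrolled into separate steps. This is only a sketch: it parses the sample data from an in-memory string rather than a file, and the names `parse_sections` and `SAMPLE` are made up for illustration.

```python
SAMPLE = """*VERTICES
1 0 0 0
2 10 0 0
3 10 10 0
4 0 10 0
*EDGES
1 1 2
2 1 4
3 2 3
4 3 4
"""

def parse_sections(text):
    """Split text on '*' and return {section name: list of split rows}."""
    result = {}
    # Everything before the first '*' belongs to no section, so skip it.
    for chunk in text.split('*')[1:]:
        lines = chunk.strip().split('\n')
        name = lines[0].strip()  # first line of the chunk is the keyword
        # Remaining lines are data rows; ignore any blank ones.
        result[name] = [row.split() for row in lines[1:] if row.strip()]
    return result

sections = parse_sections(SAMPLE)
print(sections['EDGES'])
# [['1', '1', '2'], ['2', '1', '4'], ['3', '2', '3'], ['4', '3', '4']]
```

Note that this approach, like the one-liner, breaks if a '*' character can appear inside a data row, because the text is split on every '*'.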

做自己的国王
#3 · 2019-08-15 01:44

You should just create a dictionary of the sections. You could use a generator to read the file and yield each section in whatever order they arrive and build a dictionary from the results.
Here's some incomplete code that might help you along:

def load(f):
    with open(f) as file:
        section = next(file).strip()  # Assumes first line is always a section
        data = []
        for line in file:
            if line[0] == '*':        # Any appropriate test for a new section
                yield section, data
                section = line.strip()
                data = []
            else:
                data.append(list(map(int, line.strip().split())))
        yield section, data

Assuming the data above is in a file called data.txt:

>>> data = dict(load('data.txt'))
>>> data
{'*EDGES': [[1, 1, 2], [2, 1, 4], [3, 2, 3], [4, 3, 4]],
 '*VERTICES': [[1, 0, 0, 0], [2, 10, 0, 0], [3, 10, 10, 0], [4, 0, 10, 0]]}

Then you can reference each section, e.g.:

for edge in data['*EDGES']:
    ...
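
As one possible next step, the parsed sections can be turned into actual geometry. The sketch below assumes the layout described in the question (the first number in each row is an ID, and each edge row lists two vertex IDs) and further assumes the three remaining vertex numbers are x, y, z coordinates. The `data` literal stands in for the result of `dict(load('data.txt'))`, and `vertex_coords` and `edge_segments` are illustrative names.

```python
# Stand-in for the dictionary produced by dict(load('data.txt')).
data = {
    '*VERTICES': [[1, 0, 0, 0], [2, 10, 0, 0], [3, 10, 10, 0], [4, 0, 10, 0]],
    '*EDGES': [[1, 1, 2], [2, 1, 4], [3, 2, 3], [4, 3, 4]],
}

# Map each vertex ID to its (x, y, z) coordinates (assumed row layout: id x y z).
vertex_coords = {vid: (x, y, z) for vid, x, y, z in data['*VERTICES']}

# Resolve each edge into a pair of end-point coordinates, keyed by edge ID.
edge_segments = {
    eid: (vertex_coords[start], vertex_coords[end])
    for eid, start, end in data['*EDGES']
}

for eid, (p, q) in edge_segments.items():
    print(eid, p, '->', q)
```

Because both lookups go through dictionaries, the order in which the sections appeared in the file no longer matters at this point.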
一纸荒年 Trace。
#4 · 2019-08-15 01:49

A common strategy with this type of parsing is to build a function that can yield the data a section at a time. Then your top-level calling code can be fairly simple because it doesn't have to worry about the section logic at all. Here's an example with your data:

import sys

def main(file_path):
    # An example usage.
    for section_name, rows in sections(file_path):
        print('===============')
        print(section_name)
        for row in rows:
            print(row)

def sections(file_path):
    # Setup.
    section_name = None
    rows = []

    # Process the file.
    with open(file_path) as fh:
        for line in fh:
            # Section start: yield any rows we have so far,
            # and then update the section name.
            if line.startswith('*'):
                if rows:
                    yield (section_name, rows)
                    rows = []
                section_name = line[1:].strip()
            # Otherwise, just add another row.
            else:
                row = line.split()
                rows.append(row)

    # Don't forget the last batch of rows.
    if rows:
        yield (section_name, rows)

main(sys.argv[1])
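
Since the question mentions several keywords, one way to keep the top-level code free of if chains is to pair a section generator like the one above with a dictionary of per-keyword handlers. The following is a rough sketch rather than part of the answer itself: `iter_sections`, `HANDLERS`, and the two parser functions are hypothetical names, the row layouts are assumed from the sample file, and the input is an in-memory list of lines so the example runs on its own.

```python
def iter_sections(lines):
    """Yield (section_name, rows) pairs from an iterable of text lines."""
    name, rows = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith('*'):
            if name is not None:
                yield name, rows          # finish the previous section
            name, rows = line[1:], []
        elif name is not None:
            rows.append(line.split())
    if name is not None:
        yield name, rows                  # last section in the input

def parse_vertices(rows):
    # Assumed layout: id x y z
    return {int(r[0]): tuple(float(v) for v in r[1:]) for r in rows}

def parse_edges(rows):
    # Assumed layout: id start_vertex end_vertex
    return {int(r[0]): (int(r[1]), int(r[2])) for r in rows}

# One entry per keyword; supporting a new keyword means adding one entry here.
HANDLERS = {'VERTICES': parse_vertices, 'EDGES': parse_edges}

sample = ['*EDGES', '1 1 2', '2 1 4', '*VERTICES', '1 0 0 0', '2 10 0 0']
geometry = {}
for name, rows in iter_sections(sample):
    handler = HANDLERS.get(name.upper())
    if handler is None:
        raise ValueError('unknown section: ' + name)
    geometry[name.upper()] = handler(rows)

print(geometry['EDGES'])     # {1: (1, 2), 2: (1, 4)}
print(geometry['VERTICES'])  # {1: (0.0, 0.0, 0.0), 2: (10.0, 0.0, 0.0)}
```

The lookup in `HANDLERS` replaces the per-keyword if statements, and unknown keywords fail loudly instead of being silently skipped.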
在下西门庆
#5 · 2019-08-15 01:54

A dictionary is probably the way to go given that your data isn't ordered. You can access it by section name after reading the file into a list. Note that the with keyword closes your file automatically.

Here's what it might look like:

# read the data file into a simple list, dropping newlines and blank lines:
with open('file.dat') as f:
    lines = [line.strip() for line in f if line.strip()]

# get the line numbers for each section:
section_line_nos = [line_no for line_no, data in enumerate(lines) if data.startswith('*')]
# add a terminating line number to mark end of the file:
section_line_nos.append(len(lines))

# split each section off into a new list, all contained in a dictionary
# with the section names as keys
section_dict = {
    lines[section_line_no][1:]: lines[section_line_no + 1:section_line_nos[section_no + 1]]
    for section_no, section_line_no in enumerate(section_line_nos[:-1])
}

You will get a dictionary that looks like this:

{'VERTICES': ['1 0 0 0', '2 10 0 0', '3 10 10 0', '4 0 10 0'], 'EDGES': ['1 1 2', '2 1 4', '3 2 3', '4 3 4']}

Access each section this way:

section_dict['EDGES']

Note that the above code assumes each section starts with *, and that no other line starts with *. If the latter is not the case (some data lines also start with *), you could match against an explicit list of section names instead:

section_names = ['*EDGES', '*VERTICES']
section_line_nos = [line_no for line_no, data in enumerate(lines) if data.strip() in section_names]

Also note that this part of the section_dict code:

lines[section_line_no][1:]

...gets rid of the star at the beginning of each section name. If this is not desired, you can change that to:

lines[section_line_no]

If it is possible there will be undesired white space in your section name lines, you can do this to get rid of it:

lines[section_line_no].strip()[1:]

I haven't tested all of this yet but this is the general idea.

冷血范
#6 · 2019-08-15 01:55

The fact that the sections are unordered lends itself well, I think, to parsing them into a dictionary from which you can access values later. I wrote a function that you may find useful for this task:

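features = ['VERTICES', 'EDGES']  # section names as they appear in the sample file

def parseFile(dictionary, f, features):
    """
    Creates a format where you can access a shape feature like:
        dictionary[shapeID][feature] = [[1, 1, 1], [1, 1, 1], ...]

    Assumes: all features, although possibly out of order, occur grouped by shape:
        shape1
            *feature1
                .
                .
                .
            *featuren
    Assumes all possible features are in the list features

    f is the input file handle
    """
    shapeID = 0
    found = set()   # features seen so far for the current shape
    feature = None  # current feature; None until the first '*' line
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines

        if line.startswith('*'):
            feature = line[1:]  # current feature, e.g. VERTICES
            if found == set(features):
                # every feature was already seen, so this line starts a new shape
                found = set()
                shapeID += 1
            found.add(feature)

        elif feature is not None:
            # create the nested entries on first use, then append the row
            dictionary.setdefault(shapeID, {}).setdefault(feature, []).append(
                [int(i) for i in line.split()]
                )

    return dictionary

# example call, assuming the file is named 'data.txt':

with open('data.txt') as f:
    dictionary = parseFile({}, f, features)

shapeID = 0  # first shape in the file

# to access the shape features you can get vertices like:

for vertex in dictionary[shapeID]['VERTICES']:
    print(vertex)

# to access edges

for edge in dictionary[shapeID]['EDGES']:
    print(edge)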
爷、活的狠高调
#7 · 2019-08-15 01:55

Assuming your file is named 'data.txt'

from collections import defaultdict

def get_data():
    d = defaultdict(list)
    with open('data.txt') as f:
        key = None
        for line in f:
            if line.startswith('*'):
                key = line.rstrip()
                continue
            d[key].append(line.rstrip())
    return d

The returned defaultdict looks like this:

defaultdict(list,
            {'*EDGES': ['1 1 2', '2 1 4', '3 2 3', '4 3 4'],
             '*VERTICES': ['1 0 0 0', '2 10 0 0', '3 10 10 0', '4 0 10 0']})

You access the data just like a normal dictionary

d['*EDGES']
['1 1 2', '2 1 4', '3 2 3', '4 3 4']
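
One property of the defaultdict approach that may be worth knowing: if a keyword appears more than once in the file, the rows of the later block are appended to the same list instead of replacing it. Here is a small sketch to illustrate, using io.StringIO in place of a real file; `read_sections` is a hypothetical variant of `get_data` that takes a file object and also skips blank lines.

```python
import io
from collections import defaultdict

def read_sections(fileobj):
    """Group lines under the most recent '*' keyword line."""
    d = defaultdict(list)
    key = None  # lines seen before any keyword end up under the key None
    for line in fileobj:
        line = line.strip()
        if not line:
            continue             # skip blank lines
        if line.startswith('*'):
            key = line           # keep the leading '*', as get_data does
            continue
        d[key].append(line)
    return d

# The *EDGES keyword appears twice here, before and after *VERTICES.
text = "*EDGES\n1 1 2\n*VERTICES\n1 0 0 0\n2 10 0 0\n*EDGES\n2 1 4\n"
d = read_sections(io.StringIO(text))

print(d['*EDGES'])      # ['1 1 2', '2 1 4']
print(d['*VERTICES'])   # ['1 0 0 0', '2 10 0 0']
```

If repeated keywords should be treated as an error instead, a plain dict with an explicit check for an existing key would be the simpler choice.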