Python 3 Reading Relative Lines in a text document

Posted 2019-07-29 06:12

Question:

I'm reading a text file in Python 3.6 and want to extract lines relative to a search match, then convert them into a pandas DataFrame.

What works: Searching for a phrase in a text document and converting the line into a pandas df.

import pandas as pd
df = pd.DataFrame()
list1 = []
list2 = []

with open('myfile.txt') as f:
    for lineno, line in enumerate(f, 1):
        if 'Project:' in line:
            line = line.strip('\n')
            list1.append(repr(line))

# Convert list1 into a df column
df = pd.DataFrame({'Project_Name':list1})

What doesn't work: Returning a relative line based on the search result. In my case I need to store the "relative" line -6 to -2 (earlier in the text) as Pandas columns.

with open('myfile.txt') as f:
    for lineno, line in enumerate(f, 1):
        if 'Project:' in line:
            list2.append(repr(line)-6)  #<--- can't use math here

Returns: TypeError: unsupported operand type(s) for -: 'str' and 'int'

Also tried using a range with partial success:

with open('myfile.txt') as f:
    for lineno, line in enumerate(f, 1):
        if 'Project' in line:
            all_lines = f.readlines()
            required_lines = [all_lines[i] for i in range(lineno-6,lineno-2)]
            print (required_lines)
            list2.append(required_lines)  #<-- does not work

Python will print the first 4 target lines but it does not seem to be able to save it as a list or loop through each finding of "Project" in the text doc. Is there a better way to save the results of the relative line above (or below) the search term? Thanks much.

Text data looks like:

0  Exhibit 3
1  Date: February 2018
2  Description
3  Description
4  Description
5  2015
6  2016
7  2017
8  2018
9  $100.50    <----  Add these as different dataframe columns
10 $120.33    <----
11 $135.88    <----
12 $140.22    <----
13 Project A
14
15 Exhibit 4
16 Date: February 2018
17 Description
18 Description
19 2015
20 2016
21 2017
22 2018
23 $899.25    <----
24 $901.00    <----
25 $923.43    <----
26 $1002.02   <----
27 Project B

Answer 1:

This might do the trick. It assumes that there are always four values before the 'Project' line.

>>> import pandas as pd
>>> a = []
>>> with open('test.txt') as f:
...     prev_lines = []
...     for line in f:
...         prev_lines.append(line.strip('\n'))
...         if 'Project' in line:
...             a.append(prev_lines[-5:])
...             del prev_lines[:]
>>> df = pd.DataFrame(a, columns=list('ABCDi'))
>>> df
         A        B        C         D          i
0  $100.50  $120.33  $135.88   $140.22  Project A
1  $899.25  $901.00  $923.43  $1002.02  Project B

Or without the project included:

>>> a = []
>>> with open('test.txt') as f:
...     prev_lines = []
...     for line in f:
...         prev_lines.append(line.strip('\n'))
...         if 'Project' in line:
...             a.append(prev_lines[-5:-1])
...             del prev_lines[:]
>>> df = pd.DataFrame(a, columns=list('ABCD'))
>>> df
         A        B        C         D
0  $100.50  $120.33  $135.88   $140.22
1  $899.25  $901.00  $923.43  $1002.02


Answer 2:

Your second attempt fails because a file object (f in your case) is an iterator: once it has been read through to the end of the file, it yields nothing more.

The loop for lineno, line in enumerate(f, 1): is meant to walk the file line by line in a memory-efficient way, reading only one line at a time. When you find a matching line you call all_lines = f.readlines(), which consumes the rest of the file. When the loop then asks for the next line, the iterator raises StopIteration, so the loop ends after the first match. Note also that all_lines holds only the lines after the match, so indexing it with lineno-6 does not point at the lines above the match.

You can make your second solution work if you read the entire contents of the file first and then iterate through that list instead.
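
As a rough sketch of that approach (the function names, the keyword default, and the count of four lines are my own assumptions based on the sample data, not something from the question), you could read everything into a list and slice backwards from each match:

```python
def rows_before_matches(all_lines, keyword="Project", count=4):
    """For each line containing `keyword`, collect the `count` lines above it."""
    rows = []
    for idx, line in enumerate(all_lines):
        if keyword in line:
            start = max(idx - count, 0)  # avoid negative indices near the top
            rows.append(all_lines[start:idx])
    return rows


def read_rows(path, keyword="Project", count=4):
    """Hypothetical helper: load the whole file, then search the list."""
    # Reading everything up front means indexing never competes with iteration.
    with open(path) as f:
        all_lines = f.read().splitlines()
    return rows_before_matches(all_lines, keyword, count)
```

Calling read_rows('myfile.txt') should then give one list per "Project" line, which you can pass to pd.DataFrame(rows, columns=[...]) as long as every match has the same number of lines above it.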

If you want to be memory efficient, you can maintain a FIFO queue of the required number of lines.
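
One way to sketch the queue idea is with collections.deque and its maxlen argument, which discards the oldest entry automatically. The function name and the defaults below are assumptions for illustration only:

```python
from collections import deque


def rows_before_matches_streaming(lines, keyword="Project", count=4):
    """Scan `lines` once, keeping only the last `count` lines in memory."""
    window = deque(maxlen=count)  # oldest line drops out automatically
    rows = []
    for raw in lines:
        line = raw.rstrip("\n")
        if keyword in line:
            # Snapshot the lines seen just before this match.
            rows.append(list(window))
        window.append(line)
    return rows
```

Because the function accepts any iterable of lines, you can pass the open file object directly, e.g. inside with open('myfile.txt') as f: call rows_before_matches_streaming(f). The file is read only once and never held in memory as a whole.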