I am reading a text file in Python 3.6 and want to extract lines at a fixed offset from a search match, then convert them into a pandas DataFrame.
What works: searching for a phrase in the text document and converting the matching line into a pandas DataFrame.
import pandas as pd
df = pd.DataFrame()
list1 = []
list2 = []
with open('myfile.txt') as f:
    for lineno, line in enumerate(f, 1):
        if 'Project:' in line:
            line = line.strip('\n')
            list1.append(repr(line))
# Convert list1 into a df column
df = pd.DataFrame({'Project_Name': list1})
What doesn't work: Returning a relative line based on the search result. In my case I need to store the "relative" line -6 to -2 (earlier in the text) as Pandas columns.
with open('myfile.txt') as f:
    for lineno, line in enumerate(f, 1):
        if 'Project:' in line:
            list2.append(repr(line)-6) #<--- can't use math here
Returns: TypeError: unsupported operand type(s) for -: 'str' and 'int'
Also tried using a range with partial success:
with open('myfile.txt') as f:
    for lineno, line in enumerate(f, 1):
        if 'Project' in line:
            all_lines = f.readlines()
            required_lines = [all_lines[i] for i in range(lineno-6,lineno-2)]
            print (required_lines)
            list2.append(required_lines) #<-- does not work
Python will print the first 4 target lines but it does not seem to be able to save it as a list or loop through each finding of "Project" in the text doc. Is there a better way to save the results of the relative line above (or below) the search term? Thanks much.
Text data looks like:
0 Exhibit 3
1 Date: February 2018
2 Description
3 Description
4 Description
5 2015
6 2016
7 2017
8 2018
9 $100.50 <---- Add these as different dataframe columns
10 $120.33 <----
11 $135.88 <----
12 $140.22 <----
13 Project A
14
15 Exhibit 4
16 Date: February 2018
17 Description
18 Description
19 2015
20 2016
21 2017
22 2018
23 $899.25 <----
24 $901.00 <----
25 $923.43 <----
26 $1002.02 <----
27 Project B
The reason your second solution does not work is that you are reading the file through a generator-like object (f in your case), which stops once it has finished iterating through the file. Your loop, for lineno, line in enumerate(f, 1):, is meant to walk through the file line by line in a memory-efficient manner, reading only one line at a time. When you find a matching line you call all_lines = f.readlines(), which consumes the rest of the file. When the for loop asks for its next item, the exhausted iterator raises StopIteration, which ends the loop after the first match.
You can make your second solution work if you read the entire contents of the file first and then iterate through that list instead.
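A minimal sketch of that read-everything-first approach is below. The function names and the offset parameters are hypothetical, and the defaults assume (based on the sample data) that the four values sit directly above the 'Project' line; for the -6 to -2 window from the question, pass first_offset=6 and last_offset=2.

```python
def lines_above_matches(all_lines, needle="Project", first_offset=4, last_offset=1):
    """For every line containing `needle`, collect the lines that sit
    `first_offset` down to `last_offset` positions above it (inclusive)."""
    found = []
    for idx, line in enumerate(all_lines):
        if needle in line:
            # Clamp at 0 so a match near the top of the file does not
            # produce negative indices that wrap around to the end.
            start = max(idx - first_offset, 0)
            stop = max(idx - last_offset + 1, 0)
            found.append([entry.strip() for entry in all_lines[start:stop]])
    return found


def read_lines_above(path, **kwargs):
    # Read the whole file into a list first, so indexing the list never
    # interferes with the iteration over it.
    with open(path) as f:
        return lines_above_matches(f.readlines(), **kwargs)
```

Calling read_lines_above('myfile.txt') would then return one list of lines per match, and list2.extend(...) or list2.append(...) works as expected because the loop visits every match.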
If you want to be memory efficient, you can maintain a FIFO queue of the required number of lines.
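One way to sketch such a queue is collections.deque with maxlen, which drops the oldest entry automatically once it is full. The function name and the count default are assumptions for illustration; count=4 matches the four values in the sample data.

```python
from collections import deque


def values_before_projects(lines, needle="Project", count=4):
    """Yield (matching_line, previous_lines) pairs while keeping only
    the last `count` lines in memory."""
    window = deque(maxlen=count)  # oldest line is discarded when full
    for line in lines:
        line = line.strip()
        if needle in line:
            # Copy the window, because it keeps changing as we read on.
            yield line, list(window)
        window.append(line)
```

Because the function only needs an iterable of lines, the open file object can be passed in directly, for example list(values_before_projects(f)) inside a with open('myfile.txt') as f: block, so the file is still read one line at a time.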
This might do the trick; it assumes that there are always four values before the 'Project' line.
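As a rough sketch under that assumption, the rows could be collected as dictionaries and handed to pandas. The function names and the year-based column names are hypothetical; the years are taken from the sample data, where 2015 to 2018 precede the values.

```python
from collections import deque


def project_rows(lines, needle="Project", years=("2015", "2016", "2017", "2018")):
    """Build one dict per project: the project line plus one value per year.
    Assumes the len(years) lines directly above each match are the values."""
    window = deque(maxlen=len(years))
    rows = []
    for line in lines:
        line = line.strip()
        # Skip a match that does not yet have enough lines above it.
        if needle in line and len(window) == len(years):
            row = {"Project_Name": line}
            row.update(zip(years, window))  # pair each year with its value
            rows.append(row)
        window.append(line)
    return rows


def rows_to_frame(rows):
    # Imported here so project_rows can be used without pandas installed.
    import pandas as pd
    return pd.DataFrame(rows)
```

With the sample file, rows_to_frame(project_rows(f)) should give one row per project, with Project_Name and the four yearly values as separate columns.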
Or without the project included:
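A sketch of the values-only variant follows, again assuming four values directly above each 'Project' line. The function names and the generic column names Value_1 to Value_4 are placeholders, not part of the original code.

```python
from collections import deque


def value_rows(lines, needle="Project", count=4):
    """Return only the `count` lines above each match, leaving out the match."""
    window = deque(maxlen=count)
    rows = []
    for line in lines:
        line = line.strip()
        if needle in line:
            rows.append(list(window))
        else:
            window.append(line)
    return rows


def values_to_frame(rows, columns=("Value_1", "Value_2", "Value_3", "Value_4")):
    # Imported here so value_rows can be used without pandas installed.
    # Every row must have exactly len(columns) entries.
    import pandas as pd
    return pd.DataFrame(rows, columns=list(columns))
```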