Python 3 Reading Relative Lines in a text document

I am working in Python 3.6, reading a text file to extract lines at a relative offset from a search hit and convert them into a pandas dataframe.

What works: Searching for a phrase in a text document and converting the line into a pandas df.

import pandas as pd
df = pd.DataFrame()
list1 = []
list2 = []

with open('myfile.txt') as f:
    for lineno, line in enumerate(f, 1):
        if 'Project:' in line:
            line = line.strip('\n')
            list1.append(repr(line))

# Convert list1 into a df column
df = pd.DataFrame({'Project_Name':list1})

What doesn't work: Returning a relative line based on the search result. In my case I need to store the "relative" line -6 to -2 (earlier in the text) as Pandas columns.

with open('myfile.txt') as f:
    for lineno, line in enumerate(f, 1):
        if 'Project:' in line:
            list2.append(repr(line)-6)  #<--- can't use math here

Returns: TypeError: unsupported operand type(s) for -: 'str' and 'int'

Also tried using a range with partial success:

with open('myfile.txt') as f:
    for lineno, line in enumerate(f, 1):
        if 'Project' in line:
            all_lines = f.readlines()
            required_lines = [all_lines[i] for i in range(lineno-6,lineno-2)]
            print (required_lines)
            list2.append(required_lines)  #<-- does not work

Python will print the first 4 target lines but it does not seem to be able to save it as a list or loop through each finding of "Project" in the text doc. Is there a better way to save the results of the relative line above (or below) the search term? Thanks much.

Text data looks like:

0  Exhibit 3
1  Date: February 2018
2  Description
3  Description
4  Description
5  2015
6  2016
7  2017
8  2018
9  $100.50    <----  Add these as different dataframe columns
10 $120.33    <----
11 $135.88    <----
12 $140.22    <----
13 Project A
14
15 Exhibit 4
16 Date: February 2018
17 Description
18 Description
19 2015
20 2016
21 2017
22 2018
23 $899.25    <----
24 $901.00    <----
25 $923.43    <----
26 $1002.02   <----
27 Project B

Tags: pandas search text python-3.6 relative

2 Answers

放我归山

#2 · 2019-07-29 06:43

Your second solution fails because you are reading the file through an iterator (f in your case). Once an iterator has been read to the end, it yields nothing more.

The loop for lineno, line in enumerate(f, 1): is meant to walk the file line by line in a memory-efficient way, reading only one line at a time. When you find a matching line you call all_lines = f.readlines(), which consumes the rest of the file. (It also means all_lines holds only the lines after the match, so its indices no longer line up with lineno.) When the for loop then asks for the next line, the iterator raises StopIteration, which ends the loop, so only the first match is ever processed.
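The exhaustion described above can be reproduced in a few lines. This is only a small sketch: io.StringIO stands in for the real file, and the four-line sample text is made up for illustration.

```python
import io

# Stand-in for the text file; io.StringIO iterates line by line like a file object.
f = io.StringIO("a\nProject A\nb\nProject B\n")

matches = []
rest = []
for lineno, line in enumerate(f, 1):
    if 'Project' in line:
        matches.append(lineno)
        # Reads everything AFTER the current line and leaves the iterator empty.
        rest = f.readlines()

print(matches)  # line numbers of the matches the loop actually saw
print(rest)     # the lines that followed the first match
```

Even though the sample contains two 'Project' lines, the loop records only the first one, because readlines() leaves nothing for the next iteration.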

You can make your second solution work if you read the entire contents of the file first and then iterate through that list instead.
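A minimal sketch of that approach follows. It assumes the wanted values are the four lines directly above each 'Project' line, as in the sample data; the helper name lines_before_matches, its parameters, and the shortened SAMPLE text are hypothetical and only there to make the example self-contained.

```python
import io

# Shortened stand-in for the file contents shown in the question.
SAMPLE = """Exhibit 3
2018
$100.50
$120.33
$135.88
$140.22
Project A

Exhibit 4
2018
$899.25
$901.00
$923.43
$1002.02
Project B
"""

def lines_before_matches(all_lines, phrase='Project', count=4):
    """Return the `count` lines directly above every line containing `phrase`."""
    rows = []
    for i, line in enumerate(all_lines):
        # Skip matches that do not have enough lines above them.
        if phrase in line and i >= count:
            rows.append(all_lines[i - count:i])
    return rows

# With a real file: with open('myfile.txt') as f: all_lines = f.read().splitlines()
all_lines = io.StringIO(SAMPLE).read().splitlines()
rows = lines_before_matches(all_lines)
print(rows)
```

Each entry in rows is one list of four strings, so the result can be passed to pd.DataFrame(rows, columns=...) to get one column per value. Adjust the slice if your offsets differ from the four-lines-above assumption.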

If you want to be memory efficient, you can maintain a FIFO queue of the required number of lines.
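One way to sketch such a queue is collections.deque with maxlen, which discards the oldest line automatically. The function name, its parameters, and the sample text below are assumptions for illustration, and the example again assumes four values directly above each 'Project' line.

```python
import io
from collections import deque

# Shortened stand-in for the file contents shown in the question.
SAMPLE = """Exhibit 3
$100.50
$120.33
$135.88
$140.22
Project A

Exhibit 4
$899.25
$901.00
$923.43
$1002.02
Project B
"""

def lines_before_matches_streaming(f, phrase='Project', count=4):
    """Stream through `f`, keeping only the last `count` lines in memory."""
    recent = deque(maxlen=count)  # the oldest line drops off when a new one is added
    rows = []
    for line in f:
        line = line.rstrip('\n')
        if phrase in line:
            # Copy the buffer; it holds the lines directly above the match.
            rows.append(list(recent))
        recent.append(line)
    return rows

# With a real file, pass the open file object instead of io.StringIO(SAMPLE).
rows = lines_before_matches_streaming(io.StringIO(SAMPLE))
print(rows)
```

Because the file is read once from start to finish, every match is found, and memory use stays bounded by count regardless of file size. A match with fewer than count lines above it would produce a shorter row.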


等我变得足够好

#3 · 2019-07-29 06:46

This might do the trick. It does assume that there are always four values before the 'Project' line.

>>> import pandas as pd
>>> a = []
>>> with open('test.txt') as f:
...     prev_lines = []
...     for line in f:
...         prev_lines.append(line.strip('\n'))
...         if 'Project' in line:
...             a.append(prev_lines[-5:])
...             del prev_lines[:]
>>> df = pd.DataFrame(a, columns=list('ABCDi'))
>>> df
         A        B        C         D          i
0  $100.50  $120.33  $135.88   $140.22  Project A
1  $899.25  $901.00  $923.43  $1002.02  Project B

Or without the project included:

>>> a = []
>>> with open('test.txt') as f:
...     prev_lines = []
...     for line in f:
...         prev_lines.append(line.strip('\n'))
...         if 'Project' in line:
...             a.append(prev_lines[-5:-1])
...             del prev_lines[:]
>>> df = pd.DataFrame(a, columns=list('ABCD'))
>>> df
         A        B        C         D
0  $100.50  $120.33  $135.88   $140.22
1  $899.25  $901.00  $923.43  $1002.02
