Print first paragraph in Python

Posted 2019-04-07 14:14

I have a book in a text file and I need to print the first paragraph of each section. I thought that if I found the text between \n\n and \n, I would have my answer. Here is my code, which didn't work. Can you tell me where I went wrong?

lines = [line.rstrip('\n') for line in open('G:\\aa.txt')]

check = -1
first = 0
last = 0

for i in range(len(lines)):
    if lines[i] == "": 
            if lines[i+1]=="":
                check = 1
                first = i +2
    if i+2< len(lines):
        if lines[i+2] == "" and check == 1:
            last = i+2
while (first < last):
    print(lines[first])
    first = first + 1

I also found some code on Stack Overflow and tried it too, but it just printed an empty array.

f = open("G:\\aa.txt").readlines()
flag=False
for line in f:
        if line.startswith('\n\n'):
            flag=False
        if flag:
            print(line)
        elif line.strip().endswith('\n'):
            flag=True

I have shared a sample section of the book below.

I

THE LAY OF THE LAND

There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.

Of all the kinds of interest attaching to the study of the world's wild animals, there are none that surpass the study of their minds, their morals, and the acts that they perform as the results of their mental processes.

II

WILD ANIMAL TEMPERAMENT & INDIVIDUALITY

What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.

The output should look like this:

There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.

What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.

5 answers
ゆ 、 Hurt°
#2 · 2019-04-07 14:44

If you want to group the sections you can use itertools.groupby using empty lines as the delimiters:

from itertools import groupby
with open("in.txt") as f:
    for k, sec in groupby(f,key=lambda x: bool(x.strip())):
        if k:
            print(list(sec))

With some more itertools foo we can get the sections using the uppercase title as the delimiter:

from itertools import groupby, takewhile

with open("in.txt") as f:
    grps = groupby(f,key=lambda x: x.isupper())
    for k, sec in grps:
        # if we hit a title line
        if k: 
            # pull all paragraphs
            v = next(grps)[1]
            # skip two empty lines after title
            next(v,""), next(v,"")

            # take all lines up to next empty line/second paragraph
            print(list(takewhile(lambda x: bool(x.strip()), v)))

Which would give you:

['There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.\n']
['What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.']

Each section starts with an all-uppercase title, so once we hit one we know that two empty lines follow, then the first paragraph, and then the pattern repeats.

To break it down using loops:

from itertools import groupby
def parse_sec(bk):
    with open(bk) as f:
        grps = groupby(f, key=lambda x: bool(x.isupper()))
        for k, sec in grps:
            if k:
                print("First paragraph from section titled :{}".format(next(sec).rstrip()))
                v = next(grps)[1]
                next(v, ""),next(v,"")
                for line in v:
                    if not line.strip():
                        break
                    print(line)

For your text:

In [11]: cat -E in.txt

THE LAY OF THE LAND$
$
$
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.$
$
Of all the kinds of interest attaching to the study of the world's wild animals, there are none that surpass the study of their minds, their morals, and the acts that they perform as the results of their mental processes.$
$
$
WILD ANIMAL TEMPERAMENT & INDIVIDUALITY$
$
$
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.

The dollar signs mark the line endings. The output is:

In [12]: parse_sec("in.txt")
First paragraph from section titled :THE LAY OF THE LAND
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.

First paragraph from section titled :WILD ANIMAL TEMPERAMENT & INDIVIDUALITY
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.
在下西门庆
#3 · 2019-04-07 14:48

Go over the code you have found, line by line.

f = open("G:\\aa.txt").readlines()
flag=False
for line in f:
        if line.startswith('\n\n'):
            flag=True
        if flag:
            print(line)
        elif line.strip().endswith('\n'):
            flag=True

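It never sets the flag variable to True: readlines() returns each line with a single trailing '\n', so startswith('\n\n') never matches, and strip() removes that newline before endswith('\n') is tested.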
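A quick check makes this visible. The snippet below is only a sketch: it uses a small in-memory sample (a made-up stand-in for the book, not the real file) and evaluates both conditions for every line.

```python
import io

# Hypothetical stand-in for the book file; io.StringIO mimics open(...).
sample = io.StringIO("I\n\nTHE LAY OF THE LAND\n\nThere is a vast field.\n")
lines = sample.readlines()

# Each element ends with at most one '\n', so no line can start with '\n\n'.
starts = [line.startswith('\n\n') for line in lines]

# strip() drops the trailing newline, so endswith('\n') is always False.
ends = [line.strip().endswith('\n') for line in lines]

print(any(starts), any(ends))  # False False
```

Since neither test is ever true, nothing in the loop can switch the flag on, and the print branch is never reached.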

Also, if you can share a sample from your book, that would be more helpful for everyone.

▲ chillily
#4 · 2019-04-07 14:51

There's always regex....

import re
with open("in.txt", "r") as fi:
    data = fi.read()
paras = re.findall(r"""
                   [IVXLCDM]+\n\n   # Line of Roman numeral characters
                   [^a-z]+\n\n      # Line without lower case characters
                   (.*?)\n          # First paragraph line
                   """, data, re.VERBOSE)
print("\n\n".join(paras))
趁早两清
#5 · 2019-04-07 15:02

This should work, as long as there are no paragraphs with all caps:

f = open('file.txt')

for line in f:
    line = line.strip()
    if line:
        for c in line:
            if c < 'A' or c > 'Z':  # check for non-uppercase chars
                break
        else:  # no break: the line is all caps (I, II, etc.), meaning a new section
            f.readline()  # discard the blank line, the chapter title and the next blank line
            f.readline()
            f.readline()
            print(f.readline().rstrip())  # print first paragraph

f.close()

If you want the last paragraph too, keep track of the most recent line that contained lowercase characters. As soon as you find an all-uppercase line (I, II, etc.), indicating a new section, print that tracked line, since it is the last paragraph of the previous section.
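That idea could be sketched roughly as follows. This is not part of the original answer: the function name first_and_last_paragraphs and the sample lines are made up for illustration, and the code assumes the question's layout (a Roman-numeral line, a blank line, an upper-case title, a blank line, then one paragraph per line).

```python
def first_and_last_paragraphs(lines):
    """Return one (first, last) paragraph pair per section."""
    sections = []
    first = last = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank separator lines
        if all('A' <= c <= 'Z' for c in line):
            # All-caps line without spaces (I, II, ...): a new section starts,
            # so whatever we tracked belongs to the previous section.
            if first is not None:
                sections.append((first, last))
            first = last = None
        elif any(c.islower() for c in line):
            # A paragraph line: remember the first one, keep updating the last.
            if first is None:
                first = line
            last = line
        # Title lines (upper case with spaces or '&') are simply ignored.
    if first is not None:
        sections.append((first, last))  # flush the final section
    return sections


# Hypothetical sample in the same shape as the book excerpt.
sample = [
    "I\n", "\n", "THE LAY OF THE LAND\n", "\n",
    "First paragraph of one.\n", "\n", "Second paragraph of one.\n", "\n",
    "II\n", "\n", "WILD ANIMAL TEMPERAMENT & INDIVIDUALITY\n", "\n",
    "Only paragraph of two.\n",
]
result = first_and_last_paragraphs(sample)
for first, last in result:
    print(first)
    print(last)
```

Collecting the pairs in a list instead of printing inside the loop makes it easy to flush the final section, which has no following Roman numeral to trigger the print.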

迷人小祖宗
#6 · 2019-04-07 15:06

TXR solution

$ txr firstpar.txr data
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.

Code in firstpar.txr:

@(repeat)
@num

@title

@firstpar
@  (require (and (< (length num) 5)
                 [some title chr-isupper]
                 (not [some title chr-islower])))
@  (do (put-line firstpar))
@(end)

Basically, we search the input for a match of the three-element multi-line pattern that binds the num, title and firstpar variables. This pattern, as such, can match in the wrong places, so we add some constraining heuristics with a require assertion: the section number must be a short line, and the title line must contain some upper-case letters and no lower-case ones. The require expression is written in TXR Lisp.

If we get a match with this constraint then we output the string captured in the firstpar variable.
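For readers who do not use TXR, the same heuristics can be approximated in Python. The following is a loose sketch, not an exact port of TXR's matching semantics: the function name first_paragraphs and the sample lines are assumptions made for illustration.

```python
def first_paragraphs(lines):
    """Scan for the shape num / blank / title / blank / firstpar."""
    found = []
    i = 0
    while i + 5 <= len(lines):
        num, blank1, title, blank2, firstpar = lines[i:i + 5]
        if (blank1 == "" and blank2 == ""
                and len(num) < 5                          # short section number
                and any(c.isupper() for c in title)       # some upper-case letters
                and not any(c.islower() for c in title)):  # no lower-case letters
            found.append(firstpar)
            i += 5  # continue after the matched block
        else:
            i += 1  # no match here: slide down one line
    return found


# Hypothetical sample; trailing newlines are assumed to be stripped already.
sample = [
    "I", "", "THE LAY OF THE LAND", "",
    "First paragraph of one.", "", "Second paragraph of one.", "",
    "II", "", "WILD ANIMAL TEMPERAMENT & INDIVIDUALITY", "",
    "First paragraph of two.",
]
paragraphs = first_paragraphs(sample)
print("\n".join(paragraphs))
```

The length check on num does the same job as in the TXR version: without it, a paragraph followed by "II" could be mistaken for a section number followed by a title.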
