Question:
I'm using NLTK's PUNKT sentence tokenizer to split a file into a list of sentences, and would like to preserve the empty lines within the file:
from nltk import data
tokenizer = data.load('tokenizers/punkt/english.pickle')
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
sentences = tokenizer.tokenize(s)
print(sentences)
I would like this to print:
['That was a very loud beep.\n\n', "I don't even know\n if this is working.", 'Mark?\n\n', 'Mark are you there?\n\n\n']
But the content that's actually printed shows that the trailing empty lines have been removed from the first and third sentences:
['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n']
Other tokenizers in NLTK have a blanklines='keep' parameter, but I don't see any such option in the case of the Punkt tokenizer. It's very possible I'm missing something simple. Is there a way to retain these trailing empty lines using the Punkt sentence tokenizer? I'd be grateful for any insights others can offer!
Answer 1:
The problem
Sadly, you can't make the tokenizer keep the blank lines, not with the way it is written.
Starting in the Punkt tokenizer's source and following the function calls through span_tokenize() and _slices_from_text(), you can see there is a condition
if match.group('next_tok'):
that is designed to ensure the tokenizer skips whitespace until the next possible sentence-starting token occurs. Looking for the regex this refers to, we end up at _period_context_fmt, where we see that the next_tok named group is preceded by \s+, so blank lines are never captured.
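You can confirm this yourself by printing the template and the compiled pattern from a stock PunktLanguageVars instance (a quick check; the attribute and method names below are those of NLTK 3.x):
import nltk.tokenize.punkt as pkt
print(pkt.PunktLanguageVars._period_context_fmt)            # note the \s+ before next_tok
print(pkt.PunktLanguageVars().period_context_re().pattern)  # the fully substituted regex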
The solution
Break it down, change the part that you don't like, reassemble your custom solution.
Now this regex is in the PunktLanguageVars class, itself used to initialize the PunktSentenceTokenizer class. We just have to derive a custom class from PunktLanguageVars and fix the regex the way we want it to be.
The fix we want is to include trailing newlines at the end of a sentence, so I suggest replacing _period_context_fmt, going from this:
_period_context_fmt = r"""
\S* # some word material
%(SentEndChars)s # a potential sentence ending
(?=(?P<after_tok>
%(NonWord)s # either other punctuation
|
\s+(?P<next_tok>\S+) # or whitespace and some other token
))"""
to this:
_period_context_fmt = r"""
\S* # some word material
%(SentEndChars)s # a potential sentence ending
\s* # <-- THIS is what I changed
(?=(?P<after_tok>
%(NonWord)s # either other punctuation
|
(?P<next_tok>\S+) # <-- Normally you would have \s+ here
))"""
Now a tokenizer using this regex instead of the old one will include zero or more \s characters after the end of a sentence.
The whole script
import nltk.tokenize.punkt as pkt

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                          # <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)        # <-- Normally you would have \s+ here
        ))"""

custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"

print(custom_tknzr.tokenize(s))
This outputs:
['That was a very loud beep.\n\n ', "I don't even know\n if this is working. ", 'Mark?\n\n ', 'Mark are you there?\n\n\n']
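A convenient side effect of keeping the whitespace is that, at least with the NLTK version this answer was written against, the pieces concatenate back into the original string, which gives a cheap sanity check (a small sketch reusing custom_tknzr and s from the script above):
spans = custom_tknzr.tokenize(s)
print("".join(spans) == s)   # expect True: no characters dropped between sentences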
Answer 2:
Split the raw input string into paragraphs, splitting on a capturing regexp (which returns the captured separators as well):
import re
paras = re.split(r"(\n\s*\n)", s)   # s is the raw text from the question, not the tokenized list
You can then apply nltk.sent_tokenize() to the individual paragraphs, and process the results by paragraph or flatten the list, whatever best suits your further use.
sents_by_para = [ nltk.sent_tokenize(p) for p in paras ]
flat = [ sent for par in sents_by_para for sent in par ]
(It seems that sent_tokenize() doesn't mangle whitespace-only strings, so there's no need to check and exclude them from processing.)
If you specifically want to have the whitespace attached to the previous sentence, you can easily stick it back on:
collapsed = []
for s in flat:
    if s.isspace() and len(collapsed) > 0:
        collapsed[-1] += s
    else:
        collapsed.append(s)
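Putting the three steps together on the string from the question might look like the sketch below; the paragraph split itself is deterministic, but the exact sentence splits depend on your NLTK version and data:
import re
import nltk

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"

# The paragraph split keeps the blank-line separators as list items:
# ['That was a very loud beep.', '\n\n', " I don't even know\n if this is working. Mark?",
#  '\n\n', ' Mark are you there?', '\n\n\n', '']
paras = re.split(r"(\n\s*\n)", s)

sents_by_para = [nltk.sent_tokenize(p) for p in paras]
flat = [sent for par in sents_by_para for sent in par]

collapsed = []
for sent in flat:
    if sent.isspace() and collapsed:
        collapsed[-1] += sent
    else:
        collapsed.append(sent)
print(collapsed)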
Answer 3:
I would go with itertools.groupby, see Python: How to loop through blocks of lines:
alvas@ubi:~$ echo """This is a foo bar sentence,
that is also a foo bar sentence.
But I don't like foobars.
Yes you do like bars with foos, no?
I'm not sure whether you like bar bar!
Neither do I like black sheep.""" > test.in
alvas@ubi:~$ python
>>> from nltk import sent_tokenize
>>> import itertools
>>> with open('test.in', 'r') as fin:
... for key, group in itertools.groupby(fin, lambda x: x!='\n'):
... if key:
... print list(group)
...
['This is a foo bar sentence,\n', 'that is also a foo bar sentence.\n']
["But I don't like foobars.\n", 'Yes you do like bars with foos, no?\n']
["I'm not sure whether you like bar bar!\n", 'Neither do I like black sheep.\n']
And after that if you want to do a sent_tokenize
or other punkt models within the group:
>>> with open('test.in', 'r') as fin:
... for key, group in itertools.groupby(fin, lambda x: x!='\n'):
... if key:
... paragraph = " ".join(line.strip() for line in group)
... print sent_tokenize(paragraph)
...
['This is a foo bar sentence, that is also a foo bar sentence.']
["But I don't like foobars.", 'Yes you do like bars with foos, no?']
["I'm not sure whether you like bar bar!", 'Neither do I like black sheep.']
(Note: a more computationally efficient method would be to use mmap, see https://stackoverflow.com/a/3915398/610569. But for the size I work on (~20 million tokens), itertools.groupby was sufficient.)
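For reference, the same session rewritten for Python 3 (a sketch; it assumes the same test.in file created above and the standard punkt data):
import itertools
from nltk import sent_tokenize

with open('test.in', 'r') as fin:
    for key, group in itertools.groupby(fin, lambda line: line != '\n'):
        if key:  # key is True for a run of non-blank lines, i.e. one paragraph
            paragraph = " ".join(line.strip() for line in group)
            print(sent_tokenize(paragraph))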
Answer 4:
In the end, I combined insights from both @alexis and @HugoMailhot so that I could preserve linebreaks in cases where a single paragraph has multiple sentences and/or linebreaks:
import re, nltk, sys, codecs
import nltk.tokenize.punkt as pkt
from nltk import data

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                          # <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)        # <-- Normally you would have \s+ here
        ))"""

custom_tokenizer = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

def sentence_split(s):
    '''Read in a string and return a list of sentences with linebreaks intact.'''
    paras = re.split(r"(\n\s*\n)", s)
    sents_by_para = [custom_tokenizer.tokenize(p) for p in paras]
    flat = [sent for par in sents_by_para for sent in par]

    collapsed = []
    for s in flat:
        if s.isspace() and len(collapsed) > 0:
            collapsed[-1] += s
        else:
            collapsed.append(s)
    return collapsed

if __name__ == "__main__":
    s = codecs.open(sys.argv[1], 'r', 'utf-8').read()
    sentences = sentence_split(s)
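For a quick check without reading a file, you could also call sentence_split() directly on the string from the question (the exact splits depend on your NLTK version and data):
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
for sent in sentence_split(s):
    print(repr(sent))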