How to split Text into paragraphs using NLTK nltk.

2019-06-08 09:57发布

I found this Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling? explaining how to feed a text into texttiling, however I am unable to actually return a text tokenized by paragraph / topic change as shown here under texttiling http://www.nltk.org/api/nltk.tokenize.html.

When I feed my text into texttiling, I get the same untokenized text back, but as a list, which is of no use to me.

    tt = nltk.tokenize.texttiling.TextTilingTokenizer(w=20, k=10,similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)

    tiles = tt.tokenize(text) # same text returned

What I have are emails that follow this basic structure

    From: X
    To: Y                             (LOGISTICS)
    Date: 10/03/2017

    Hello team,                       (INTRO)

    Some text here representing
    the body                          (BODY)
    of the text.

    Regards,                          (OUTRO)
    X

    *****DISCLAIMER*****              (POST EMAIL DISCLAIMER)
    THIS EMAIL IS CONFIDENTIAL
    IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL

If we call this email string s, it would look like

    s = "From: X\nTo: Y\nDate: 10/03/2017 Hello team,\nSome text here representing the body of the text. Regards,\nX\n\n*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL\nIF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"

What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?

*** Not all emails follow this same structure or have the same wording, so I can't use regular expressions.

1条回答
家丑人穷心不美
2楼-- · 2019-06-08 10:14

What about using splitlines? Or do you have to use the nltk package?

email = """    From: X
    To: Y                             (LOGISTICS)
    Date: 10/03/2017

    Hello team,                       (INTRO)

    Some text here representing
    the body                          (BODY)
    of the text.

    Regards,                          (OUTRO)
    X

    *****DISCLAIMER*****              (POST EMAIL DISCLAIMER)
    THIS EMAIL IS CONFIDENTIAL
    IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""

y = [s.strip() for s in email.splitlines()]

print(y)
查看更多
登录 后发表回答