Split paragraph into sentences with titles and num

Posted 2019-05-07 02:38

Question:

I'm using the BreakIterator class in Java to break a paragraph into sentences. This is my code:

public Map<String, Double> breakSentence(String document) {
    Map<String, Double> sentences = new HashMap<String, Double>();
    BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
    bi.setText(document);

    Double tfIdf = 0.0;
    int start = bi.first();
    for(int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
        String sentence = document.substring(start, end);

        sentences.put(sentence, tfIdf);
    }

    return sentences;
}

The problem appears when the paragraph contains titles or numbers, for example:

"Prof. Roberts trying to solve a problem by writing a 1.200 lines of code."

What my code produces is:

sentences :
Prof
Roberts trying to solve a problem by writing a 1
200 lines of code

I expected one single sentence, but the periods in the title and the number cause extra breaks.

Is there a way to fix this to handle titles and numbers with Java?

Answer 1:

Well, this is a bit of a tricky situation, and my solution is a bit hacky, but it works nevertheless. I'm new to Java myself, so if a seasoned veteran wants to edit this or comment on it to make it more professional, by all means, please make me look better.

I basically added some checks to what you already have to detect titles such as Dr., Prof., Mr., or Mrs. When one turns up, the code skips that break, keeps the original start position, and moves on to the next break, looking for the next end (preferably one that doesn't come right after another Dr. or Mr.).

I'm including my complete program so you can see it all:

import java.text.BreakIterator;
import java.util.*;

public class TestCode {

    private static final String[] ABBREVIATIONS = {
        "Dr." , "Prof." , "Mr." , "Mrs." , "Ms." , "Jr." , "Ph.D."
    };

    public static void main(String[] args) throws Exception {

        String text = "Prof. Roberts and Dr. Andrews trying to solve a " +
                      "problem by writing a 1.200 lines of code. This will " +
                      "work if Mr. Java writes solid code.";

        for (String s : breakSentence(text)) {
            System.out.println(s);
        }
    }

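    public static List<String> breakSentence(String document) {

        List<String> sentenceList = new ArrayList<String>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(document);
        int start = bi.first();   // start of the current BreakIterator chunk
        int end = bi.next();      // end of the current BreakIterator chunk
        int tempStart = start;    // start of the sentence being assembled
        while (end != BreakIterator.DONE) {
            String sentence = document.substring(start, end);
            // Emit only when hasAbbreviation does not flag this chunk.
            if (! hasAbbreviation(sentence)) {
                sentence = document.substring(tempStart, end);
                tempStart = end;
                sentenceList.add(sentence);
            }
            start = end;
            end = bi.next();
        }
        // Keep any trailing text that was still being held back when the
        // input ended; without this it would be silently dropped.
        if (tempStart < document.length()) {
            sentenceList.add(document.substring(tempStart));
        }
        return sentenceList;
    }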

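    private static boolean hasAbbreviation(String sentence) {
        if (sentence == null || sentence.isEmpty()) {
            return false;
        }
        // Chunks keep their trailing whitespace, so trim before comparing.
        // Checking only the end of the chunk avoids holding back a complete
        // sentence that merely mentions a title somewhere in the middle.
        String trimmed = sentence.trim();
        for (String w : ABBREVIATIONS) {
            if (trimmed.endsWith(w)) {
                return true;
            }
        }
        return false;
    }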
}

What this does is set up two starting points. The original starting point (start, the one you used) still does the same thing, but tempStart doesn't move unless the text looks ready to be made into a sentence. It takes the first chunk:

"Prof."

and checks whether a title such as Prof. or Dr. could have caused that break. If so, tempStart doesn't move; it stays where it is and waits for the next chunk to come back. In my slightly more elaborate example, the next chunk also has a title messing up the breaks:

"Roberts and Dr."

It takes that chunk and, because it has a Dr. in it, continues on to the third chunk:

"Andrews trying to solve a problem by writing a 1.200 lines of code."

Once it reaches the third chunk, which has no title that could have caused a false break, it takes the text from tempStart (which is still at the beginning) to the current end, basically joining all three parts together.

Now it sets tempStart to the current end and continues.

Like I said, this may not be a glamorous way to get what you want, but nobody else volunteered, and it works. (shrug)
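For comparison, here is a more compact variant of the same joining idea. It is only a sketch under a few assumptions that are not in the answer above: the class name SentenceJoiner, the TITLES set, and the lastToken helper are made up for illustration, the title list is passed in as a parameter, and only the last word of the accumulated text is compared against it. It also trims each sentence and flushes any text still being held when the input ends.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SentenceJoiner {

    // Illustrative title list (an assumption); pass your own set to split().
    static final Set<String> TITLES =
        new HashSet<>(Arrays.asList("Dr.", "Prof.", "Mr.", "Mrs.", "Ms."));

    static List<String> split(String document, Set<String> abbreviations) {
        List<String> sentences = new ArrayList<>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(document);

        int sentenceStart = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; end = bi.next()) {
            String candidate = document.substring(sentenceStart, end).trim();
            // Hold the text back while its last word is a known abbreviation.
            if (abbreviations.contains(lastToken(candidate))) {
                continue;
            }
            if (!candidate.isEmpty()) {
                sentences.add(candidate);
            }
            sentenceStart = end;
        }
        // Flush text that was still being held when the input ran out.
        String rest = document.substring(sentenceStart).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    // Returns the characters after the last whitespace in text.
    private static String lastToken(String text) {
        int i = text.length();
        while (i > 0 && !Character.isWhitespace(text.charAt(i - 1))) {
            i--;
        }
        return text.substring(i);
    }

    public static void main(String[] args) {
        String text = "Prof. Roberts and Dr. Andrews trying to solve a "
                    + "problem by writing a 1.200 lines of code. This will "
                    + "work if Mr. Java writes solid code.";
        for (String s : split(text, TITLES)) {
            System.out.println(s);
        }
    }
}
```

The trade-off is the same as in the original: a sentence that genuinely ends with one of the listed words gets glued to the one after it, so the list is best kept to titles that normally precede a name.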



Answer 2:

It appears that Prof. Roberts only gets split if Roberts begins with a capital letter.

If Roberts begins with a lowercase r, it does not get split.

So... I guess that's how BreakIterator deals with periods.

I'm sure further reading of the documentation will explain how this behavior can be modified.
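A quick way to check this observation on your own JDK is to print the raw chunks for both spellings. This is just a probe, not a fix; the class name BoundaryProbe and the two sample strings are made up for illustration, and the exact boundaries may differ between Java versions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BoundaryProbe {

    // Returns the raw pieces BreakIterator produces, whitespace included.
    static List<String> chunks(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE;
                start = end, end = bi.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Prof. Roberts went home.",   // capital letter after the period
            "Prof. roberts went home."    // lowercase letter after the period
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + chunks(sample));
        }
    }
}
```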