I'm using the BreakIterator class in Java to break a paragraph into sentences. This is my code:
public Map<String, Double> breakSentence(String document) {
    sentences = new HashMap<String, Double>();
    BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
    bi.setText(document);
    Double tfIdf = 0.0;
    int start = bi.first();
    for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
        String sentence = document.substring(start, end);
        sentences.put(sentence, tfIdf);
    }
    return sentences;
}
The problem arises when the paragraph contains titles or numbers, for example:
"Prof. Roberts trying to solve a problem by writing a 1.200 lines of code."
What my code produces is:
sentences :
Prof
Roberts trying to solve a problem by writing a 1
200 lines of code
instead of one single sentence, because of the periods in the title and the number.
Is there a way to fix this so that titles and numbers are handled correctly in Java?
It appears that "Prof. Roberts" only gets split if "Roberts" begins with a capital letter. If "Roberts" begins with a lowercase r, it does not get split. So... I guess that's how BreakIterator deals with periods. I'm sure further reading of the documentation will explain how this behavior can be modified.
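One way to check this observation is to print the chunks BreakIterator returns for the two spellings. This is only a minimal sketch: the class name BreakDemo, the helper chunks, and the two test strings are made up for illustration, and the exact splits may depend on the JDK's break rules, so no expected output is shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakDemo {
    // Returns the raw pieces that the sentence BreakIterator cuts the text into.
    static List<String> chunks(String text) {
        List<String> parts = new ArrayList<>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Same sentence, differing only in the case of the letter after "Prof. "
        System.out.println(chunks("Prof. Roberts is here."));
        System.out.println(chunks("Prof. roberts is here."));
    }
}
```

Comparing the two printed lists shows whether the capital letter after the period is what triggers the break on your JDK.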
Well, this is a bit of a tricky situation, and I've come up with a sticky solution, but it works nevertheless. I'm new to Java myself, so if a seasoned veteran wants to edit this or comment on it to make it more professional, by all means, please make me look better.
I basically added some control measures to what you already have, checking whether a chunk contains a title such as Dr., Prof., Mr., Mrs., etc. If it does, the code skips that break and moves on to the next one (keeping the original start position), looking for the NEXT end (preferably one that doesn't fall right after another Dr. or Mr.).
I'm including my complete program so you can see it all:
What this does is basically set up two starting points. The original starting point (the one you used) still does the same thing, but tempStart doesn't move unless the string looks ready to be made into a sentence. It takes the first sentence:
and checks whether it broke because of a weird word (i.e., does the chunk contain Prof., Dr., or whatever else might have caused that break). If it does, tempStart doesn't move; it stays there and waits for the next chunk to come back. In my slightly more elaborate sentence, the next chunk also has a weird word messing up the breaks:
It takes that chunk and, because it has a Dr. in it, continues on to the third chunk of the sentence:
Once it reaches the third chunk, which was broken off without any weird titles that may have caused a false break, it takes everything from tempStart (which is still at the beginning) to the current end, basically joining all three parts together.
Now it sets tempStart to the current 'end' and continues.
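The two-start-point idea described above can be sketched roughly as follows. This is not the original program, just a minimal illustration under some assumptions: the class name SentenceSplitter, the helper endsWithAbbreviation, and the abbreviation list are all made up here, and the check looks only at the end of each chunk rather than anywhere inside it. It does nothing special for numbers.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {
    // Hypothetical list of titles; extend it as needed.
    private static final List<String> ABBREVIATIONS =
            Arrays.asList("Prof.", "Dr.", "Mr.", "Mrs.", "Ms.");

    // True if the chunk stops right after one of the known titles.
    static boolean endsWithAbbreviation(String chunk) {
        String trimmed = chunk.trim();
        for (String abbr : ABBREVIATIONS) {
            if (trimmed.equals(abbr) || trimmed.endsWith(" " + abbr)) {
                return true;
            }
        }
        return false;
    }

    static List<String> splitSentences(String document) {
        List<String> result = new ArrayList<>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(document);
        int tempStart = bi.first(); // start of the sentence being assembled
        int start = tempStart;      // start of the current raw chunk
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String chunk = document.substring(start, end);
            if (endsWithAbbreviation(chunk)) {
                continue; // false break: keep tempStart and wait for the next chunk
            }
            result.add(document.substring(tempStart, end).trim());
            tempStart = end; // the next sentence starts here
        }
        // Keep any leftover text if the document ended on a title.
        String rest = document.substring(tempStart).trim();
        if (!rest.isEmpty()) {
            result.add(rest);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitSentences("Prof. Roberts met Dr. Smith today. They wrote code."));
    }
}
```

Checking only the end of the chunk avoids gluing two real sentences together just because a title appears somewhere in the middle of the first one.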
Like I said, this may not be a glamorous way to get what you want, but nobody else volunteered and it works (shrug).