I am a newbie trying my hand at sentence segmentation in NLP. I am aware that tokenizers for this are available in NLTK, but I want to build my own sentence segmenter using a machine-learning algorithm such as a decision tree. However, I am not able to find training data for it. What should the data look like, and how should it be labelled? I want to try supervised learning first. Is any sample data already available? Any help will be useful; I searched the net for nearly a week before posting this question. Thanks in advance.
Answer 1:
As far as I know, sentence splitters are typically implemented as hybrids: a set of rules (the punctuation characters to consider) plus some automatically learnt weights (for exceptions, such as abbreviations ending in a period, which don't act as a full stop). The weights can be learnt without supervision.
It's an interesting idea, however, to approach this with a plain ML-based system. For a supervised scheme, you could try a character-based sequence-labelling model with BIO labels. For example, your training data could look like this:
```
This is it! I'm leaving Dr. Smush in his box.
BIIIIIIIIIIOBIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
```
The predicted output will then also be BIIIIO..., and you'll have to split the original text at the characters labelled O.
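The splitting step can be sketched as follows. This is a minimal sketch, assuming the labelling scheme above (B = first character of a sentence, I = inside, O = boundary character such as the space between sentences); the function name is my own invention:

```python
def split_at_o(text, labels):
    """Split text into sentences at characters labelled 'O'."""
    sentences, current = [], []
    for ch, lab in zip(text, labels):
        if lab == "O":
            # An O character (e.g. the inter-sentence space) is dropped
            # and closes the current sentence.
            if current:
                sentences.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        sentences.append("".join(current))
    return sentences

print(split_at_o("Hi there! Bye.", "BIIIIIIIIOBIII"))
# → ['Hi there!', 'Bye.']
```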
I'm not sure if this is the best approach, but if you try it, let me know if it's any good. Make sure you use n-grams of high orders (3-, 4-, 5-grams or even higher), since these are characters, not word tokens.
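Since the question mentions decision trees, here is a rough sketch of per-character classification with one. It assumes scikit-learn is installed; the window size (two characters on each side) and the one-hot character features are illustrative choices, not a recommendation — in practice you'd add the higher-order n-gram features mentioned above:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def char_features(text, i, window=2):
    """Features for the character at position i: the surrounding window."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        feats[f"char[{offset}]"] = text[j] if 0 <= j < len(text) else "<PAD>"
    return feats

# Tiny toy training set: one feature dict and one B/I/O label per character.
train_text   = "Hi there! Bye. Ok then! No."
train_labels = "BIIIIIIIIOBIIIOBIIIIIIIOBII"
X = [char_features(train_text, i) for i in range(len(train_text))]
y = list(train_labels)

clf = make_pipeline(DictVectorizer(), DecisionTreeClassifier())
clf.fit(X, y)

# Classify the space after "there!" in an unseen-looking string.
print(clf.predict([char_features("Hi there! Bye.", 9)]))
```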
As for the training data, you can use any linguistically annotated corpus, since they are all sentence-split (e.g. look at the ones included in NLTK).
All you have to do is produce the BIO labels for training.
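Producing those labels from an already sentence-split corpus is mechanical. A sketch, assuming you have the sentences as plain strings (e.g. joined from a corpus's `sents()` output) and join them with a single space; the function name is made up:

```python
def bio_from_sentences(sentences, sep=" "):
    """Join sentences into one text and emit one B/I/O label per character:
    B = first character of a sentence, I = inside, O = separator."""
    text_parts, labels = [], []
    for k, sent in enumerate(sentences):
        if k > 0:
            text_parts.append(sep)
            labels.append("O" * len(sep))
        text_parts.append(sent)
        labels.append("B" + "I" * (len(sent) - 1))
    return "".join(text_parts), "".join(labels)

text, labels = bio_from_sentences(
    ["This is it!", "I'm leaving Dr. Smush in his box."]
)
print(text)
print(labels)
```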