NLP - Sentence Segmentation

Posted 2019-09-15 08:34

Question:

I am a newbie trying my hand at sentence segmentation in NLP. I am aware that tokenizers for this are available in NLTK, but I wanted to build my own sentence segmenter using a machine learning algorithm such as a Decision Tree. However, I am not able to gather training data for it. What should the data look like, and how should it be labelled, since I want to try supervised learning first? Is any sample data already available? Any help will be useful. I have searched the net for nearly a week and am now posting here for help. Thanks in advance.

Answer 1:

As far as I know, sentence splitters are typically implemented as a hybrid of a set of rules (the punctuation characters to consider) and some automatically learnt weights (for exceptions such as abbreviations ending in a period, which don't act as a full stop). The weights can be learnt without supervision.
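
To make that concrete, here's a rough sketch of the rule-based half in Python. The abbreviation list is just made up for illustration; in a real splitter those exceptions (or their weights) would be learnt from a corpus rather than hand-listed:

import re

# A made-up abbreviation list; a real system would learn these exceptions
# (or weights for them) from data rather than hard-coding them.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc.", "e.g.", "i.e."}

def split_sentences(text):
    sentences, start = [], 0
    # Candidate boundary: ., ! or ? followed by whitespace and a capital letter.
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        end = match.start() + 1                      # keep the punctuation mark
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:              # "Dr." is not a full stop
            continue
        sentences.append(text[start:end].strip())
        start = match.end()
    sentences.append(text[start:].strip())
    return sentences

print(split_sentences("This is it! I'm leaving Dr. Smush in his box."))
# ['This is it!', "I'm leaving Dr. Smush in his box."]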

It's an interesting idea, however, to approach this with a plain ML-based system. For a supervised scheme, you could try a character-based sequence-labelling model with BIO labels. For example, your training data could look like this:

This is it! I'm leaving Dr. Smush in his box.
BIIIIIIIIIIOBIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

The predicted output will then also be BIIIIO..., and you'll have to split the original text at the characters labelled O. I'm not sure if this is the best approach, but if you try it, let me know how well it works. Make sure you use high-order character n-grams (3-, 4-, 5-grams or even higher), since these are characters, not word tokens.
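
Here's a toy sketch of that idea, assuming scikit-learn is available. It uses a simple window of surrounding characters as features (a stand-in for the higher-order n-grams mentioned above) and the Decision Tree you asked about; with a single training example it obviously won't generalise, it just shows the shape of the pipeline:

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Character-level BIO labels for already split sentences: 'B' for the first
# character of a sentence, 'I' for the rest, 'O' for the joining space.
def bio_labels(sentences):
    labels = []
    for k, sent in enumerate(sentences):
        if k:
            labels.append("O")
        labels.extend("B" + "I" * (len(sent) - 1))
    return labels

# Features for one position: a small window of the surrounding characters.
def char_features(text, i, window=3):
    return {f"char[{off}]": text[i + off] if 0 <= i + off < len(text) else "<PAD>"
            for off in range(-window, window + 1)}

sentences = ["This is it!", "I'm leaving Dr. Smush in his box."]
text = " ".join(sentences)

X = [char_features(text, i) for i in range(len(text))]
y = bio_labels(sentences)

vec = DictVectorizer()
clf = DecisionTreeClassifier().fit(vec.fit_transform(X), y)

# Label new text character by character and split wherever 'O' is predicted.
new_text = "He met Dr. Smush! It was fun."
pred = clf.predict(vec.transform([char_features(new_text, i) for i in range(len(new_text))]))
chunks, current = [], ""
for ch, tag in zip(new_text, pred):
    if tag == "O":
        chunks.append(current)
        current = ""
    else:
        current += ch
chunks.append(current)
print(chunks)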

As for the training data, you can use any linguistically annotated corpus, since they are all sentence-split (e.g. look at the ones included in NLTK). All you have to do is produce the BIO labels for training.
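
For instance, something along these lines would turn a sentence-split NLTK corpus into character-level BIO training data. Brown is used here only as an example, and joining tokens with spaces is a crude detokenisation, but it's good enough for a sketch:

import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)   # any sentence-split corpus would do

# Turn a sentence-split corpus into one long text plus character-level BIO labels.
def corpus_to_bio(sents, max_sents=1000):
    parts, labels = [], []
    for k, tokens in enumerate(sents[:max_sents]):
        sent = " ".join(tokens)       # crude detokenisation
        if k:
            parts.append(" ")
            labels.append("O")        # the space between two sentences
        parts.append(sent)
        labels.extend("B" + "I" * (len(sent) - 1))
    return "".join(parts), "".join(labels)

text, labels = corpus_to_bio(brown.sents())
print(text[:60])
print(labels[:60])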