I have a document with tagged data in the format: Hi here's my [KEYWORD phone number], let me know when you wanna hangout: [PHONE 7802708523]. I live in a [PROP_TYPE condo] in [CITY New York]. I want to train a model on a set of documents tagged like this, and then use the model to tag new documents. Is this possible in NLTK? I have looked at chunking and the NLTK-Trainer scripts, but these have a restricted set of tags and corpora, while my dataset has custom tags.
As @AleksandarSavkov wrote already, this is essentially a named entity recognition (NER) task, or more generally a chunking task, as you already realize. How to do it is covered nicely in chapter 7 of the NLTK book. I recommend you ignore the sections on regexp tagging and use the approach in section 3, "Developing and evaluating chunkers". It includes code samples you can use verbatim to create a chunker (the `ConsecutiveNPChunkTagger`); your responsibility is to select features that will give you good performance.

You'll need to transform your data into the IOB format expected by NLTK's architecture; it expects part-of-speech tags, so the first step should be to run your input through a POS tagger. `nltk.pos_tag()` will do a decent enough job (once you strip off markup like `[KEYWORD ...]`), and requires no additional software to be installed. When your corpus is in the following format (word, POS tag, IOB tag), you are ready to train a recognizer:
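A minimal sketch of that markup-to-IOB conversion (the regex and the `to_iob` helper are illustrative, assuming every annotation has the shape `[TAG tokens ...]`; requires the `punkt` and `averaged_perceptron_tagger` NLTK data packages):

```python
import re
import nltk

# Matches annotations such as [KEYWORD phone number] or [PHONE 7802708523].
TAG_RE = re.compile(r'\[([A-Z_]+) ([^\]]+)\]')

def to_iob(text):
    tokens, iob_tags = [], []

    def add(span, label=None):
        # Tokenize a stretch of text; tag it O, or B-/I-label if annotated.
        for i, tok in enumerate(nltk.word_tokenize(span)):
            tokens.append(tok)
            if label is None:
                iob_tags.append('O')
            else:
                iob_tags.append(('B-' if i == 0 else 'I-') + label)

    offset = 0
    for m in TAG_RE.finditer(text):
        add(text[offset:m.start()])        # untagged stretch
        add(m.group(2), label=m.group(1))  # annotated stretch
        offset = m.end()
    add(text[offset:])
    # Attach POS tags so each token becomes a (word, POS, IOB) triple.
    return [(word, pos, tag)
            for (word, pos), tag in zip(nltk.pos_tag(tokens), iob_tags)]

sample = ("Hi here's my [KEYWORD phone number], let me know when you wanna "
          "hangout: [PHONE 7802708523]. I live in a [PROP_TYPE condo] in "
          "[CITY New York].")
print(to_iob(sample))
```

Each triple comes out looking something like `('phone', 'NN', 'B-KEYWORD')`, which is the shape the NLTK book's chunker code expects.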
The problem you are looking to solve is most commonly called Named Entity Recognition (NER). There are many algorithms that can help you solve it, but the most important notion is that you need to convert your text data into a suitable format for sequence taggers. Here is an example of the BIO format:
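(The sample below renders the question's own sentence in BIO form; the tokenization is illustrative.)

```
Hi          O
here        O
's          O
my          O
phone       B-KEYWORD
number      I-KEYWORD
,           O
let         O
me          O
know        O
when        O
you         O
wanna       O
hangout     O
:           O
7802708523  B-PHONE
.           O
I           O
live        O
in          O
a           O
condo       B-PROP_TYPE
in          O
New         B-CITY
York        I-CITY
.           O
```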
From there, you can choose to train any type of classifier, such as Naive Bayes, SVM, MaxEnt, CRF, etc. Currently, the most popular algorithm for such multi-token sequence classification tasks is the conditional random field (CRF). There are available tools that will let you train a BIO model (although originally intended for chunking) from a file using the format shown above (e.g. YamCha, CRF++, CRFSuite, Wapiti). If you are using Python, you can look into scikit-learn, python-crfsuite and PyStruct in addition to NLTK.
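For instance, a minimal training-and-tagging sketch with python-crfsuite, assuming sentences are lists of (word, POS, IOB) triples as above (the feature set and model file name are arbitrary choices, not a fixed recipe):

```python
import pycrfsuite

def token_features(sent, i):
    # sent is a list of (word, POS, IOB-or-None) triples.
    word, pos, _ = sent[i]
    feats = ['bias',
             'word.lower=' + word.lower(),
             'word.isdigit=%s' % word.isdigit(),
             'pos=' + pos]
    if i > 0:
        feats += ['-1:word.lower=' + sent[i - 1][0].lower(),
                  '-1:pos=' + sent[i - 1][1]]
    else:
        feats.append('BOS')
    if i < len(sent) - 1:
        feats += ['+1:word.lower=' + sent[i + 1][0].lower(),
                  '+1:pos=' + sent[i + 1][1]]
    else:
        feats.append('EOS')
    return feats

def train(train_sents, model_path='custom_ner.crfsuite'):
    trainer = pycrfsuite.Trainer(verbose=False)
    for sent in train_sents:
        trainer.append([token_features(sent, i) for i in range(len(sent))],
                       [iob for _, _, iob in sent])
    # c1/c2 are the L1/L2 regularization weights; tune on held-out data.
    trainer.set_params({'c1': 1.0, 'c2': 1e-3, 'max_iterations': 100})
    trainer.train(model_path)

def tag(tagged_sent, model_path='custom_ner.crfsuite'):
    # tagged_sent is a list of (word, POS) pairs for unseen text.
    sent = [(w, p, None) for w, p in tagged_sent]
    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    return tagger.tag([token_features(sent, i) for i in range(len(sent))])
```

If you prefer a scikit-learn-style `fit`/`predict` interface, sklearn-crfsuite wraps the same library.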