I have a document with tagged data in the format: Hi here's my [KEYWORD phone number], let me know when you wanna hangout: [PHONE 7802708523]. I live in a [PROP_TYPE condo] in [CITY New York]. I want to train a model on a set of documents tagged like this, and then use the model to tag new documents. Is this possible in NLTK? I have looked at chunking and the NLTK-Trainer scripts, but these have a restricted set of tags and corpora, while my dataset has custom tags.
As @AleksandarSavkov wrote already, this is essentially a named entity recognition (NER) task, or more generally a chunking task, as you already realize. How to do it is covered nicely in chapter 7 of the NLTK book. I recommend you ignore the sections on regexp tagging and use the approach in section 3, "Developing and evaluating chunkers". It includes code samples you can use verbatim to create a chunker (the `ConsecutiveNPChunkTagger`); your responsibility is to select features that will give you good performance.

You'll need to transform your data into the IOB format expected by NLTK's architecture; it expects part-of-speech tags, so the first step should be to run your input through a POS tagger. `nltk.pos_tag()` will do a decent enough job (once you strip off markup like `[KEYWORD ...]`), and requires no additional software to be installed. When your corpus is in the following format (word, POS tag, IOB tag), you are ready to train a recognizer:
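A minimal sketch of that markup-to-IOB conversion (the regex and the `to_iob` helper are illustrative, assuming every annotation has the shape `[TAG tokens ...]`; requires the `punkt` and `averaged_perceptron_tagger` NLTK data packages):

```python
import re
import nltk

# Matches annotations such as [KEYWORD phone number] or [PHONE 7802708523].
TAG_RE = re.compile(r'\[([A-Z_]+) ([^\]]+)\]')

def to_iob(text):
    tokens, iob_tags = [], []

    def add(span, label=None):
        # Tokenize a stretch of text; tag it O, or B-/I-label if annotated.
        for i, tok in enumerate(nltk.word_tokenize(span)):
            tokens.append(tok)
            if label is None:
                iob_tags.append('O')
            else:
                iob_tags.append(('B-' if i == 0 else 'I-') + label)

    offset = 0
    for m in TAG_RE.finditer(text):
        add(text[offset:m.start()])        # untagged stretch
        add(m.group(2), label=m.group(1))  # annotated stretch
        offset = m.end()
    add(text[offset:])
    # Attach POS tags so each token becomes a (word, POS, IOB) triple.
    return [(word, pos, tag)
            for (word, pos), tag in zip(nltk.pos_tag(tokens), iob_tags)]

sample = ("Hi here's my [KEYWORD phone number], let me know when you wanna "
          "hangout: [PHONE 7802708523]. I live in a [PROP_TYPE condo] in "
          "[CITY New York].")
print(to_iob(sample))
```

Each triple comes out looking something like `('phone', 'NN', 'B-KEYWORD')`, which is the shape the NLTK book's chunker code expects.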
The problem you are looking to solve is most commonly called Named Entity Recognition (NER). There are many algorithms that can help you solve it, but the most important notion is that you need to convert your text data into a suitable format for sequence taggers. Here is an example of the BIO format:
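(The sample below renders the question's own sentence in BIO form; the tokenization is illustrative.)

```
Hi          O
here        O
's          O
my          O
phone       B-KEYWORD
number      I-KEYWORD
,           O
let         O
me          O
know        O
when        O
you         O
wanna       O
hangout     O
:           O
7802708523  B-PHONE
.           O
I           O
live        O
in          O
a           O
condo       B-PROP_TYPE
in          O
New         B-CITY
York        I-CITY
.           O
```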
From there, you can choose to train any type of classifier, such as Naive Bayes, SVM, MaxEnt, CRF, etc. Currently, the most popular algorithm for such multi-token sequence classification tasks is the conditional random field (CRF). There are available tools that will let you train a BIO model (although originally intended for chunking) from a file using the format shown above (e.g. YamCha, CRF++, CRFSuite, Wapiti). If you are using Python, you can look into scikit-learn, python-crfsuite and PyStruct in addition to NLTK.
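For instance, a minimal training-and-tagging sketch with python-crfsuite, assuming sentences are lists of (word, POS, IOB) triples as above (the feature set and model file name are arbitrary choices, not a fixed recipe):

```python
import pycrfsuite

def token_features(sent, i):
    # sent is a list of (word, POS, IOB-or-None) triples.
    word, pos, _ = sent[i]
    feats = ['bias',
             'word.lower=' + word.lower(),
             'word.isdigit=%s' % word.isdigit(),
             'pos=' + pos]
    if i > 0:
        feats += ['-1:word.lower=' + sent[i - 1][0].lower(),
                  '-1:pos=' + sent[i - 1][1]]
    else:
        feats.append('BOS')
    if i < len(sent) - 1:
        feats += ['+1:word.lower=' + sent[i + 1][0].lower(),
                  '+1:pos=' + sent[i + 1][1]]
    else:
        feats.append('EOS')
    return feats

def train(train_sents, model_path='custom_ner.crfsuite'):
    trainer = pycrfsuite.Trainer(verbose=False)
    for sent in train_sents:
        trainer.append([token_features(sent, i) for i in range(len(sent))],
                       [iob for _, _, iob in sent])
    # c1/c2 are the L1/L2 regularization weights; tune on held-out data.
    trainer.set_params({'c1': 1.0, 'c2': 1e-3, 'max_iterations': 100})
    trainer.train(model_path)

def tag(tagged_sent, model_path='custom_ner.crfsuite'):
    # tagged_sent is a list of (word, POS) pairs for unseen text.
    sent = [(w, p, None) for w, p in tagged_sent]
    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    return tagger.tag([token_features(sent, i) for i in range(len(sent))])
```

If you prefer a scikit-learn-style `fit`/`predict` interface, sklearn-crfsuite wraps the same library.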