I have to analyze informal English text containing a lot of shorthand and local lingo, so I was thinking of training my own model for the Stanford tagger.
How do I create my own labelled corpus for the Stanford tagger to train on?
What is the syntax of the corpus, and how large should it be to achieve reasonable performance?
To train the PoS tagger, see this mailing list post, which is also included in the JavaDocs for the MaxentTagger class.
The JavaDocs for the edu.stanford.nlp.tagger.maxent.Train class specify the training format:
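For reference, training is usually driven by a properties file. The property names below (model, trainFile, arch, tagSeparator) are standard MaxentTagger options, but the values are placeholders you would replace with your own paths and feature architecture; the training file itself is typically one sentence per line, with each token written as word plus tag joined by the tagSeparator character:

```
## Sketch of a MaxentTagger training properties file (values are placeholders)
model = my-model.tagger
trainFile = my-corpus.tagged
arch = words(-1,1),order(2),suffix(4)
tagSeparator = _
```

You would then run the tagger class with -props pointing at this file.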
For the Stanford Parser, you use Penn treebank format; see Stanford's FAQ for the exact commands to use. The JavaDocs for the LexicalizedParser class also give appropriate commands, particularly:
I tried: java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] \ -train trainFilesPath fileRange -saveToSerializedFile serializedGrammarFilename
But I had the error:
Error: Could not find or load main class edu.stanford.nlp.parser.lexparser.LexicalizedParser
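That error usually means the JVM cannot find the parser classes, not that the training options are wrong: the Stanford Parser jar is missing from the classpath. A sketch of the fix, assuming the jar is in the current directory and named stanford-parser.jar (adjust the path and filename to match your download):

```
java -cp stanford-parser.jar -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] \
  -train trainFilesPath fileRange -saveToSerializedFile serializedGrammarFilename
```

Running from a different directory works too, as long as -cp points at the full path to the jar.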
Essentially, the text files that you prepare for the training process should have one token per line, followed by a tab, followed by an identifier. The identifier may be something like "LOC" for a location, "COR" for a corporation, or "0" for a non-entity token. E.g.
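A few lines of such a training file might look like the following (the tokens here are purely illustrative; each token and its label are separated by a tab):

```
Joe	0
visited	0
Stanford	LOC
in	0
September	0
```

Sentences are separated by blank lines, and every token in the file must carry a label, with "0" marking everything that is not an entity.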
When our team trained a series of classifiers, we fed each a training file formatted like this with roughly 180,000 tokens, and we saw a net improvement in precision but a net decrease in recall. (It bears noting that the increase in precision was not statistically significant.) In case it might be useful to others, I described the process we used to train the classifier, as well as the precision, recall, and F1 values of both the trained and default classifiers, here.