TreebankLanguagePack function in Neural Network De

If I want to train the Stanford Neural Network Dependency Parser for another language, there is a need for a "treebankLanguagePack"(TLP) but the information about this TLP is very limited:

particularities of your treebank and the language it contains

If I have my "treebank" in another language that follows the same format as PTB, and my data is using CONLL format. The dependency format follows the "Universal Dependency" UD. Do I need this TLP?

标签： parsing neural-network stanford-nlp

1条回答

冷血范

2楼-- · 2019-04-02 15:21

As of the current CoreNLP release, the TreebankLanguagePack is used within the dependency parser only to 1) determine the input text encoding and 2) determine which tokens count as punctuation [1].

Your best bet for a quick solution, then, is probably to stick with the UD English TreebankLanguagePack. You should do this by specifying the property language as "UniversalEnglish" (whether you're accessing the dependency parser via code or command line). If you're using the dependency parser via the CoreNLP main entry point, this property key should be depparse.language.

Technical details

Two very subtle details follow. You probably don't need to worry about these if you're just trying to hack something together at first, but it's probably good to mention so that you can avoid apocalyptic / head-smashing bugs in the future.

Evaluation and punctuation: If you do choose to stick with UniversalEnglish, be aware that there is a hack in the evaluation code that overrides the punctuation set for English parsing in particular. Any changes you make to punctuation in PennTreebankLanguagePack (the TLP used for the UniversalEnglish language) will be ignored! If you need to get around this, it should be enough to copy and paste the PennTreebankLanguagePack into your own codebase and name it something different.
Potential memory leak: When building parse results to be returned to the user, the dependency parser draws from a pool of cached GrammaticalRelation objects. This cache does not live-update. This means that if you have relations which aren't formally defined in the language you specified via the language property, they will lead to the instantiation of a new object whenever those relations show up in parser predictions. (This can be a big deal memory-wise if you happen to store the parse objects somewhere.)

[1]: Punctuation is excluded during evaluation. This is a standard "cheat" used throughout the dependency parsing literature.

0人赞添加讨论(0) 举报

TreebankLanguagePack function in Neural Network De

Technical details

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间