If I want to train the Stanford Neural Network Dependency Parser for another language, I need a "TreebankLanguagePack" (TLP), but the information about this TLP is very limited:
particularities of your treebank and the language it contains
My treebank is in another language but follows the same format as the PTB. The data is in CoNLL format, and the dependency annotations follow Universal Dependencies (UD). Do I need this TLP?
As of the current CoreNLP release, the TreebankLanguagePack is used within the dependency parser only to 1) determine the input text encoding and 2) determine which tokens count as punctuation [1].
Your best bet for a quick solution, then, is probably to stick with the UD English TreebankLanguagePack. You should do this by specifying the property language as "UniversalEnglish" (whether you're accessing the dependency parser via code or command line). If you're using the dependency parser via the CoreNLP main entry point, this property key should be depparse.language.
Technical details
Two very subtle details follow. You probably don't need to worry about these if you're just trying to hack something together at first, but they're worth mentioning so that you can avoid apocalyptic / head-smashing bugs in the future.
… PennTreebankLanguagePack (the TLP used for the UniversalEnglish language) will be ignored! If you need to get around this, it should be enough to copy and paste the PennTreebankLanguagePack into your own codebase and name it something different.
… GrammaticalRelation objects. This cache does not live-update. This means that if you have relations which aren't formally defined in the language you specified via the language property, they will lead to the instantiation of a new object whenever those relations show up in parser predictions. (This can be a big deal memory-wise if you happen to store the parse objects somewhere.)
[1]: Punctuation is excluded during evaluation. This is a standard "cheat" used throughout the dependency parsing literature.
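To make the recommended setting concrete, here is a minimal Java sketch of the two property keys mentioned above: plain language when driving the dependency parser directly, and depparse.language when going through the CoreNLP main entry point. The class name DepparseLanguageConfig, the method names, and the annotator list are my own assumptions for illustration. The sketch uses only java.util.Properties so that it runs without CoreNLP on the classpath.

```java
import java.util.Properties;

// Hypothetical helper class (not part of CoreNLP) that only builds the
// property sets described in the answer.
class DepparseLanguageConfig {

    // For the CoreNLP main entry point the key carries the "depparse." prefix.
    static Properties pipelineProperties() {
        Properties props = new Properties();
        // Assumed annotator list; adjust it to what your pipeline needs.
        props.setProperty("annotators", "tokenize,ssplit,pos,depparse");
        props.setProperty("depparse.language", "UniversalEnglish");
        return props;
    }

    // When using the dependency parser directly, the key is plain "language".
    static Properties parserProperties() {
        Properties props = new Properties();
        props.setProperty("language", "UniversalEnglish");
        return props;
    }

    public static void main(String[] args) {
        // Print both property sets so the keys and values can be inspected.
        pipelineProperties().list(System.out);
        parserProperties().list(System.out);
    }
}
```

With CoreNLP available, the first object is the kind of Properties instance you would hand to new StanfordCoreNLP(props). You would presumably also have to point the pipeline at your own trained model, which this sketch leaves out.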