Stanford PTBTokenizer token's split delimiter

Posted 2019-09-03 00:05

Is there a way to give the PTBTokenizer a set of delimiter characters on which to split a token?

I was testing the behaviour of this tokenizer and realized that for some characters, such as the vertical bar '|', the tokenizer divides a substring into two tokens, while for others, such as the slash or the hyphen, it returns a single token.

Tags: tokenize stanford-nlp

1 answer

Rolldiameter

#2 · 2019-09-03 00:34

No, there isn't any simple way to do this with the PTBTokenizer. You can do some pre-processing and post-processing to get what you want, though there are two concerns worth mentioning:

1. All models distributed with CoreNLP are trained on the standard tokenizer behavior. If you change how the input to these later components is tokenized, there's no guarantee that these components will work predictably.
2. If you do enough pre- and post-processing (and aren't using any later components as mentioned in #1), it may be simpler to just steal the PTBTokenizer implementation and write your own.
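
As a rough illustration of the post-processing route, here is a minimal sketch in plain Java. It assumes you already have the tokenizer's output as a list of strings; the class `DelimiterSplitter` and the method `splitOnDelimiters` are hypothetical names and are not part of CoreNLP. It also assumes you want each delimiter kept as its own token, which may not match what you need.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of CoreNLP): re-splits tokens on extra delimiter characters. */
final class DelimiterSplitter {

    private DelimiterSplitter() {}

    /**
     * Splits every token on each character in {@code delimiters}.
     * Assumption: each delimiter is emitted as a separate token.
     * Remove the marked line to discard delimiters instead.
     */
    static List<String> splitOnDelimiters(List<String> tokens, Set<Character> delimiters) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            StringBuilder current = new StringBuilder();
            for (int i = 0; i < token.length(); i++) {
                char c = token.charAt(i);
                if (delimiters.contains(c)) {
                    // Flush the text collected before the delimiter, if any.
                    if (current.length() > 0) {
                        result.add(current.toString());
                        current.setLength(0);
                    }
                    result.add(String.valueOf(c)); // keeps the delimiter as a token
                } else {
                    current.append(c);
                }
            }
            // Flush whatever follows the last delimiter.
            if (current.length() > 0) {
                result.add(current.toString());
            }
        }
        return result;
    }
}
```

You would call it on the token strings you collected from the tokenizer, for example with `Set.of('/', '-')` as the delimiter set. Note that this sketch works on plain strings, so any character offsets attached to the original tokens are lost, and concern 1 above still applies if the result is fed to later CoreNLP components.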

(There is a similar question on customizing apostrophe tokenization behavior: Stanford coreNLP - split words ignoring apostrophe.)
