How to handle <ukn> tokens in text generation

Posted 2019-09-23 07:03

In my text-generation dataset, I have converted all infrequent words into the <ukn> token (unknown word), as suggested by most of the text-generation literature.
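
For context, the preprocessing is roughly the following (a minimal sketch; the min_count threshold and all names are illustrative, not my actual code):

```python
from collections import Counter

def replace_rare_words(sentences, min_count=5, unk="<ukn>"):
    """Replace words seen fewer than min_count times with the unknown token."""
    counts = Counter(word for sent in sentences for word in sent)
    return [[word if counts[word] >= min_count else unk for word in sent]
            for sent in sentences]

# On a real corpus, frequent words survive the cutoff and rare ones
# (e.g. words appearing fewer than 5 times) are replaced by <ukn>.
```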

However, when training an RNN to take part of a sentence as input and predict the rest, I am not sure how to stop the network from generating <ukn> tokens. When the network encounters an unknown (infrequent) word in the training set, what should its output be?

Example:

```
Sentence:               I went to the mall and bought a <ukn> and some groceries
Network input:          I went to the mall and bought a
Current network output: <ukn> and some groceries
Desired network output: ??? and some groceries
```

What should it output instead of the <ukn>?

I don't want to build a generator that outputs words it does not know.

2 Answers
混吃等死
#2 · 2019-09-23 07:32

I've seen <UNK> occasionally, but never <UKN>.

Even more common in word-embedding training is dropping rare words entirely, to keep vocabularies compact and to prevent words without sufficient examples from acting as 'noise' in the training of other words. (Folding them all into a single magic unknown token – which then becomes more frequent than most real tokens! – just tends to throw a big unnatural pseudo-word with no clear meaning into every other word's contexts.)
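
A sketch of that drop-rare-words-entirely preprocessing (analogous to the min_count cutoff in typical word2vec implementations; all names here are illustrative):

```python
from collections import Counter

def drop_rare_words(sentences, min_count=5):
    """Discard words below the frequency cutoff instead of folding them
    into an unknown token, so they never enter the vocabulary at all."""
    counts = Counter(word for sent in sentences for word in sent)
    return [[word for word in sent if counts[word] >= min_count]
            for sent in sentences]
```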

So, I'm not sure it's accurate to describe this as "suggested by most text-generation literature". And to the extent it might be, wouldn't any source suggesting this also suggest what to do when the prediction is the <UNK> token?

If your specific application needs a real known word in every position, even when the NN has low confidence that the right word is any known word, it would seem you'd just read the next-best non-<UKN> prediction from the NN, as suggested by @petezurich's answer.
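
Concretely, that could look something like this (a sketch assuming the network's final softmax output is a numpy array aligned with a vocabulary list; the names are made up):

```python
import numpy as np

def best_known_word(probs, vocab, unk="<ukn>"):
    """Pick the highest-probability next word, skipping the unknown token."""
    probs = probs.copy()
    probs[vocab.index(unk)] = -np.inf  # mask <ukn> so argmax never picks it
    return vocab[int(np.argmax(probs))]
```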

做自己的国王
#3 · 2019-09-23 07:37

An RNN gives you a probability distribution over the tokens most likely to appear next in your text. In your code you choose the token with the highest probability, in this case <ukn>.

In that case you can simply omit the <ukn> token and take the next most likely token that the RNN suggests, based on the probability values it produces.
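
If you are sampling rather than taking the argmax, the same idea applies: zero out the <ukn> probability and renormalize before drawing. A sketch under the same assumptions as above (numpy probabilities aligned with a vocabulary list; the names are illustrative):

```python
import numpy as np

def sample_next_token(probs, vocab, unk="<ukn>", rng=None):
    """Sample the next token with the <ukn> mass removed and redistributed."""
    rng = rng or np.random.default_rng()
    probs = probs.copy()
    probs[vocab.index(unk)] = 0.0
    probs = probs / probs.sum()        # renormalize the remaining probabilities
    return vocab[rng.choice(len(vocab), p=probs)]
```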
