Initializing Out of Vocabulary (OOV) tokens

Posted 2019-07-30 18:29

I am building a TensorFlow model for an NLP task, using the pretrained GloVe 300d word-vector (embedding) dataset.

Obviously, some tokens can't be resolved to embeddings because they were not in the training data of the word-embedding model, e.g. rare names.

I can replace those tokens with all-zero vectors, but rather than dropping this information on the floor, I would prefer to encode it somehow and include it in my training data.

Say I have the word 'raijin', which can't be resolved to an embedding vector. What would be the best way to encode it consistently with the GloVe embeddings, i.e. to convert it to a 300d vector?

Thank you.

Tags: tensorflow embedding word-embedding

2 answers

看我几分像从前

Reply #2 · 2019-07-30 19:13

It's worth having a look at this EMNLP paper on handling OOV tokens by generating embeddings for them:

Mimicking Word Embeddings using Subword RNNs


Summer. ? 凉城

Reply #3 · 2019-07-30 19:28

Instead of mapping all out-of-vocabulary tokens to a single shared UNK vector (e.g. all zeros), it is better to assign each one its own random vector. That way, at least, each OOV token stays distinct when you compute its similarity to any other word, and the model can learn something from it. With a shared UNK vector they are all identical, so every unknown word is treated as having the same context.

I tried this approach and got a 3% accuracy improvement on the Quora Duplicate question pair detection dataset using an LSTM model.
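
A minimal sketch of the per-token random vector idea, in NumPy. The answer does not say how the vectors were generated, so the details here are assumptions: the helper names `oov_vector` and `build_embedding_matrix` are hypothetical, the vector is seeded from a hash of the token so that 'raijin' gets the same vector on every run, and `scale=0.1` is an arbitrary value that would ideally be matched to the spread of the pretrained vectors.

```python
import hashlib

import numpy as np

EMBEDDING_DIM = 300  # matches the GloVe 300d vectors from the question


def oov_vector(token, dim=EMBEDDING_DIM, scale=0.1):
    """Return a reproducible pseudo-random vector for an OOV token."""
    # Python's built-in hash() is salted per process, so derive the seed
    # from a stable digest of the token instead.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "little")
    rng = np.random.default_rng(seed)
    # scale is an assumed value, not something taken from GloVe itself.
    return rng.uniform(-scale, scale, size=dim).astype(np.float32)


def build_embedding_matrix(vocab, pretrained, dim=EMBEDDING_DIM):
    """Build a (len(vocab), dim) matrix.

    vocab: list of tokens, where the row index is the token id.
    pretrained: dict mapping token -> vector of length dim.
    """
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for i, token in enumerate(vocab):
        vec = pretrained.get(token)
        # Known tokens keep their pretrained vector; OOV tokens get
        # their own deterministic random vector.
        matrix[i] = vec if vec is not None else oov_vector(token, dim)
    return matrix
```

The resulting matrix could then be used to initialize an embedding layer, for example through `tf.keras.layers.Embedding` with `embeddings_initializer=tf.keras.initializers.Constant(matrix)`. Whether to keep the OOV rows trainable is a separate design choice.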
