The docs for an Embedding Layer in Keras say:
Turns positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
I believe this could also be achieved by encoding the inputs as one-hot vectors of length vocabulary_size and feeding them into a Dense Layer.
Is an Embedding Layer merely a convenience for this two-step process, or is something fancier going on under the hood?
Mathematically, the difference is this:
An embedding layer performs a select operation: each integer index picks one row out of a single weight matrix, which is what a gather does.
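A minimal NumPy sketch of that lookup, not the actual Keras source. The function name `embedding_forward` and the toy 5-word vocabulary are assumptions made for illustration:

```python
import numpy as np

def embedding_forward(embeddings, indices):
    """Return one row of `embeddings` per integer index (a gather)."""
    return embeddings[np.asarray(indices)]

# Hypothetical weight matrix: 5 vocabulary entries, 2 dimensions each.
# Row i is the vector for index i, e.g. row 4 is [8., 9.].
embeddings = np.arange(10, dtype=np.float32).reshape(5, 2)

# A batch of 2 sequences, each containing a single index.
vectors = embedding_forward(embeddings, [[4], [1]])
# vectors has shape (2, 1, 2): batch, sequence length, embedding size
```

The only parameter is the weight matrix itself; no multiplication, bias, or activation is involved.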
A dense layer performs a dot-product operation with a kernel matrix, adds a bias vector, and applies an optional activation.
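For comparison, a rough NumPy sketch of a dense layer's forward pass. Again, this is an illustration under assumed names (`dense_forward`, `relu`, and the toy kernel and bias values), not the Keras implementation:

```python
import numpy as np

def dense_forward(inputs, kernel, bias=None, activation=None):
    """Compute activation(inputs @ kernel + bias)."""
    outputs = inputs @ kernel          # dot product with the kernel matrix
    if bias is not None:
        outputs = outputs + bias       # optional bias vector
    if activation is not None:
        outputs = activation(outputs)  # optional activation function
    return outputs

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical layer: 3 input features, 2 output units.
kernel = np.array([[1.0, -1.0], [0.5, 2.0], [-3.0, 0.0]])
bias = np.array([0.5, -0.5])
x = np.array([[1.0, 2.0, 0.0]])

y = dense_forward(x, kernel, bias, relu)
```

Here every input feature takes part in a multiply-add, whereas the embedding lookup touches only the selected rows.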
You can emulate an embedding layer with a fully-connected layer via one-hot encoding, but the whole point of a dense embedding is to avoid the one-hot representation. In NLP, the vocabulary size can be on the order of 100k words (sometimes even a million). On top of that, sequences of words usually need to be processed in batches. Processing a batch of sequences of word indices is much more efficient than processing a batch of sequences of one-hot vectors. In addition, the gather operation itself is faster than a matrix dot product, in both the forward and the backward pass.
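The equivalence, and the size blow-up that one-hot encoding causes, can be checked with a small NumPy experiment. The sizes and the names `weights`, `rng`, and `size_ratio` are assumptions chosen for illustration, and the sketch omits bias and activation on the dense side:

```python
import numpy as np

vocabulary_size, embedding_dim = 1000, 8
rng = np.random.default_rng(0)
weights = rng.normal(size=(vocabulary_size, embedding_dim))

# A batch of 2 sequences, each 3 word indices long.
indices = np.array([[4, 20, 7], [999, 0, 4]])

# Embedding-style: gather rows directly.
gathered = weights[indices]                  # shape (2, 3, 8)

# Dense-style: build one-hot vectors, then multiply by the same matrix.
one_hot = np.eye(vocabulary_size)[indices]   # shape (2, 3, 1000)
projected = one_hot @ weights                # shape (2, 3, 8)

same = np.allclose(gathered, projected)

# The one-hot input holds vocabulary_size values per index.
size_ratio = one_hot.size // indices.size
```

Both paths produce the same vectors, but the one-hot input is `vocabulary_size` times larger than the index input, and the matrix product spends almost all of its work multiplying by zeros.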