The docs for an Embedding Layer in Keras say:
Turns positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
I believe this could also be achieved by encoding the inputs as one-hot vectors of length vocabulary_size and feeding them into a Dense layer.
Is an Embedding Layer merely a convenience for this two-step process, or is something fancier going on under the hood?
Mathematically, the difference is this:
An embedding layer performs a select (lookup) operation. In Keras, this layer is equivalent to:
    K.gather(self.embeddings, inputs)   # just one weight matrix; rows are looked up by index
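For example, with a toy embedding matrix (the numbers are made up for illustration):

    import tensorflow as tf

    # Toy embedding matrix: 5 "words", each mapped to a 2-dimensional vector
    embeddings = tf.constant([[0.25,  0.10],
                              [0.60, -0.20],
                              [0.15,  0.35],
                              [0.40,  0.05],
                              [0.90, -0.50]])

    # The "select": pull whole rows out by integer index, no multiplication involved
    print(tf.gather(embeddings, [4, 2]).numpy())   # rows 4 and 2 of the matrix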
A dense layer performs a dot-product operation, plus an optional bias and activation:
    outputs = K.dot(inputs, self.kernel)      # kernel: a weight matrix
    outputs = K.bias_add(outputs, self.bias)  # bias: a vector
    return self.activation(outputs)           # an activation function
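Concretely, the question's intuition is right in the exact case: a bias-free, linear Dense layer fed one-hot vectors reproduces the Embedding lookup. Here is a minimal sketch, assuming TensorFlow 2.x and tf.keras (the sizes and indices are made up for illustration):

    import numpy as np
    import tensorflow as tf
    from tensorflow import keras

    vocab_size, output_dim = 10, 3
    indices = np.array([[4], [7]])                    # a batch of two word indices

    # Embedding: a lookup into a (vocab_size, output_dim) matrix
    embedding = keras.layers.Embedding(vocab_size, output_dim)
    emb_out = embedding(indices)                      # shape (2, 1, 3)

    # Dense: no bias, linear activation, fed with one-hot vectors
    dense = keras.layers.Dense(output_dim, use_bias=False, activation=None)
    dense.build((None, vocab_size))
    dense.set_weights(embedding.get_weights())        # share the same weight matrix
    one_hot = tf.one_hot(indices[:, 0], vocab_size)   # shape (2, 10)
    dense_out = dense(one_hot)                        # shape (2, 3)

    # Both routes pick out the same rows of the shared weight matrix
    np.testing.assert_allclose(emb_out.numpy()[:, 0, :], dense_out.numpy(), rtol=1e-5)

Note that the Dense route only matches when the bias is dropped, the activation is linear, and the input is exactly one-hot; the Embedding layer skips the matrix multiplication entirely and just gathers rows.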
You can emulate an embedding layer with a fully-connected layer via one-hot encoding, but the whole point of a dense embedding is to avoid the one-hot representation. In NLP, the vocabulary size can be on the order of 100k (sometimes even a million). On top of that, sequences of words usually have to be processed in batches, and a batch of sequences of word indices is far more efficient to handle than a batch of sequences of one-hot vectors. In addition, the gather operation itself is faster than a matrix dot-product, in both the forward and the backward pass.
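As a rough illustration of the memory argument (the batch size, sequence length, and vocabulary size below are made-up numbers):

    import numpy as np

    batch_size, seq_len, vocab_size = 32, 100, 100_000

    # A batch of word indices: one int32 per token
    index_bytes = batch_size * seq_len * np.dtype(np.int32).itemsize

    # The same batch as one-hot vectors: vocab_size float32 values per token
    one_hot_bytes = batch_size * seq_len * vocab_size * np.dtype(np.float32).itemsize

    print(index_bytes)     # 12800         (~12.5 KB)
    print(one_hot_bytes)   # 1280000000    (~1.2 GB)

And computationally, gathering a row costs on the order of output_dim operations per token, while the one-hot dot-product naively touches the entire (vocab_size, output_dim) kernel.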