I am trying to add an attention mechanism to the stacked LSTM implementation at https://github.com/salesforce/awd-lstm-lm
All examples online use an encoder-decoder architecture, which I do not want to use (do I have to for the attention mechanism?).
Basically, I have used https://webcache.googleusercontent.com/search?q=cache:81Q7u36DRPIJ:https://github.com/zhedongzheng/finch/blob/master/nlp-models/pytorch/rnn_attn_text_clf.py+&cd=2&hl=en&ct=clnk&gl=uk as a reference:
def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, dropouth=0.5, dropouti=0.5, dropoute=0.1, wdrop=0, tie_weights=False):
    super(RNNModel, self).__init__()
    self.encoder = nn.Embedding(ntoken, ninp)
    self.rnns = [torch.nn.LSTM(ninp if l == 0 else nhid,
                               nhid if l != nlayers - 1 else (ninp if tie_weights else nhid),
                               1, dropout=0) for l in range(nlayers)]
    for rnn in self.rnns:
        rnn.linear = WeightDrop(rnn.linear, ['weight'], dropout=wdrop)
    self.rnns = torch.nn.ModuleList(self.rnns)
    self.attn_fc = torch.nn.Linear(ninp, 1)
    self.decoder = nn.Linear(nhid, ntoken)
    self.init_weights()

def attention(self, rnn_out, state):
    state = torch.transpose(state, 1, 2)
    weights = torch.bmm(rnn_out, state)
    weights = torch.nn.functional.softmax(weights)
    rnn_out_t = torch.transpose(rnn_out, 1, 2)
    bmmed = torch.bmm(rnn_out_t, weights)
    bmmed = bmmed.squeeze(2)
    return bmmed

def forward(self, input, hidden, return_h=False, decoder=False, encoder_outputs=None):
    emb = embedded_dropout(self.encoder, input, dropout=self.dropoute if self.training else 0)
    emb = self.lockdrop(emb, self.dropouti)
    new_hidden = []
    raw_outputs = []
    outputs = []
    for l, rnn in enumerate(self.rnns):
        temp = []
        for item in emb:
            item = item.unsqueeze(0)
            raw_output, new_h = rnn(item, hidden[l])
            raw_output = self.attention(raw_output, new_h[0])
            temp.append(raw_output)
        raw_output = torch.stack(temp)
        raw_output = raw_output.squeeze(1)
        new_hidden.append(new_h)
        raw_outputs.append(raw_output)
        if l != self.nlayers - 1:
            raw_output = self.lockdrop(raw_output, self.dropouth)
            outputs.append(raw_output)
    hidden = new_hidden
    output = self.lockdrop(raw_output, self.dropout)
    outputs.append(output)
    outputs = torch.stack(outputs).squeeze(0)
    outputs = torch.transpose(outputs, 2, 1)
    output = output.transpose(2, 1)
    output = output.contiguous()
    decoded = self.decoder(output.view(output.size(0) * output.size(1), output.size(2)))
    result = decoded.view(output.size(0), output.size(1), decoded.size(1))
    if return_h:
        return result, hidden, raw_outputs, outputs
    return result, hidden
This model trains, but the loss is quite high compared to the model without attention.
I understand your question, but it is a bit tough to follow your code and find the reason why the loss is not decreasing. Also, it is not clear why you want to compare the last hidden state of the RNN with all the hidden states at every time step.
Please note, a particular trick/mechanism is useful only if you use it in the correct way. The way you are trying to use the attention mechanism, I am not sure it is the correct way. So don't expect that just because you are using attention in your model you will get good results! You should ask yourself: why would an attention mechanism bring an advantage to your desired task?
You didn't clearly mention what task you are targeting. Since you pointed to a repo which contains code for language modeling, I am guessing the task is: given a sequence of tokens, predict the next token.
One possible problem I can see in your code is: in the
for item in emb:
loop, you will always use the embeddings as input to each LSTM layer, so having a stacked LSTM doesn't make sense to me.
Now, let me first answer your question and then show step-by-step how you can build your desired NN architecture.
The encoder-decoder architecture is better known as sequence-to-sequence learning, and it is widely used in many generation tasks, for example machine translation. The answer to your question is no, you are not required to use any specific neural network architecture to use an attention mechanism.
The structure you presented in the figure is a little ambiguous but should be easy to implement. Since your implementation is not clear to me, I am trying to guide you to a better way of implementing it. For the following discussion, I am assuming we are dealing with text inputs.
Let's say we have an input of shape 16 x 10, where 16 is batch_size and 10 is seq_len. We can assume we have 16 sentences in a mini-batch and each sentence length is 10.
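For concreteness, such an input can be created like this (a minimal sketch; the variable names and the random token ids are just placeholders):

import torch

batch_size, seq_len, vocab_size = 16, 10, 100
# random token ids standing in for a mini-batch of 16 sentences of length 10
input_var = torch.randint(0, vocab_size, (batch_size, seq_len))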
Here, 100 can be considered as the vocabulary size. It is important to note that throughout the example I am providing, I am assuming batch_size as the first dimension in all respective tensors/variables.
Now, let's embed the input variable.
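Continuing the same sketch (emb_size is again a placeholder name):

emb_size = 50
embedding = torch.nn.Embedding(vocab_size, emb_size)
embedded = embedding(input_var)   # batch_size x seq_len x emb_size = 16 x 10 x 50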
After embedding, we get a variable of shape 16 x 10 x 50, where 50 is the embedding size.
Now, let's define a 2-layer unidirectional LSTM with 100 hidden units at each layer.
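For example, continuing the sketch:

hidden_size, num_layers = 100, 2
rnn = torch.nn.LSTM(emb_size, hidden_size, num_layers, batch_first=True)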
Then, we can feed our input to this 2-layer LSTM to get the output.
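A sketch of that step, also flattening the per-layer last hidden states into a single hid vector (the names output, h_n and hid are my own):

output, (h_n, c_n) = rnn(embedded)
# output : batch_size x seq_len x hidden_size      (top-layer state at every time step)
# h_n    : num_layers x batch_size x hidden_size   (last hidden state of every layer)
hid = h_n.transpose(0, 1).contiguous().view(batch_size, -1)   # batch_size x (num_layers*hidden_size)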
Now, you can simply use hid to predict the next word. I would suggest you do that. Here, the shape of hid is batch_size x (num_layers*hidden_size).
But since you want to use attention to compute soft alignment scores between the last hidden states and each hidden state produced by the LSTM layers, let's do this.
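One way to do it, continuing the sketch above (here each layer's last hidden state h_n[l] attends over the top layer's outputs; adapt as needed):

import torch.nn.functional as F

seq_states = output[:, 0:-1, :]              # batch_size x (seq_len-1) x hidden_size
combined = []
for l in range(num_layers):
    last_h = h_n[l]                          # batch_size x hidden_size
    # soft alignment score between this layer's last state and every earlier state
    scores = torch.bmm(seq_states, last_h.unsqueeze(2))                # batch_size x (seq_len-1) x 1
    alpha = F.softmax(scores, dim=1)
    # context vector: a linear combination of the hidden states
    context = torch.bmm(seq_states.transpose(1, 2), alpha).squeeze(2)  # batch_size x hidden_size
    combined.append(torch.cat([last_h, context], dim=1))
combined = torch.cat(combined, dim=1)        # batch_size x (num_layers * 2 * hidden_size)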
Here, we compute a soft alignment score between the last state and the states of each time step. Then we compute a context vector, which is just a linear combination of all the hidden states. We combine them to form a single representation.
Please note, I have removed the last hidden state from output (output[:, 0:-1, :]) since you are comparing against the last hidden state itself.
The final combined representation stores the last hidden states and the context vectors produced at each layer. You can directly use this representation to predict the next word.
Predicting the next word is straightforward, and a simple linear layer is just fine.
Edit: We can do the following to predict the next word.
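For instance (decoder and dec_out are placeholder names):

decoder = torch.nn.Linear(combined.size(1), vocab_size)   # in features = num_layers * 2 * hidden_size
dec_out = decoder(combined)                               # batch_size x vocab_size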
Here, the shape of dec_out is batch_size x vocab_size. Now, we can compute the negative log-likelihood loss, which will be used to backpropagate later.
Before computing the negative log-likelihood loss, we need to apply log_softmax to the output of the decoder. We also need to define the target, which is required to compute the loss (see NLLLoss for details). So, now we can compute the loss as follows.
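Continuing the sketch, with a random dummy target:

log_probs = F.log_softmax(dec_out, dim=1)             # batch_size x vocab_size
target = torch.randint(0, vocab_size, (batch_size,))  # index of the next word for each sentence
criterion = torch.nn.NLLLoss()
loss = criterion(log_probs, target)
print(loss.item())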
Printing the loss confirms that the whole pipeline runs end-to-end; the exact value will depend on the random initialization.
Hope the entire explanation helps you!!
The whole point of attention is that word order in different languages is different, and thus when decoding the 5th word in the target language you might need to pay attention to the 3rd word (or its encoding) in the source language, because these are the words which correspond to each other. That is why you mostly see attention used with an encoder-decoder structure.
If I understand correctly, you are doing next-word prediction? In that case it might still make sense to use attention, because the next word might highly depend on a word 4 steps in the past.
So basically what you need is:
rnn: which takes in input of shape MB x ninp and hidden of shape MB x nhid, and outputs h of shape MB x nhid.
attention: which takes in the sequence of h's and the last h_last, and decides how important each of them is by giving each a weight w, where w is of shape seq_len x MB x 1, hs is of shape seq_len x MB x nhid, and h_last is of shape MB x nhid.
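One possible way to compute such a w with a simple dot-product score (a minimal, self-contained sketch with dummy tensors of the shapes above):

import torch
import torch.nn.functional as F

seq_len, MB, nhid = 10, 16, 100
hs = torch.randn(seq_len, MB, nhid)   # all hidden states so far
h_last = torch.randn(MB, nhid)        # the most recent hidden state

# dot-product score between h_last and every hidden state in hs
scores = torch.bmm(hs.permute(1, 0, 2),      # MB x seq_len x nhid
                   h_last.unsqueeze(2))      # MB x nhid x 1
w = F.softmax(scores, dim=1)                 # MB x seq_len x 1, normalized over time
w = w.permute(1, 0, 2)                       # seq_len x MB x 1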
Now you weight the hs by w:
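For example, with the tensors from the previous sketch:

# broadcasting multiplies each hidden state by its weight;
# summing over the time dimension gives one attended vector per batch element
h_att = (w * hs).sum(dim=0)   # MB x nhid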
Now the point is that you need to do that for every time step:
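A self-contained sketch of that loop, using torch.nn.LSTMCell and a placeholder h_att_list (a stacked or weight-dropped LSTM would follow the same pattern):

import torch
import torch.nn.functional as F

seq_len, MB, ninp, nhid = 10, 16, 50, 100
inputs = torch.randn(seq_len, MB, ninp)        # dummy input sequence

rnn = torch.nn.LSTMCell(ninp, nhid)
h = torch.zeros(MB, nhid)
c = torch.zeros(MB, nhid)

hs = []          # hidden states seen so far
h_att_list = []  # one attention-weighted vector per time step
for t in range(seq_len):
    h, c = rnn(inputs[t], (h, c))
    hs.append(h)
    stacked = torch.stack(hs)                            # (t+1) x MB x nhid
    scores = torch.bmm(stacked.permute(1, 0, 2),         # MB x (t+1) x nhid
                       h.unsqueeze(2))                   # MB x nhid x 1
    w = F.softmax(scores, dim=1).permute(1, 0, 2)        # (t+1) x MB x 1
    h_att_list.append((w * stacked).sum(dim=0))          # MB x nhid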
And then you can apply the decoder (which might need to be an MLP rather than just a linear transformation) on h_att_list.