I am trying to add an attention mechanism to the stacked LSTM implementation at https://github.com/salesforce/awd-lstm-lm
All the examples I have found online use an encoder-decoder architecture, which I do not want to use (is it required for an attention mechanism?).
Basically, I have adapted the attention implementation from https://github.com/zhedongzheng/finch/blob/master/nlp-models/pytorch/rnn_attn_text_clf.py and ended up with the following:
def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, dropouth=0.5, dropouti=0.5, dropoute=0.1, wdrop=0, tie_weights=False):
    super(RNNModel, self).__init__()
    self.encoder = nn.Embedding(ntoken, ninp)
    self.rnns = [torch.nn.LSTM(ninp if l == 0 else nhid, nhid if l != nlayers - 1 else (ninp if tie_weights else nhid), 1, dropout=0) for l in range(nlayers)]
    for rnn in self.rnns:
        rnn.linear = WeightDrop(rnn.linear, ['weight'], dropout=wdrop)
    self.rnns = torch.nn.ModuleList(self.rnns)
    self.attn_fc = torch.nn.Linear(ninp, 1)
    self.decoder = nn.Linear(nhid, ntoken)
    self.init_weights()

def attention(self, rnn_out, state):
    state = torch.transpose(state, 1, 2)
    weights = torch.bmm(rnn_out, state)
    weights = torch.nn.functional.softmax(weights)  # .squeeze(2)).unsqueeze(2)
    rnn_out_t = torch.transpose(rnn_out, 1, 2)
    bmmed = torch.bmm(rnn_out_t, weights)
    bmmed = bmmed.squeeze(2)
    return bmmed
def forward(self, input, hidden, return_h=False, decoder=False, encoder_outputs=None):
    emb = embedded_dropout(self.encoder, input, dropout=self.dropoute if self.training else 0)
    emb = self.lockdrop(emb, self.dropouti)

    new_hidden = []
    raw_outputs = []
    outputs = []
    for l, rnn in enumerate(self.rnns):
        temp = []
        for item in emb:
            item = item.unsqueeze(0)
            raw_output, new_h = rnn(item, hidden[l])
            raw_output = self.attention(raw_output, new_h[0])
            temp.append(raw_output)
        raw_output = torch.stack(temp)
        raw_output = raw_output.squeeze(1)
        new_hidden.append(new_h)
        raw_outputs.append(raw_output)
        if l != self.nlayers - 1:
            raw_output = self.lockdrop(raw_output, self.dropouth)
            outputs.append(raw_output)
    hidden = new_hidden

    output = self.lockdrop(raw_output, self.dropout)
    outputs.append(output)
    outputs = torch.stack(outputs).squeeze(0)
    outputs = torch.transpose(outputs, 2, 1)
    output = output.transpose(2, 1)
    output = output.contiguous()
    decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
    result = decoded.view(output.size(0), output.size(1), decoded.size(1))
    if return_h:
        return result, hidden, raw_outputs, outputs
    return result, hidden
This model trains, but the loss is quite high compared to the model without attention.
I understand your question, but it is a bit tough to follow your code and to find the reason why the loss is not decreasing. Also, it is not clear why you want to compare the last hidden state of the RNN with all the hidden states at every time step.
Please note that a particular trick/mechanism is only useful if you use it in the correct way. The way you are trying to use the attention mechanism does not look correct to me, so don't expect good results just because you added attention to your model. You should ask yourself: why would an attention mechanism be an advantage for your desired task?
You didn't clearly mention which task you are targeting. Since you have pointed to a repo that contains code for language modeling, I am guessing the task is: given a sequence of tokens, predict the next token.
One possible problem I can see in your code: inside the for item in emb: loop, you always use the embeddings as input to every LSTM layer, so having a stacked LSTM doesn't make sense; the layers never see each other's outputs.
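As a quick illustration, here is a rough sketch (reusing the names from your forward method and leaving the attention call aside) of how a stacked setup would feed each layer's output into the next layer:
layer_input = emb
for l, rnn in enumerate(self.rnns):
    # each layer should consume the previous layer's output, not the raw embeddings
    raw_output, new_h = rnn(layer_input, hidden[l])
    layer_input = raw_output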
Now, let me first answer your question and then show, step by step, how you can build your desired NN architecture.
Do I need to use an encoder-decoder architecture to use an attention mechanism?
The encoder-decoder architecture is better known as sequence-to-sequence learning, and it is widely used in many generation tasks, for example, machine translation. The answer to your question is no: you are not required to use any specific neural network architecture in order to use an attention mechanism.
The structure you presented is a little ambiguous but should be easy to implement. Since your implementation is not clear to me, let me guide you towards a better way of implementing it. For the following discussion, I am assuming we are dealing with text inputs.
Let's say we have an input of shape 16 x 10, where 16 is the batch_size and 10 is the seq_len. We can assume we have 16 sentences in a mini-batch, each of length 10.
import numpy as np
import torch
from torch.autograd import Variable

batch_size, vocab_size = 16, 100
mat = np.random.randint(vocab_size, size=(batch_size, 10))
input_var = Variable(torch.from_numpy(mat))
Here, 100 can be considered the vocabulary size. It is important to note that, throughout this example, I am assuming batch_size is the first dimension of all the respective tensors/variables.
Now, let's embed the input variable.
import torch.nn as nn

embedding = nn.Embedding(100, 50)
embed = embedding(input_var)
After embedding, we get a variable of shape 16 x 10 x 50, where 50 is the embedding size.
Now, let's define a 2-layer unidirectional LSTM with 100 hidden units at each layer.
rnns = nn.ModuleList()
nlayers, input_size, hidden_size = 2, 50, 100
for i in range(nlayers):
    input_size = input_size if i == 0 else hidden_size
    rnns.append(nn.LSTM(input_size, hidden_size, 1, batch_first=True))
Then, we can feed our input to this 2-layer LSTM to get the output.
import torch.nn.functional as F

sent_variable = embed
outputs, hid = [], []
for i in range(nlayers):
    if i != 0:
        sent_variable = F.dropout(sent_variable, p=0.3, training=True)
    output, hidden = rnns[i](sent_variable)
    outputs.append(output)
    hid.append(hidden[0].squeeze(0))
    sent_variable = output

rnn_out = torch.cat(outputs, 2)
hid = torch.cat(hid, 1)
Now, you can simply use hid to predict the next word. I would suggest you do that. Here, the shape of hid is batch_size x (num_layers * hidden_size).
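For completeness, a minimal sketch of that simpler, attention-free variant (simple_decoder is just an illustrative name, not something defined above):
simple_decoder = nn.Linear(nlayers * hidden_size, vocab_size)
simple_out = simple_decoder(hid)  # batch_size x vocab_size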
But since you want to use attention to compute a soft alignment score between the last hidden state and each of the hidden states produced by the LSTM layers, let's do the following.
sent_variable = embed
hid, con = [], []
for i in range(nlayers):
    if i != 0:
        sent_variable = F.dropout(sent_variable, p=0.3, training=True)
    output, hidden = rnns[i](sent_variable)
    sent_variable = output

    hidden = hidden[0].squeeze(0)  # batch_size x hidden_size
    hid.append(hidden)
    weights = torch.bmm(output[:, 0:-1, :], hidden.unsqueeze(2)).squeeze(2)
    soft_weights = F.softmax(weights, 1)  # batch_size x (seq_len - 1)
    context = torch.bmm(output[:, 0:-1, :].transpose(1, 2), soft_weights.unsqueeze(2)).squeeze(2)
    con.append(context)

hid, con = torch.cat(hid, 1), torch.cat(con, 1)
combined = torch.cat((hid, con), 1)
Here, we compute a soft alignment score between the last state and the states at every time step. Then we compute a context vector, which is just a linear combination of all the hidden states. Finally, we combine them to form a single representation.
Please note that I have dropped the last hidden state from output (hence output[:, 0:-1, :]), since you are comparing against the last hidden state itself.
The final combined representation stores the last hidden states and the context vectors produced at each layer. You can directly use this representation to predict the next word.
Predicting the next word is straightforward, and a simple linear layer, as you are already using, is just fine.
Edit: We can do the following to predict the next word.
decoder = nn.Linear(nlayers * hidden_size * 2, vocab_size)
dec_out = decoder(combined)
Here, the shape of dec_out is batch_size x vocab_size. Now we can compute the negative log-likelihood loss, which will be used for backpropagation later. Before computing it, we need to apply log_softmax to the output of the decoder.
dec_out = F.log_softmax(dec_out, 1)
target = np.random.randint(vocab_size, size=(batch_size))
target = Variable(torch.from_numpy(target))
We also defined the target, which is required to compute the loss (see NLLLoss for details). Now we can compute the loss as follows.
criterion = nn.NLLLoss()
loss = criterion(dec_out, target)
print(loss)
The printed loss value is:
Variable containing:
4.6278
[torch.FloatTensor of size 1]
Hope the entire explanation helps you!!
The whole point of attention is that word order in different languages is different, and thus, when decoding the 5th word in the target language, you might need to pay attention to the 3rd word (or its encoding) in the source language, because those are the words that correspond to each other. That is why you mostly see attention used with an encoder-decoder structure.
If I understand correctly, you are doing next-word prediction? In that case it might still make sense to use attention, because the next word might strongly depend on a word 4 steps in the past.
So basically what you need is:
rnn: takes in input of shape MB x ninp and hidden of shape MB x nhid, and outputs h of shape MB x nhid:
h, next_hidden = rnn(input, hidden)
attention: takes in the sequence of h's and the last h_last, and decides how important each of them is by giving each a weight w:
w = attention(hs, h_last)
where w is of shape seq_len x MB x 1, hs is of shape seq_len x MB x nhid, and h_last is of shape MB x nhid.
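For concreteness, here is one possible sketch of such an attention function, assuming a simple dot-product score (this scoring choice is an assumption; an additive/MLP score would work just as well):
import torch
import torch.nn.functional as F

def attention(hs, h_last):
    # hs: seq_len x MB x nhid, h_last: MB x nhid
    # dot-product score of every hidden state against the last one
    scores = torch.sum(hs * h_last.unsqueeze(0), dim=2)  # seq_len x MB
    w = F.softmax(scores, dim=0)                         # softmax over time steps
    return w.unsqueeze(2)                                # seq_len x MB x 1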
Now you weight the hs by w:
h_att = torch.sum(w * hs, dim=0)  # shape MB x nhid
Now, the point is that you need to do this at every time step:
h_att_list = []
h_list = []
hidden = hidden_init
for word in embedded_words:
    h, hidden = rnn(word, hidden)
    h_list.append(h)
    h_att = attention(torch.stack(h_list), h)
    h_att_list.append(h_att)
And then you can apply the decoder (which might need to be an MLP rather than just a linear transformation) on h_att_list.
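As a rough sketch of that last step, continuing from the loop above and assuming hypothetical sizes nhid and ntoken plus a small two-layer MLP:
import torch
import torch.nn as nn

# stack the attended vectors from the loop above: seq_len x MB x nhid
h_att = torch.stack(h_att_list)

# a small MLP decoder instead of a single linear layer
mlp_decoder = nn.Sequential(
    nn.Linear(nhid, nhid),
    nn.Tanh(),
    nn.Linear(nhid, ntoken),
)

# flatten time and batch, decode, then reshape back to seq_len x MB x ntoken
logits = mlp_decoder(h_att.view(-1, nhid))
logits = logits.view(h_att.size(0), h_att.size(1), ntoken)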