I am trying to highlight the important words in the IMDB dataset that ultimately contributed to the sentiment analysis prediction.
The dataset looks like this:
X_train - a review as a string
Y_train - 0 or 1
After embedding the X_train values with GloVe embeddings, I can feed them to a neural net.
Now my question is: how can I highlight the most important words, probability-wise, just like deepmoji.mit.edu does?
What I have tried:
I tried splitting the input sentences into bigrams and training a 1D CNN on them. Later, when we want to find the important words of X_test, we just split X_test into bigrams and find their probabilities (roughly sketched below). It works, but it is not accurate.
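A simplified sketch of what I mean (the model, the sizes, and the bigram_scores helper are only illustrative, not my exact code):
# Train a 1D CNN with kernel size 2 (~ bigrams) on padded index sequences,
# then probe each bigram of a test review by scoring it on its own.
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

dictionary_size = 1000
model = Sequential([
    Embedding(dictionary_size, 10),
    Conv1D(16, 2, activation='relu'),
    GlobalMaxPooling1D(),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
# model.fit(X_train_padded, Y_train, ...)

def bigram_scores(sequence):
    # Score each bigram of an encoded review by feeding it alone to the model.
    return [float(model.predict(np.array([sequence[i:i + 2]]))[0, 0])
            for i in range(len(sequence) - 1)]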
I tried using a prebuilt Hierarchical Attention Network and succeeded. I got what I wanted, but I couldn't figure out every line and concept in the code. It's like a black box to me.
I know how a neural net works, and I can code one from scratch in numpy with manual backpropagation. I have detailed knowledge of how an LSTM works and what the forget, update, and output gates actually output. But I still couldn't figure out how to extract attention weights and how to shape the data as a 3D array (what is the timestep in our 2D data?).
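For concreteness, here is a minimal example of how I understand the shapes (the sizes and reviews below are made up); is the padded review length the timestep dimension?
import numpy as np
from keras.preprocessing.sequence import pad_sequences

# Each review is already encoded as a list of word indexes; pad to equal length.
encoded_reviews = [[12, 7, 33], [5, 99, 7, 41, 2]]
X = pad_sequences(encoded_reviews, maxlen=6)
print(X.shape)  # (2, 6) -> (samples, timesteps)

# An Embedding layer then turns this 2D array into a 3D one of shape
# (samples, timesteps, embedding_dim), e.g. (2, 6, 10), which is what an LSTM
# consumes: one word index per timestep.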
So, any type of guidance is welcome.
Here is a version with attention (not hierarchical), but you should be able to figure out how to make it work with hierarchy too; if not, I can help out. The trick is to define two models and use one for training (model) and the other to extract attention values (model_with_attention_output):
# Tensorflow 1.9; Keras 2.2.0 (latest versions)
# should be backwards compatible upto Keras 2.0.9 and tf 1.5
from keras.models import Model
from keras.layers import *
from keras import backend as K  # needed for K.sum in the Lambda layer below
import numpy as np
dictionary_size=1000
def create_models():
    # Get a sequence of indexes of words as input:
    # Keras supports dynamic input lengths if you provide (None,) as the
    # input shape
    inp = Input((None,))
    # Embed words into vectors of size 10 each:
    # Output shape is (None,10)
    embs = Embedding(dictionary_size, 10)(inp)
    # Run LSTM on these vectors and return output on each timestep
    # Output shape is (None,5)
    lstm = LSTM(5, return_sequences=True)(embs)
    ## Attention Block
    # Transform each timestep into 1 value (attention_value)
    # Output shape is (None,1)
    attention = TimeDistributed(Dense(1))(lstm)
    # By running softmax on axis 1 we force attention_values
    # to sum up to 1. We are effectively assigning a "weight" to each timestep
    # Output shape is still (None,1) but each value changes
    attention_vals = Softmax(axis=1)(attention)
    # Multiply the encoded timestep by the respective weight
    # I.e. we are scaling each timestep based on its weight
    # Output shape is (None,5): (None,5)*(None,1)=(None,5)
    scaled_vecs = Multiply()([lstm, attention_vals])
    # Sum up all scaled timesteps into 1 vector
    # i.e. obtain a weighted sum of timesteps
    # Output shape is (5,): observe the time dimension got collapsed
    context_vector = Lambda(lambda x: K.sum(x, axis=1))(scaled_vecs)
    ## Attention Block over
    # Get the output out
    out = Dense(1, activation='sigmoid')(context_vector)
    model = Model(inp, out)
    model_with_attention_output = Model(inp, [out, attention_vals])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model, model_with_attention_output
model, model_with_attention_output = create_models()
model.fit(np.array([[1, 2, 3]]), np.array([1]), batch_size=1)
print('Attention over each word:', model_with_attention_output.predict(np.array([[1, 2, 3]]), batch_size=1)[1])
The output will be a numpy array with the attention value of each word; the higher the value, the more important the word was.
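For the highlighting itself you can map the attention values back to the words; here is a minimal sketch, assuming you have an index_to_word dictionary for your vocabulary (the name is just a placeholder):
# Rank the words of one review by their attention weights.
sequence = np.array([[1, 2, 3]])
_, attention = model_with_attention_output.predict(sequence, batch_size=1)
attention = attention[0, :, 0]  # shape (timesteps,)

words = [index_to_word[idx] for idx in sequence[0]]
for word, weight in sorted(zip(words, attention), key=lambda p: -p[1]):
    print('%-15s %.3f' % (word, weight))  # highest-weighted words first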
EDIT: You might want to replace lstm in the multiplication with embs to get better interpretations, but it will lead to worse performance...
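Concretely, that swap is a one-line change inside the attention block (everything else stays the same); the context vector then becomes a weighted sum of the raw word embeddings instead of the LSTM outputs:
# Weight the raw embeddings instead of the LSTM outputs
# Output shape is (None,10): (None,10)*(None,1)=(None,10)
scaled_vecs = Multiply()([embs, attention_vals])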