I was looking at the Keras IMDB Movie reviews sentiment classification example (and the corresponding model on github), which learns to decide whether a review is positive or negative.
The data has been preprocessed such that each review is encoded as a sequence of integers, e.g. the review "This movie is awesome!" would be [11, 17, 6, 1187]
and for this input the model gives the output 'positive'.
The dataset also makes available the word index used for encoding the sequences, i.e. I know the map
This: 11
movie: 17
is: 6
awesome: 1187
...
Can I somehow include this knowledge into the model such that its input is a string, i.e. it gives a prediction based on the input "This movie is awesome!"?
First up, the input to a neural network is never a string; it is always a list of indices of words (or characters) in a vocabulary. The first thing the model usually does is an embedding transformation (see the example), which converts these indices into (trainable) float vectors.
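To make the embedding step concrete, here is a minimal pure-Python sketch of the lookup. The table sizes, the random initial values, and the `embed` helper are illustrative assumptions, not Keras internals; in a real model the rows of the table are weights learned during training.

```python
import random

random.seed(0)

# Illustrative sizes only (assumptions, not taken from the Keras example).
vocab_size, embedding_dim = 2000, 4

# One row of floats per vocabulary index. A real embedding layer starts
# from small random values like these and updates them during training.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(indices, table):
    """Replace each word index with its row of the embedding table."""
    return [table[i] for i in indices]

# The encoded review from the question: one vector per word.
vectors = embed([11, 17, 6, 1187], embedding_table)
print(len(vectors), len(vectors[0]))  # 4 4
```

The point is only that the network consumes integer indices and turns them into vectors itself; the string never reaches it.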
What you really mean is a data pre-processing step that transforms the raw input from the user (text, image pixels, a sound recording, etc.) into a format that is suitable and convenient for the model. Data pre-processing is as essential a part of a machine-learning application as the model itself, and it should be stored separately from the model. If you intend to work with the IMDB dataset, the vocabulary has already been built for you. You can call
imdb.get_word_index()
in Keras to get the word index, or you can work with the vocabulary JSON file directly.
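A pre-processing step along these lines could turn a raw string into the index sequence the model expects. This is only a sketch: `encode_review`, its parameters, and the tiny `WORD_INDEX` dictionary are hypothetical stand-ins (the real dictionary would come from `imdb.get_word_index()`, whose keys are lowercase words), and the simple regex tokenizer is an assumption that may not match how the dataset was originally tokenized.

```python
import re

# Hypothetical excerpt of the word index, using the values from the question.
# In practice this dictionary would come from imdb.get_word_index().
WORD_INDEX = {"this": 11, "movie": 17, "is": 6, "awesome": 1187}

def encode_review(text, word_index, index_from=0, oov_index=None):
    """Encode a raw review string as a list of word indices.

    index_from is added to every known word's index; unknown words are
    mapped to oov_index, or dropped when oov_index is None.
    """
    # Lowercase and keep runs of letters, digits and apostrophes as tokens.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    encoded = []
    for token in tokens:
        idx = word_index.get(token)
        if idx is None:
            if oov_index is not None:
                encoded.append(oov_index)
            continue
        encoded.append(idx + index_from)
    return encoded

print(encode_review("This movie is awesome!", WORD_INDEX))
# [11, 17, 6, 1187]
```

One caveat: by default `imdb.load_data()` shifts every word index by `index_from=3`, marks the start of each sequence with 1, and uses 2 for out-of-vocabulary words. For a model trained on that data, the call would likely need `index_from=3, oov_index=2` and a leading 1 prepended to the result. The output of this function is then what you pass to the model, after padding it to the sequence length used in training.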