I've trained a sentiment classifier model using Keras library by following the below steps(broadly).
- Convert Text corpus into sequences using Tokenizer object/class
- Build a model using the model.fit() method
- Evaluate this model
Now for scoring using this model, I was able to save the model to a file and load from a file. However I've not found a way to save the Tokenizer object to file. Without this I'll have to process the corpus every time I need to score even a single sentence. Is there a way around this?
The most common way is to use either pickle
or joblib
. Here you have an example on how to use pickle
in order to save Tokenizer
:
import pickle
# saving
with open('tokenizer.pickle', 'wb') as handle:
pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
# loading
with open('tokenizer.pickle', 'rb') as handle:
tokenizer = pickle.load(handle)
The accepted answer clearly demonstrates how to save the tokenizer. The following is a comment on the problem of (generally) scoring after fitting or saving. Suppose that a list texts
is comprised of two lists Train_text
and Test_text
, where the set of tokens in Test_text
is a subset of the set of tokens in Train_text
(an optimistic assumption). Then fit_on_texts(Train_text)
gives different results for texts_to_sequences(Test_text)
as compared with first calling fit_on_texts(texts)
and then text_to_sequences(Test_text)
.
Concrete Example:
from keras.preprocessing.text import Tokenizer
docs = ["A heart that",
"full up like",
"a landfill",
"no surprises",
"and no alarms"
"a job that slowly"
"Bruises that",
"You look so",
"tired happy",
"no alarms",
"and no surprises"]
docs_train = docs[:7]
docs_test = docs[7:]
# EXPERIMENT 1: FIT TOKENIZER ONLY ON TRAIN
T_1 = Tokenizer()
T_1.fit_on_texts(docs_train) # only train set
encoded_train_1 = T_1.texts_to_sequences(docs_train)
encoded_test_1 = T_1.texts_to_sequences(docs_test)
print("result for test 1:\n%s" %(encoded_test_1,))
# EXPERIMENT 2: FIT TOKENIZER ON BOTH TRAIN + TEST
T_2 = Tokenizer()
T_2.fit_on_texts(docs) # both train and test set
encoded_train_2 = T_2.texts_to_sequences(docs_train)
encoded_test_2 = T_2.texts_to_sequences(docs_test)
print("result for test 2:\n%s" %(encoded_test_2,))
Results:
result for test 1:
[[3], [10, 3, 9]]
result for test 2:
[[1, 19], [5, 1, 4]]
Of course, if the above optimistic assumption is not satisfied and the set of tokens in Test_text is disjoint from that of Train_test, then test 1 results in a list of empty brackets [].
I've created the issue https://github.com/keras-team/keras/issues/9289 in the keras Repo. Until the API is changed, the issue has a link to a gist that has code to demonstrate how to save and restore a tokenizer without having the original documents the tokenizer was fit on. I prefer to store all my model information in a JSON file (because reasons, but mainly mixed JS/Python environment), and this will allow for that, even with sort_keys=True