I use the TensorFlow high-level API tf.contrib.learn to train and evaluate a DNN classifier for a series of binary text classification tasks (actually I need multi-label classification, but at the moment I check every label separately). My code is very similar to the tf.contrib.learn tutorial:
import tensorflow as tf

classifier = tf.contrib.learn.DNNClassifier(
    hidden_units=[10],
    n_classes=2,
    dropout=0.1,
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(training_set.data))
classifier.fit(x=training_set.data, y=training_set.target, steps=100)
val_accuracy_score = classifier.evaluate(
    x=validation_set.data, y=validation_set.target)["accuracy"]
The accuracy score varies roughly between 54% and 90% across runs, even though the 21 documents in the validation (test) set are always the same. What does this very large deviation mean? I understand there are some random factors involved (e.g. dropout), but to my understanding the model should converge towards an optimum.
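Since both dropout and weight initialization depend on TensorFlow's random number generator, I suppose I could rule them out by pinning the graph-level seed via RunConfig (just a sketch; the seed value 42 is an arbitrary choice of mine, not part of my original code):

# Pin the graph-level random seed so repeated runs are comparable.
run_config = tf.contrib.learn.RunConfig(tf_random_seed=42)

classifier = tf.contrib.learn.DNNClassifier(
    hidden_units=[10],
    n_classes=2,
    dropout=0.1,
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(training_set.data),
    config=run_config)

With a fixed seed the run-to-run variance should largely disappear, which would at least confirm that the deviation comes from randomness in training rather than from the data itself.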
I use words (lemmas), bigrams and trigrams, sentiment scores, and LIWC scores as features, so I have a very high-dimensional feature space with only 28 training and 21 validation documents. Can this cause problems? How can I consistently improve the results, apart from collecting more training data?
Update: To clarify, I generate a dictionary of occurring words and n-grams and discard those that occur only once, so I only use words (n-grams) that actually exist in the corpus.
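For reference, the vocabulary step is roughly equivalent to the following sketch using scikit-learn's CountVectorizer (train_texts and val_texts are placeholder names for my lemmatized documents; min_df=2 drops terms that appear in fewer than two documents, which approximates my discard-singletons rule):

from sklearn.feature_extraction.text import CountVectorizer

# Count unigrams, bigrams and trigrams over the lemmatized corpus,
# keeping only terms that occur at least twice (min_df=2).
vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=2)
X_train = vectorizer.fit_transform(train_texts).toarray()
X_val = vectorizer.transform(val_texts).toarray()  # reuse the training vocabulary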