I am training a handwriting recognition model with this architecture:
{
"network": [
{
"layer_type": "l2_normalize"
},
{
"layer_type": "conv2d",
"num_filters": 16,
"kernel_size": 5,
"stride": 1,
"padding": "same"
},
{
"layer_type": "max_pool2d",
"pool_size": 2,
"stride": 2,
"padding": "same"
},
{
"layer_type": "l2_normalize"
},
{
"layer_type": "dropout",
"keep_prob": 0.5
},
{
"layer_type": "conv2d",
"num_filters": 32,
"kernel_size": 5,
"stride": 1,
"padding": "same"
},
{
"layer_type": "max_pool2d",
"pool_size": 2,
"stride": 2,
"padding": "same"
},
{
"layer_type": "l2_normalize"
},
{
"layer_type": "dropout",
"keep_prob": 0.5
},
{
"layer_type": "conv2d",
"num_filters": 64,
"kernel_size": 5,
"stride": 1,
"padding": "same"
},
{
"layer_type": "max_pool2d",
"pool_size": 2,
"stride": 2,
"padding": "same"
},
{
"layer_type": "l2_normalize"
},
{
"layer_type": "dropout",
"keep_prob": 0.5
},
{
"layer_type": "conv2d",
"num_filters": 128,
"kernel_size": 5,
"stride": 1,
"padding": "same"
},
{
"layer_type": "max_pool2d",
"pool_size": 2,
"stride": 2,
"padding": "same"
},
{
"layer_type": "l2_normalize"
},
{
"layer_type": "dropout",
"keep_prob": 0.5
},
{
"layer_type": "conv2d",
"num_filters": 256,
"kernel_size": 5,
"stride": 1,
"padding": "same"
},
{
"layer_type": "max_pool2d",
"pool_size": 2,
"stride": 2,
"padding": "same"
},
{
"layer_type": "l2_normalize"
},
{
"layer_type": "dropout",
"keep_prob": 0.5
},
{
"layer_type": "collapse_to_rnn_dims"
},
{
"layer_type": "birnn",
"num_hidden": 128,
"cell_type": "LSTM",
"activation": "tanh"
}
],
"output_layer": "ctc_decoder"
}
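For reference, each conv / max-pool / l2_normalize / dropout block above corresponds roughly to the following TF 1.x slim sketch (the helper name and the is_training flag are mine, not part of my actual code):

import tensorflow as tf
slim = tf.contrib.slim

def conv_block(net, num_filters, keep_prob, is_training):
    # One conv2d -> max_pool2d -> l2_normalize -> dropout block from the config above.
    net = slim.conv2d(net, num_filters, kernel_size=5, stride=1, padding='SAME')
    net = slim.max_pool2d(net, kernel_size=2, stride=2, padding='SAME')
    net = tf.nn.l2_normalize(net, axis=-1)  # normalize each feature vector
    net = slim.dropout(net, keep_prob=keep_prob, is_training=is_training)
    return net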
The training CTC loss drops sharply during the first epoch, but then it plateaus and fluctuates for the rest of training. The label error rate not only fluctuates but also doesn't really seem to go any lower.
I should mention that the number of time steps per sample ends up really close to the length of the longest ground truth (i.e. a width of 1024 becomes 32 time steps by the time it enters ctc_loss, which is close to the longest ground-truth length of 21).
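For clarity, the 32 just comes from the five stride-2 pooling layers halving the width:

# 1024-wide input, five stride-2 max pools:
seq_len = 1024 // 2 ** 5  # -> 32 time steps entering ctc_loss, vs. 21 labels max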
As for the preprocessing of the images, I made sure the aspect ratio is maintained when resizing, and I right-padded each image to make it square so that all images have the same width and the handwritten words sit on the left. I also inverted the colors so that the handwritten characters have the highest pixel value (255) and the background has the lowest (0).
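A sketch of that preprocessing (a hypothetical helper written with OpenCV/NumPy; my real code may differ in details):

import cv2
import numpy as np

def preprocess(image, target=1024):
    # Resize keeping the aspect ratio so the longer side fits `target`.
    h, w = image.shape[:2]
    scale = target / float(max(h, w))
    resized = cv2.resize(image, (int(w * scale), int(h * scale)))
    # Invert first so the ink is 255 and the background is 0 ...
    inverted = 255 - resized
    # ... then the zero padding on the right/bottom matches the background.
    padded = np.zeros((target, target), dtype=np.uint8)
    padded[:inverted.shape[0], :inverted.shape[1]] = inverted
    return padded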
The predictions look something like this: a random-looking sequence of labels in the first part, then a bunch of zeroes at the end (which is probably expected because of the padding).
INFO:tensorflow:outputs = [[59 45 59 45 59 55 59 55 59 45 59 55 59 55 59 55 45 59 8 59 55 45 55 8
45 8 45 59 45 8 59 8 45 59 45 8 45 19 55 45 55 45 55 59 45 59 45 8
45 8 45 55 8 45 8 45 59 45 55 59 55 59 8 55 59 8 45 8 45 8 59 8
59 45 59 45 59 45 59 45 59 45 59 45 19 45 55 45 22 45 55 45 55 8 45 8
59 45 59 45 59 45 59 55 8 45 59 45 59 45 59 45 19 45 59 45 19 59 55 24
4 52 54 55]]
Here's how I collapse the CNN outputs to RNN dims:
def collapse_to_rnn_dims(inputs):
    # Collapse [batch, height, width, channels] feature maps into
    # [batch, width, height * channels] so that width becomes the time axis.
    batch_size, height, width, num_channels = inputs.get_shape().as_list()
    if batch_size is None:
        batch_size = -1
    time_major_inputs = tf.transpose(inputs, (2, 0, 1, 3))
    reshaped_time_major_inputs = tf.reshape(
        time_major_inputs, [width, batch_size, height * num_channels])
    batch_major_inputs = tf.transpose(reshaped_time_major_inputs, (1, 0, 2))
    return batch_major_inputs
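As a sanity check on the shapes (assuming the 1024x1024 input above, so the last conv stack emits 32x32x256 feature maps):

features = tf.placeholder(tf.float32, [None, 32, 32, 256])
rnn_inputs = collapse_to_rnn_dims(features)
print(rnn_inputs.get_shape().as_list())  # [None, 32, 8192]: 32 time steps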
And here's how I collapse the RNN outputs to CTC dims:
def convert_to_ctc_dims(inputs, num_classes, num_steps, num_outputs):
    # Flatten to [batch * num_steps, num_outputs], project down to num_classes,
    # then reshape to the time-major [num_steps, batch, num_classes] layout
    # that tf.nn.ctc_loss expects.
    outputs = tf.reshape(inputs, [-1, num_outputs])
    # Note: slim.fully_connected applies tf.nn.relu by default.
    logits = slim.fully_connected(outputs, num_classes,
                                  weights_initializer=slim.xavier_initializer())
    logits = slim.fully_connected(logits, num_classes,
                                  weights_initializer=slim.xavier_initializer())
    logits = tf.reshape(logits, [num_steps, -1, num_classes])
    return logits
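These logits are then consumed time-major by tf.nn.ctc_loss, roughly like this (rnn_outputs, sparse_labels, and seq_lengths stand in for my real tensors):

# num_outputs = 2 * 128 because the biLSTM concatenates both directions
logits = convert_to_ctc_dims(rnn_outputs, num_classes=num_classes,
                             num_steps=32, num_outputs=2 * 128)
loss = tf.reduce_mean(tf.nn.ctc_loss(labels=sparse_labels,
                                     inputs=logits,
                                     sequence_length=seq_lengths))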