Increasing Label Error Rate (Edit Distance) and Fluctuating Loss

Posted 2019-02-20 01:49

I am training a handwriting recognition model of this architecture:

{
  "network": [
    {
      "layer_type": "l2_normalize"
    },
    {
      "layer_type": "conv2d",
      "num_filters": 16,
      "kernel_size": 5,
      "stride": 1,
      "padding": "same"
    },
    {
      "layer_type": "max_pool2d",
      "pool_size": 2,
      "stride": 2,
      "padding": "same"
    },
    {
      "layer_type": "l2_normalize"
    },
    {
      "layer_type": "dropout",
      "keep_prob": 0.5
    },
    {
      "layer_type": "conv2d",
      "num_filters": 32,
      "kernel_size": 5,
      "stride": 1,
      "padding": "same"
    },
    {
      "layer_type": "max_pool2d",
      "pool_size": 2,
      "stride": 2,
      "padding": "same"
    },
    {
      "layer_type": "l2_normalize"
    },
    {
      "layer_type": "dropout",
      "keep_prob": 0.5
    },
    {
      "layer_type": "conv2d",
      "num_filters": 64,
      "kernel_size": 5,
      "stride": 1,
      "padding": "same"
    },
    {
      "layer_type": "max_pool2d",
      "pool_size": 2,
      "stride": 2,
      "padding": "same"
    },
    {
      "layer_type": "l2_normalize"
    },
    {
      "layer_type": "dropout",
      "keep_prob": 0.5
    },
    {
      "layer_type": "conv2d",
      "num_filters": 128,
      "kernel_size": 5,
      "stride": 1,
      "padding": "same"
    },
    {
      "layer_type": "max_pool2d",
      "pool_size": 2,
      "stride": 2,
      "padding": "same"
    },
    {
      "layer_type": "l2_normalize"
    },
    {
      "layer_type": "dropout",
      "keep_prob": 0.5
    },
    {
      "layer_type": "conv2d",
      "num_filters": 256,
      "kernel_size": 5,
      "stride": 1,
      "padding": "same"
    },
    {
      "layer_type": "max_pool2d",
      "pool_size": 2,
      "stride": 2,
      "padding": "same"
    },
    {
      "layer_type": "l2_normalize"
    },
    {
      "layer_type": "dropout",
      "keep_prob": 0.5
    },
    {
      "layer_type": "collapse_to_rnn_dims"
    },
    {
      "layer_type": "birnn",
      "num_hidden": 128,
      "cell_type": "LSTM",
      "activation": "tanh"
    }
  ],
  "output_layer": "ctc_decoder"
}
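
For context, each conv2d / max_pool2d / l2_normalize / dropout block in that config corresponds roughly to the following TF 1.x sketch (the conv_block helper is hypothetical, not my actual layer-building code):

import tensorflow as tf

def conv_block(inputs, num_filters, keep_prob=0.5):
    # 5x5 convolution, stride 1, "same" padding, as specified in the config
    net = tf.layers.conv2d(inputs, filters=num_filters, kernel_size=5,
                           strides=1, padding="same")
    # 2x2 max pooling with stride 2 halves both height and width
    net = tf.layers.max_pooling2d(net, pool_size=2, strides=2, padding="same")
    # L2-normalize over the channel axis, then dropout
    net = tf.nn.l2_normalize(net, axis=-1)
    return tf.nn.dropout(net, keep_prob=keep_prob)

Five such blocks (16, 32, 64, 128, 256 filters) run before the features are collapsed and fed to the bidirectional LSTM.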

The training CTC loss drops suddenly during the first training epoch, but then it plateaus and fluctuates for the rest of the epochs. The label error rate not only fluctuates but doesn't really seem to go any lower.

(plot: training CTC loss and label error rate over epochs)

I should mention that the sequence length of each sample ends up quite close to the length of the longest ground truth: the input width of 1024 is reduced to 32 time steps by the time it reaches ctc_loss, which is not much larger than the longest ground-truth length of 21.
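
Just to make that arithmetic explicit (a back-of-the-envelope check, not part of the model code): five max-pool layers with stride 2 divide the width by 2^5 = 32.

# Time-step budget after the five stride-2 pooling layers
input_width = 1024
num_pool_layers = 5
time_steps = input_width // (2 ** num_pool_layers)  # 1024 // 32 = 32

longest_label = 21
# CTC needs at least as many time steps as labels (plus extra frames for
# blanks between repeated characters), so 32 vs. 21 leaves little slack.
assert time_steps >= longest_label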

As for the preprocessing of the images, I made sure the aspect ratio is maintained when resizing, and right-padded each image to make it square, so that all the images have the same width and the handwritten words sit on the left. I also inverted the colors so that the handwritten characters have the highest pixel value (255) while the background has the lowest (0).

(sample preprocessed image)
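
Roughly, the preprocessing does something like the sketch below (not my exact code; cv2 and the target_size of 1024 are assumptions based on the width mentioned above):

import cv2
import numpy as np

def preprocess(image_gray, target_size=1024):
    # Resize while keeping the aspect ratio
    h, w = image_gray.shape
    scale = target_size / max(h, w)
    resized = cv2.resize(image_gray, (int(w * scale), int(h * scale)))
    # Invert so ink is 255 and background is 0
    inverted = 255 - resized
    # Pad right/bottom with zeros to a square, keeping the text on the left
    canvas = np.zeros((target_size, target_size), dtype=np.uint8)
    canvas[:inverted.shape[0], :inverted.shape[1]] = inverted
    return canvas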

The predictions look something like this: a seemingly random sequence of label indices in the first part, followed by a bunch of zeroes at the end (which is probably expected because of the padding).

INFO:tensorflow:outputs = [[59 45 59 45 59 55 59 55 59 45 59 55 59 55 59 55 45 59  8 59 55 45 55  8
  45  8 45 59 45  8 59  8 45 59 45  8 45 19 55 45 55 45 55 59 45 59 45  8
  45  8 45 55  8 45  8 45 59 45 55 59 55 59  8 55 59  8 45  8 45  8 59  8
  59 45 59 45 59 45 59 45 59 45 59 45 19 45 55 45 22 45 55 45 55  8 45  8
  59 45 59 45 59 45 59 55  8 45 59 45 59 45 59 45 19 45 59 45 19 59 55 24
   4 52 54 55]]
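
The decoded output is produced along these lines (a sketch with placeholders, not my exact code; the num_classes value of 80 is made up). Converting the sparse greedy decode to a dense tensor pads shorter sequences with the default value 0, which is where the trailing zeros would come from:

import tensorflow as tf

num_classes = 80  # assumed charset size + 1 for the CTC blank
logits = tf.placeholder(tf.float32, [None, None, num_classes])  # [max_time, batch, num_classes]
seq_lens = tf.placeholder(tf.int32, [None])                     # time steps per example

decoded, _ = tf.nn.ctc_greedy_decoder(logits, seq_lens)
# The dense conversion pads shorter sequences with default_value,
# so trailing zeros show up in the printed predictions.
dense_decoded = tf.sparse_tensor_to_dense(decoded[0], default_value=0)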

Here's how I collapse the CNN outputs to RNN dims:

import tensorflow as tf

def collapse_to_rnn_dims(inputs):
    # inputs: [batch, height, width, channels] feature map from the conv stack
    batch_size, height, width, num_channels = inputs.get_shape().as_list()
    if batch_size is None:
        batch_size = -1  # let tf.reshape infer the batch dimension
    # Make width the leading (time) axis: [width, batch, height, channels]
    time_major_inputs = tf.transpose(inputs, (2, 0, 1, 3))
    # Collapse height and channels into one feature vector per time step
    reshaped_time_major_inputs = tf.reshape(time_major_inputs,
                                            [width, batch_size, height * num_channels])
    # Back to batch-major: [batch, time_steps (= width), height * channels]
    batch_major_inputs = tf.transpose(reshaped_time_major_inputs, (1, 0, 2))
    return batch_major_inputs
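
To make the shape transformation concrete (assuming the 1024-wide input and the five stride-2 pools above, so the last conv layer emits a [batch, 32, 32, 256] feature map):

# Hypothetical shape check with batch size 2
features = tf.zeros([2, 32, 32, 256])
rnn_inputs = collapse_to_rnn_dims(features)
print(rnn_inputs.get_shape().as_list())  # [2, 32, 8192]: 32 time steps of 8192 features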

And here's how I collapse the RNN outputs to CTC dims:

import tensorflow as tf
from tensorflow.contrib import slim

def convert_to_ctc_dims(inputs, num_classes, num_steps, num_outputs):
    # Flatten to [batch * num_steps, num_outputs] so the dense layers act per time step
    outputs = tf.reshape(inputs, [-1, num_outputs])
    logits = slim.fully_connected(outputs, num_classes,
                                  weights_initializer=slim.xavier_initializer())
    logits = slim.fully_connected(logits, num_classes,
                                  weights_initializer=slim.xavier_initializer())
    # Reshape to the time-major [num_steps, batch, num_classes] layout that ctc_loss expects
    logits = tf.reshape(logits, [num_steps, -1, num_classes])
    return logits
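
And this is roughly how those logits feed into the loss (a sketch with placeholders, not my actual training loop; num_classes = 80 is again an assumption):

num_classes = 80   # assumed charset size + 1 for the CTC blank
num_steps = 32     # time steps left after the conv stack
num_hidden = 128   # BiLSTM hidden units, as in the config above

# Placeholders standing in for the BiLSTM outputs and the targets
rnn_outputs = tf.placeholder(tf.float32, [None, num_steps, 2 * num_hidden])
labels = tf.sparse_placeholder(tf.int32)
seq_lens = tf.placeholder(tf.int32, [None])

logits = convert_to_ctc_dims(rnn_outputs, num_classes, num_steps,
                             num_outputs=2 * num_hidden)
# convert_to_ctc_dims returns time-major logits, which is what ctc_loss expects
loss = tf.reduce_mean(tf.nn.ctc_loss(labels, logits, seq_lens, time_major=True))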
