TensorFlow CNN: Why is the validation loss significantly different from the training loss?

Posted 2019-09-19 12:51

This is a classification model for ten categories of images. My code consists of three files: convNet.py defines the CNN model, read_TFRecord.py reads the data, and train.py trains and evaluates the model. The training set has 80,000 samples and the validation set has 20,000 samples.

Question:

In the first epoch:

training loss = 2.11, train accuracy = 25.61%

validation loss = 3.05, validation accuracy = 8.29%

Why is the validation loss so different from the training loss right from the start? And why does the validation accuracy stay below 10%?

Over 10 epochs of training:

The training metrics keep improving normally, but the validation loss slowly increases and the validation accuracy keeps oscillating around 10%. Is this over-fitting? I have already taken some measures, such as adding regularization losses and dropout, but I do not know where the problem is. I hope you can help me.

convNet.py:

def convNet(features, mode):
    input_layer = tf.reshape(features, [-1, 100, 100, 3])
    tf.summary.image('input', input_layer)

    # conv1
    with tf.name_scope('conv1'):
         conv1 = tf.layers.conv2d(
             inputs=input_layer,
             filters=32,
             kernel_size=5,
             padding="same",
             activation=tf.nn.relu,
             kernel_initializer=tf.truncated_normal_initializer(stddev=0.01),
             name='conv1'
         )
         conv1_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv1')
         tf.summary.histogram('kernel', conv1_vars[0])
         tf.summary.histogram('bias', conv1_vars[1])
         tf.summary.histogram('act', conv1)

    # pool1  100->50
    pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2, name='pool1')

    # dropout
    pool1_dropout = tf.layers.dropout(
        inputs=pool1, rate=0.5, training=tf.equal(mode, learn.ModeKeys.TRAIN), name='pool1_dropout')

    # conv2
    with tf.name_scope('conv2'):
         conv2 = tf.layers.conv2d(
             inputs=pool1_dropout,
             filters=64,
             kernel_size=5,
             padding="same",
             activation=tf.nn.relu,
             kernel_initializer=tf.truncated_normal_initializer(stddev=0.01),
             name='conv2'
         )
         conv2_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv2')
         tf.summary.histogram('kernel', conv2_vars[0])
         tf.summary.histogram('bias', conv2_vars[1])
         tf.summary.histogram('act', conv2)

    # pool2  50->25
    pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2, name='pool2')

    # dropout
    pool2_dropout = tf.layers.dropout(
        inputs=pool2, rate=0.5, training=tf.equal(mode, learn.ModeKeys.TRAIN), name='pool2_dropout')

    # conv3
    with tf.name_scope('conv3'):
         conv3 = tf.layers.conv2d(
             inputs=pool2_dropout,
             filters=128,
             kernel_size=3,
             padding="same",
             activation=tf.nn.relu,
             kernel_initializer=tf.truncated_normal_initializer(stddev=0.01),
             name='conv3'
         )
         conv3_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv3')
         tf.summary.histogram('kernel', conv3_vars[0])
         tf.summary.histogram('bias', conv3_vars[1])
         tf.summary.histogram('act', conv3)

    # pool3  25->12
    pool3 = tf.layers.max_pooling2d(inputs=conv3, pool_size=[2, 2], strides=2, name='pool3')

    # dropout
    pool3_dropout = tf.layers.dropout(
        inputs=pool3, rate=0.5, training=tf.equal(mode, learn.ModeKeys.TRAIN), name='pool3_dropout')

    # conv4
    with tf.name_scope('conv4'):
         conv4 = tf.layers.conv2d(
             inputs=pool3_dropout,
             filters=128,
             kernel_size=3,
             padding="same",
             activation=tf.nn.relu,
             kernel_initializer=tf.truncated_normal_initializer(stddev=0.01),
             name='conv4'
         )
         conv4_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv4')
         tf.summary.histogram('kernel', conv4_vars[0])
         tf.summary.histogram('bias', conv4_vars[1])
         tf.summary.histogram('act', conv4)

    # pool4  12->6
    pool4 = tf.layers.max_pooling2d(inputs=conv4, pool_size=[2, 2], strides=2, name='pool4')

    # dropout
    pool4_dropout = tf.layers.dropout(
        inputs=pool4, rate=0.5, training=tf.equal(mode, learn.ModeKeys.TRAIN), name='pool4_dropout')

    pool4_flat = tf.reshape(pool4_dropout, [-1, 6 * 6 * 128])

    # fc1
    with tf.name_scope('fc1'):
         fc1 = tf.layers.dense(inputs=pool4_flat, units=1024, activation=tf.nn.relu,
                          kernel_initializer=tf.truncated_normal_initializer(stddev=0.01),
                          kernel_regularizer=tf.contrib.layers.l2_regularizer(0.01),
                          name='fc1')
         fc1_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'fc1')
         tf.summary.histogram('kernel', fc1_vars[0])
         tf.summary.histogram('bias', fc1_vars[1])
         tf.summary.histogram('act', fc1)

    # dropout
    fc1_dropout = tf.layers.dropout(
        inputs=fc1, rate=0.3, training=tf.equal(mode, learn.ModeKeys.TRAIN), name='fc1_dropout')

    # fc2
    with tf.name_scope('fc2'):
         fc2 = tf.layers.dense(inputs=fc1_dropout, units=512, activation=tf.nn.relu,
                          kernel_initializer=tf.truncated_normal_initializer(stddev=0.01),
                          kernel_regularizer=tf.contrib.layers.l2_regularizer(0.01),
                          name='fc2')
         fc2_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'fc2')
         tf.summary.histogram('kernel', fc2_vars[0])
         tf.summary.histogram('bias', fc2_vars[1])
         tf.summary.histogram('act', fc2)

    # dropout
    fc2_dropout = tf.layers.dropout(
        inputs=fc2, rate=0.3, training=tf.equal(mode, learn.ModeKeys.TRAIN), name='fc2_dropout')

    # logits
    with tf.name_scope('out'):
         logits = tf.layers.dense(inputs=fc2_dropout, units=10, activation=None,
                             kernel_initializer=tf.truncated_normal_initializer(stddev=0.01),
                             kernel_regularizer=tf.contrib.layers.l2_regularizer(0.01),
                             name='out')
         out_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'out')
         tf.summary.histogram('kernel', out_vars[0])
         tf.summary.histogram('bias', out_vars[1])
         tf.summary.histogram('act', logits)

    return logits

read_TFRecord.py:

def read_and_decode(filename, width, height, channel):
    filename_queue = tf.train.string_input_producer([filename])
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(serialized_example,
                                       features={
                                           'label': tf.FixedLenFeature([], tf.int64),
                                           'img_raw': tf.FixedLenFeature([], tf.string),
                                       })
    img = tf.decode_raw(features['img_raw'], tf.uint8)
    img = tf.reshape(img, [width, height, channel])
    img = tf.cast(img, tf.float16) * (1. / 255) - 0.5
    label = tf.cast(features['label'], tf.int16)
    return img, label

train.py:

# step 1
TRAIN_TFRECORD = 'F:/10-image-set2/train.tfrecords'  # train data set
VAL_TFRECORD = 'F:/10-image-set2/val.tfrecords'  # validation data set
WIDTH = 100  # image width
HEIGHT = 100  # image height
CHANNEL = 3  # image channel
TRAIN_BATCH_SIZE = 64
VAL_BATCH_SIZE = 16
train_img, train_label = read_and_decode(TRAIN_TFRECORD, WIDTH, HEIGHT, 
                         CHANNEL)
val_img, val_label = read_and_decode(VAL_TFRECORD, WIDTH, HEIGHT, CHANNEL)
x_train_batch, y_train_batch = tf.train.shuffle_batch([train_img, 
                               train_label], batch_size=TRAIN_BATCH_SIZE, 
                               capacity=80000,min_after_dequeue=79999, 
                               num_threads=64,name='train_shuffle_batch')
x_val_batch, y_val_batch = tf.train.shuffle_batch([val_img, val_label],
                           batch_size=VAL_BATCH_SIZE, 
                           capacity=20000,min_after_dequeue=19999, 
                           num_threads=64, name='val_shuffle_batch')

# step 2
x = tf.placeholder(tf.float32, shape=[None, WIDTH, HEIGHT, CHANNEL], 
                   name='x')
y_ = tf.placeholder(tf.int32, shape=[None, ], name='y_')
mode = tf.placeholder(tf.string, name='mode')
step = tf.get_variable(shape=(), dtype=tf.int32, initializer=tf.zeros_initializer(), name='step')
tf.add_to_collection(tf.GraphKeys.GLOBAL_STEP, step)
logits = convNet(x, mode) 
with tf.name_scope('Reg_losses'):
     reg_losses = tf.cond(tf.equal(mode, learn.ModeKeys.TRAIN),
                     lambda: tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)),
                     lambda: tf.constant(0, dtype=tf.float32))
with tf.name_scope('Loss'):
     loss = tf.losses.sparse_softmax_cross_entropy(labels=y_, logits=logits) + reg_losses
train_op = tf.train.AdamOptimizer().minimize(loss, step)
correct_prediction = tf.equal(tf.cast(tf.argmax(logits, 1), tf.int32), y_)
with tf.name_scope('Accuracy'):
     acc = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# step 3
tf.summary.scalar("reg_losses", reg_losses)
tf.summary.scalar("loss", loss)
tf.summary.scalar("accuracy", acc)
merged = tf.summary.merge_all()

# step 4
with tf.Session() as sess:
     summary_dir = './logs/summary/'

     sess.run(tf.global_variables_initializer())
     saver = tf.train.Saver()   
     saver = tf.train.Saver(max_to_keep=1)

     train_writer = tf.summary.FileWriter(summary_dir + 'train',
                                     sess.graph)
     valid_writer = tf.summary.FileWriter(summary_dir + 'valid')

     coord = tf.train.Coordinator()
     threads = tf.train.start_queue_runners(sess=sess, coord=coord) 
     max_acc = 0
     MAX_EPOCH = 10
     for epoch in range(MAX_EPOCH):
         # training
         train_step = int(80000 / TRAIN_BATCH_SIZE)
         train_loss, train_acc = 0, 0
         for step in range(epoch * train_step, (epoch + 1) * train_step):
             x_train, y_train = sess.run([x_train_batch, y_train_batch])
             train_summary, _, err, ac = sess.run([merged, train_op, loss, acc],
                                                  feed_dict={x: x_train, y_: y_train,
                                                             mode: learn.ModeKeys.TRAIN,
                                                             global_step: step})
             train_loss += err
             train_acc += ac
             if (step + 1) % 50 == 0:
                 train_writer.add_summary(train_summary, step)
         print("Epoch %d,train loss= %.2f,train accuracy=%.2f%%" % (
          epoch, (train_loss / train_step), (train_acc / train_step * 100.0)))

         # validation
         val_step = int(20000 / VAL_BATCH_SIZE)
         val_loss, val_acc = 0, 0
         for step in range(epoch * val_step, (epoch + 1) * val_step):
             x_val, y_val = sess.run([x_val_batch, y_val_batch])
             val_summary, err, ac = sess.run([merged, loss, acc],
                                        feed_dict={x: x_val, y_: y_val, mode: learn.ModeKeys.EVAL,
                                                   global_step: step})
             val_loss += err
             val_acc += ac
             if (step + 1) % 50 == 0:
                 valid_writer.add_summary(val_summary, step)
         print(
           "Epoch %d,validation loss= %.2f,validation accuracy=%.2f%%" % (
            epoch, (val_loss / val_step), (val_acc / val_step * 100.0)))

         # save model
         if val_acc > max_acc:
             max_acc = val_acc
             saver.save(sess, summary_dir + '/10-image.ckpt', epoch)
             print("model saved")
coord.request_stop()
coord.join(threads)

TensorBoard result:

(Orange is training; blue is validation.)

[TensorBoard screenshots: accuracy, loss, reg_losses, conv1, conv2, conv3, conv4, fc1, fc2, output]

My data:

[Sample images from the training and validation sets]

1 Answer

时光不老,我们不散 · 2019-09-19 13:11

I doubt this is an issue of overfitting - the losses are significantly different right from the start and diverge further well before you get through your first epoch (~500 batches). Without seeing your dataset it's difficult to say more, though as a first step I'd encourage you to visualize both training and evaluation input data to make sure the issue isn't something there. The fact that you initially get significantly less than 10% on a 10-class classification problem - i.e. worse than random guessing - indicates you almost certainly do have something wrong there.
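
For example, here is a minimal sketch of that kind of sanity check, assuming matplotlib is available and reusing the x_train_batch/y_train_batch and x_val_batch/y_val_batch ops from your train.py (remember read_and_decode shifts the pixels to the [-0.5, 0.5] range, so add 0.5 back before displaying):

import matplotlib.pyplot as plt
import tensorflow as tf

# Sanity-check sketch: pull one batch from each pipeline and look at the images/labels.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    train_imgs, train_labels = sess.run([x_train_batch, y_train_batch])
    val_imgs, val_labels = sess.run([x_val_batch, y_val_batch])

    plt.figure()
    plt.title('train label: %d' % train_labels[0])
    plt.imshow(train_imgs[0].astype('float32') + 0.5)  # undo the -0.5 shift
    plt.figure()
    plt.title('val label: %d' % val_labels[0])
    plt.imshow(val_imgs[0].astype('float32') + 0.5)
    plt.show()

    coord.request_stop()
    coord.join(threads)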

Having said that, you will likely run into problems with overfitting using this model because, despite what you may think, you aren't using dropout or regularization.

Dropout: mode == learn.ModeKeys.TRAIN evaluates to False if mode is a tensor, so dropout is never active. You could use tf.equal(mode, learn.ModeKeys.TRAIN), but I think you'd be much better off passing a training bool tensor to your convNet and feeding in the appropriate value, as sketched below.
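
A minimal sketch of that boolean-placeholder approach (the `training` placeholder and the single layer shown are just illustrative; the same pattern applies to every dropout layer in convNet):

import tensorflow as tf

# Boolean placeholder instead of the string `mode` placeholder.
training = tf.placeholder(tf.bool, shape=(), name='training')
x = tf.placeholder(tf.float32, shape=[None, 100, 100, 3], name='x')

# Pass `training` straight through to the dropout layers, e.g.:
conv1 = tf.layers.conv2d(x, filters=32, kernel_size=5, padding='same',
                         activation=tf.nn.relu, name='conv1')
pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2, name='pool1')
pool1_dropout = tf.layers.dropout(pool1, rate=0.5, training=training,
                                  name='pool1_dropout')

# When running the graph, feed the appropriate value:
#   sess.run(train_op, feed_dict={..., training: True})      # dropout active
#   sess.run([loss, acc], feed_dict={..., training: False})  # dropout disabled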

Regularization: you're creating the regularization loss terms and they're being added to the tf.GraphKeys.REGULARIZATION_LOSSES collection, but the loss you're minimizing doesn't use them. Add the following:

loss += tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))

before you optimize.
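
In the context of your train.py, that would look roughly like this (a sketch using your existing variable names):

# Cross-entropy loss plus every kernel_regularizer term that tf.layers collected.
with tf.name_scope('Loss'):
    loss = tf.losses.sparse_softmax_cross_entropy(labels=y_, logits=logits)
    loss += tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))

# The optimizer only "sees" the regularization terms once they are part of `loss`.
train_op = tf.train.AdamOptimizer().minimize(loss, global_step=step)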

A note on the optimization step: you shouldn't be feeding a value for the step into session.run like you are. Every time you run the optimization operation it will increment the step variable you passed to minimize when you created it, so just create it as an int variable and leave it alone. See the following example code:

import tensorflow as tf

x = tf.get_variable(shape=(4, 3), dtype=tf.float32,
                    initializer=tf.random_normal_initializer(), name='x')
loss = tf.nn.l2_loss(x)
step = tf.get_variable(shape=(), dtype=tf.int32,
                       initializer=tf.zeros_initializer(), name='step')
tf.add_to_collection(tf.GraphKeys.GLOBAL_STEP, step)  # good practice

opt = tf.train.AdamOptimizer().minimize(loss, step)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    step_val = sess.run(step)
    print(step_val)  # 0
    sess.run(opt)
    step_val = sess.run(step)
    print(step_val)  # 1