I've built a toy model for image classification, loosely structured like the TensorFlow CIFAR-10 tutorial. Training starts fine, but the program eventually crashes: without fail the machine freezes and forces a hard restart (or a long wait for an eventual reboot). I've finalized the graph in case ops were being added to it somewhere, and in TensorBoard the graph looks fine. The way it exits makes it look like a GPU memory issue, but the model is small and should fit; even if I let it allocate the full GPU memory (another 4 GB), it still crashes.
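(For reference, "allocating the full GPU memory" just means dropping the 0.75 cap from the session config shown below; a minimal sketch of the two variants:)

# capped run: let TensorFlow use at most 75% of the GPU's memory
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.75)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

# full-memory run: the default config lets TensorFlow reserve nearly all GPU memory
sess = tf.Session()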
The data are 256x256x3 images with int64 labels, stored in a TFRecords file. The training function code looks like:
import os
import time

import tensorflow as tf

# BATCH_SIZE, ROOT, MIN_QUEUE_EXAMPLES, and inference() are defined elsewhere in the module.

def train():
    with tf.Graph().as_default():
        global_step = tf.contrib.framework.get_or_create_global_step()

        # input pipeline and model
        train_images_batch, train_labels_batch = distorted_inputs(batch_size=BATCH_SIZE)
        train_logits = inference(train_images_batch)
        train_batch_loss = loss(train_logits, train_labels_batch)
        train_op = training(train_batch_loss, global_step, 0.1)

        merged = tf.summary.merge_all()
        saver = tf.train.Saver(tf.global_variables())

        # cap TensorFlow at 75% of GPU memory
        gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.75)
        sess_config = tf.ConfigProto(gpu_options=gpu_options)
        sess = tf.Session(config=sess_config)

        train_summary_writer = tf.summary.FileWriter(
            os.path.join(ROOT, 'logs', 'train'), sess.graph)

        init = tf.global_variables_initializer()
        sess.run(init)

        # start the input queue threads
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)

        # finalize the graph in use so no further ops can be added
        # (tf.Graph().finalize() would only finalize a new, empty graph)
        tf.get_default_graph().finalize()

        for i in range(5540):
            start_time = time.time()
            summary, _, batch_loss = sess.run([merged, train_op, train_batch_loss])
            duration = time.time() - start_time
            train_summary_writer.add_summary(summary, i)
            if i % 10 == 0:
                print('batch: {} loss: {:.6f} time: {:.3f} sec/batch'.format(
                    i, batch_loss, duration))

        coord.request_stop()
        coord.join(threads)
        sess.close()
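(Aside: the same loop can also be driven by tf.train.MonitoredTrainingSession, which starts the queue runners and handles checkpointing and summaries itself; a rough sketch on top of the same graph-building functions, not what I actually run:)

def train_monitored():
    # hypothetical variant of train() using MonitoredTrainingSession
    with tf.Graph().as_default():
        global_step = tf.contrib.framework.get_or_create_global_step()
        images, labels = distorted_inputs(batch_size=BATCH_SIZE)
        train_op = training(loss(inference(images), labels), global_step, 0.1)
        hooks = [tf.train.StopAtStepHook(last_step=5540)]
        with tf.train.MonitoredTrainingSession(
                checkpoint_dir=os.path.join(ROOT, 'logs', 'train'),
                hooks=hooks) as sess:
            while not sess.should_stop():
                sess.run(train_op)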
The loss and training ops are cross-entropy and the Adam optimizer, respectively:
def loss(logits, labels):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits, name='cross_entropy_per_example')
    xentropy_mean = tf.reduce_mean(xentropy, name='cross_entropy')
    tf.add_to_collection('losses', xentropy_mean)
    return xentropy_mean

def training(loss, global_step, learning_rate):
    optimizer = tf.train.AdamOptimizer(learning_rate)
    train_op = optimizer.minimize(loss, global_step=global_step)
    return train_op
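(Note: the cross-entropy mean gets added to a 'losses' collection, CIFAR-10-tutorial style; if weight-decay terms were also added to that collection, the quantity to minimize would be combined roughly like this — a sketch, not something the code above currently does:)

def total_loss():
    # sum cross-entropy plus any regularization terms that were
    # added to the 'losses' collection (as in the CIFAR-10 tutorial)
    return tf.add_n(tf.get_collection('losses'), name='total_loss')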
And the batches are generated with
def distorted_inputs(batch_size):
    filename_queue = tf.train.string_input_producer(
        ['data/train.tfrecords'], num_epochs=None)
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized_example,
        features={'label': tf.FixedLenFeature([], tf.int64),
                  'image': tf.FixedLenFeature([], tf.string)})
    label = features['label']
    label = tf.cast(label, tf.int32)
    image = tf.decode_raw(features['image'], tf.uint8)
    image = (tf.cast(image, tf.float32) / 255) - 0.5
    image = tf.reshape(image, shape=[256, 256, 3])
    # data augmentation
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_flip_left_right(image)
    print('filling the queue with {} images '
          'before starting to train'.format(MIN_QUEUE_EXAMPLES))
    return _generate_batch(image, label, MIN_QUEUE_EXAMPLES, BATCH_SIZE)
and
def _generate_batch(image, label,
                    min_queue_examples=MIN_QUEUE_EXAMPLES,
                    batch_size=BATCH_SIZE):
    images_batch, labels_batch = tf.train.shuffle_batch(
        [image, label], batch_size=batch_size,
        num_threads=12, capacity=min_queue_examples + 3 * BATCH_SIZE,
        min_after_dequeue=min_queue_examples)
    tf.summary.image('images', images_batch)
    return images_batch, labels_batch
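(For completeness, each record in data/train.tfrecords has to match the feature spec above — a 'label' int64 feature and an 'image' bytes feature holding the raw uint8 pixels. The writer script isn't shown here, but it's roughly along these lines:)

import numpy as np
import tensorflow as tf

def write_record(writer, image, label):
    # image: 256x256x3 numpy array of dtype uint8, label: Python int
    example = tf.train.Example(features=tf.train.Features(feature={
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[image.astype(np.uint8).tobytes()])),
    }))
    writer.write(example.SerializeToString())

# usage:
# with tf.python_io.TFRecordWriter('data/train.tfrecords') as writer:
#     write_record(writer, img, lbl)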
What am I missing?
So I resolved this. Here's the solution in case it's useful to someone else. TL;DR: it's a hardware issue.
Specifically, it's a PCIe bus error, the same error as the one with the most votes here. It is possibly caused by message-signalled interrupts (MSIs) being incompatible with the PLX switches, as suggested here. Also in that thread is what resolved the issue: setting the kernel parameter

pci=nommconf

which disables memory-mapped PCI configuration (MMCONF). Of TensorFlow, Torch, and Theano, TensorFlow is the only deep learning framework that triggers this issue. Why, I'm not sure.
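In case it helps anyone: on a typical Ubuntu/Debian setup the parameter goes on the kernel command line via GRUB (assuming the stock defaults in /etc/default/grub; adjust for your distro):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=nommconf"

followed by sudo update-grub and a reboot.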