Multi-GPU model (stateful LSTM) on Keras

Posted 2019-08-02 18:22

Question:

I am working on a stateful LSTM model using Keras (TensorFlow backend), and I cannot parallelize it on a multi-GPU platform. Here is a link to the code. I am getting the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [256,75,39] vs. [512,75,39]

[[Node: training/cna/gradients/loss/concatenate_1_loss/mul_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _class=["loc:@loss/concatenate_1_loss/mul"], _device="/job:localhost/replica:0/task:0/gpu:0"](training/cna/gradients/loss/concatenate_1_loss/mul_grad/Shape, training/cna/gradients/loss/concatenate_1_loss/mul_grad/Shape_1)]]

[[Node: replica_1/sequential_1/dense_1/truediv/_473 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:1", send_device_incarnation=1, tensor_name="edge_3032_replica_1/sequential_1/dense_1/truediv", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]

I am using 2 GPUs with a batch size of 256. Please help.

Thanks in advance.

Answer 1:

This error seems to happen simply because you're dividing an original batch of size 512 into two smaller batches of size 256.

Stateful layers require a fixed batch size (see the parameter batch_shape or batch_input_shape at the beginning of the model).

You may try to recreate the model, changing the batch_shape (or batch_input_shape) to 256 (if it's currently 512), or the other way around if I'm mistaken about the current value.
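For illustration, here is a minimal sketch of a stateful model with a fixed batch size of 256. The 75 timesteps and 39 features come from the shapes in the error message; the layer sizes, activation, and loss are placeholders, since the original model isn't shown:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

per_gpu_batch = 256           # batch size each GPU/replica actually receives
timesteps, features = 75, 39  # taken from the shapes in the error message

model = Sequential()
# batch_input_shape fixes the batch dimension, which stateful layers require
model.add(LSTM(128, stateful=True, return_sequences=True,
               batch_input_shape=(per_gpu_batch, timesteps, features)))
model.add(Dense(features, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')
```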

If you already have a trained model with weights you want to keep, you can create another model with the same types of layers and the same shapes, changing only the input shape, and then call newModel.set_weights(oldModel.get_weights()).
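Continuing the sketch above (and reusing its imports and the timesteps/features values), a hedged example of that weight transfer; build_model is a hypothetical helper, not a Keras function, that rebuilds the same architecture with only the fixed batch size changed:

```python
# Hypothetical helper: same architecture as the sketch above,
# only the fixed batch size differs.
def build_model(batch_size):
    m = Sequential()
    m.add(LSTM(128, stateful=True, return_sequences=True,
               batch_input_shape=(batch_size, timesteps, features)))
    m.add(Dense(features, activation='softmax'))
    return m

old_model = build_model(512)   # the already-trained model (load its weights here)
new_model = build_model(256)   # identical layers, new fixed batch size

# Weight tensor shapes do not depend on the batch size, so a direct copy works.
new_model.set_weights(old_model.get_weights())
```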


That said, I don't think it's safe to parallelize a stateful model. In stateful models, "batch2" is the sequel of "batch1". Both batches represent the "same" sequence, and the order is absolutely important. If batch2 gets processed before batch1, you will be inputting an inverted sequence and your model will learn it wrong.

Unless the Keras documentation explicitly states that you can safely parallelize a stateful model, you might benefit from checking carefully (over many attempts) whether the parallelized model always gives the same result as the single-GPU model.
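One way to run such a check is sketched below; it assumes single_model and parallel_model have already been built with identical architectures and weights, and that batches is an ordered list of input arrays:

```python
import numpy as np

def outputs_match(single_model, parallel_model, batches, atol=1e-5):
    # Reset states so both models start from the same point in the sequence.
    single_model.reset_states()
    parallel_model.reset_states()
    for x in batches:  # same batches, in the same order, for both models
        a = single_model.predict(x, batch_size=x.shape[0])
        b = parallel_model.predict(x, batch_size=x.shape[0])
        # Any divergence suggests the sub-batches were not processed in the
        # order the stateful layers expect.
        if not np.allclose(a, b, atol=atol):
            return False
    return True
```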



Answer 2:

I'm currently working on stateful_multi_gpu, an experimental utility to build stateful RNN models for multi-GPU training.

Contrary to Daniel Möller's answer, I think you can explicitly control the order: which sub-batch is processed on which GPU and how the results are put back together.

I still need to test whether it trains correctly on multiple GPUs and whether it can parallelize arbitrary stateful models, so I'm interested in anyone's experience using this utility!