When I run an AWS SageMaker TrainingJob with a custom model, it fails with an Algorithm Error if I use Keras with the TensorFlow backend in a multi-GPU configuration:
from keras.utils import multi_gpu_model
parallel_model = multi_gpu_model(model, gpus=K)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
parallel_model.fit(x, y, epochs=20, batch_size=256)
This simple parallel model setup fails. CloudWatch logging shows no further error or exception. The same configuration works properly on a local machine with 2x NVIDIA GTX 1080 and the same Keras TensorFlow backend.
According to the SageMaker documentation and tutorials, the multi_gpu_model utility works when the Keras backend is MXNet, but I did not find any mention of the TensorFlow backend with the same multi-GPU configuration.
[UPDATE]
I have updated the code as suggested in the answer below, and I'm adding the log output produced before the TrainingJob hangs.
The following log output appears twice:
2018-11-27 10:02:49.878414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2018-11-27 10:02:49.878462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-27 10:02:49.878471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2018-11-27 10:02:49.878477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2018-11-27 10:02:49.878481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2018-11-27 10:02:49.878486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2018-11-27 10:02:49.878492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2018-11-27 10:02:49.879340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 14874 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1b.0, compute capability: 7.0)
2018-11-27 10:02:49.879486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 14874 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1c.0, compute capability: 7.0)
2018-11-27 10:02:49.879694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:2 with 14874 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1d.0, compute capability: 7.0)
2018-11-27 10:02:49.879872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:3 with 14874 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
Before that, there is some logging info about each GPU, repeated 4 times (once per device):
2018-11-27 10:02:46.447639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.78GiB freeMemory: 15.37GiB
According to the logs, all 4 GPUs are visible and loaded by the TensorFlow Keras backend. No application logging follows after that; the TrainingJob status stays InProgress for a while and then becomes Failed with the same Algorithm Error.
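The claim that all 4 GPUs were created can be checked mechanically against the log text. The following is only a minimal sketch: the function name `created_gpus` is hypothetical, and the regular expression assumes the exact "Created TensorFlow device" line format shown in the log above (other TensorFlow versions may word it differently).

```python
import re

# Assumption: matches the "Created TensorFlow device" lines exactly as they
# appear in the log output quoted in this post.
CREATED_DEVICE = re.compile(
    r"Created TensorFlow device \(/device:GPU:(\d+) with (\d+) MB memory\)"
)


def created_gpus(log_text):
    """Return {gpu_index: memory_mb} for each GPU device reported as created.

    Repeated log blocks collapse into one entry per GPU index.
    """
    devices = {}
    for match in CREATED_DEVICE.finditer(log_text):
        devices[int(match.group(1))] = int(match.group(2))
    return devices


# Two lines copied verbatim from the log above.
sample = (
    "2018-11-27 10:02:49.879340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] "
    "Created TensorFlow device (/device:GPU:0 with 14874 MB memory) -> physical GPU "
    "(device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1b.0, compute capability: 7.0)\n"
    "2018-11-27 10:02:49.879486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] "
    "Created TensorFlow device (/device:GPU:1 with 14874 MB memory) -> physical GPU "
    "(device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1c.0, compute capability: 7.0)\n"
)

# The log block repeats twice in the job output, so feed it in twice here.
print(created_gpus(sample * 2))
```

Run against the full CloudWatch log, a result with fewer keys than the number of GPUs on the instance would point at device creation rather than at the training loop.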
Looking at the CloudWatch metrics, I can see some activity. Specifically, GPU Memory Utilization and CPU Utilization are ok, while GPU Utilization is 0%.
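To see per-GPU numbers rather than the aggregated instance metrics, the training script could log `nvidia-smi` output itself. This is a hedged sketch, not something from the SageMaker documentation: it assumes `nvidia-smi` is on the PATH inside the training container and that every queried field is numeric; the helper names `parse_gpu_stats` and `read_gpu_stats` and the sample text are made up for illustration.

```python
import subprocess

# nvidia-smi query for one CSV row per GPU: index, utilization %, memory used (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse rows like '0, 37, 14900' into a list of dicts, one per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append({"index": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return stats


def read_gpu_stats():
    """Run nvidia-smi and parse its output (assumes the binary is available)."""
    output = subprocess.check_output(QUERY).decode("utf-8")
    return parse_gpu_stats(output)


# Hypothetical sample text in the CSV shape requested above, not real job output.
sample = "0, 0, 14900\n1, 0, 310\n"
print(parse_gpu_stats(sample))
```

Calling `read_gpu_stats()` from a Keras callback at the end of each batch or epoch and printing the result would put these numbers into the same CloudWatch log stream as the TensorFlow messages.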
[UPDATE]
Due to a known bug in Keras related to saving a multi-GPU model, I'm using this override of the multi_gpu_model utility from keras.utils:
from keras.layers import Lambda, concatenate
from keras import Model
import tensorflow as tf

def multi_gpu_model(model, gpus):
    # source: https://github.com/keras-team/keras/issues/8123#issuecomment-354857044
    if isinstance(gpus, (list, tuple)):
        num_gpus = len(gpus)
        target_gpu_ids = gpus
    else:
        num_gpus = gpus
        target_gpu_ids = range(num_gpus)

    def get_slice(data, i, parts):
        shape = tf.shape(data)
        batch_size = shape[:1]
        input_shape = shape[1:]
        step = batch_size // parts
        if i == num_gpus - 1:
            size = batch_size - step * i
        else:
            size = step
        size = tf.concat([size, input_shape], axis=0)
        stride = tf.concat([step, input_shape * 0], axis=0)
        start = stride * i
        return tf.slice(data, start, size)

    all_outputs = []
    for i in range(len(model.outputs)):
        all_outputs.append([])

    # Place a copy of the model on each GPU,
    # each getting a slice of the inputs.
    for i, gpu_id in enumerate(target_gpu_ids):
        with tf.device('/gpu:%d' % gpu_id):
            with tf.name_scope('replica_%d' % gpu_id):
                inputs = []
                # Retrieve a slice of the input.
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_i = Lambda(get_slice,
                                     output_shape=input_shape,
                                     arguments={'i': i,
                                                'parts': num_gpus})(x)
                    inputs.append(slice_i)

                # Apply model on slice
                # (creating a model replica on the target device).
                outputs = model(inputs)
                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save the outputs for merging back together later.
                for o in range(len(outputs)):
                    all_outputs[o].append(outputs[o])

    # Merge outputs on CPU.
    with tf.device('/cpu:0'):
        merged = []
        for name, outputs in zip(model.output_names, all_outputs):
            merged.append(concatenate(outputs,
                                      axis=0, name=name))
        return Model(model.inputs, merged)
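The batch-splitting arithmetic inside `get_slice` can be reproduced without TensorFlow to see which samples each replica receives. This is only an illustrative sketch in plain Python; `slice_bounds` is a hypothetical helper, not part of Keras, and it mirrors the integer arithmetic of `get_slice` along the batch axis only.

```python
def slice_bounds(batch_size, parts):
    """Mirror get_slice along the batch axis.

    Every replica gets batch_size // parts samples; the last replica also
    takes whatever remainder is left over.
    """
    step = batch_size // parts
    bounds = []
    for i in range(parts):
        start = step * i
        if i == parts - 1:
            size = batch_size - step * i
        else:
            size = step
        bounds.append((start, start + size))
    return bounds


print(slice_bounds(256, 4))  # [(0, 64), (64, 128), (128, 192), (192, 256)]
print(slice_bounds(10, 4))   # [(0, 2), (2, 4), (4, 6), (6, 10)]
print(slice_bounds(3, 4))    # [(0, 0), (0, 0), (0, 0), (0, 3)]
```

With batch_size=256 and 4 GPUs, each replica gets 64 samples. The last case shows that a final batch smaller than the number of GPUs leaves all but the last replica with an empty slice, which is worth keeping in mind when the dataset size is not a multiple of the batch size.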
This works ok on a local machine (2x NVIDIA GTX 1080 / Intel Xeon / Ubuntu 16.04). It fails in a SageMaker Training Job.
I have posted this issue on the AWS SageMaker forum in the following threads:
TrainingJob custom algorithm with Keras backend and multi GPU
SageMaker Fails when using Multi-GPU with keras.utils.multi_gpu_model
[UPDATE]
I have slightly modified the tf.Session code, adding some initializers:
import tensorflow as tf
from keras import backend as K

with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
and now at least I can see from the instance metrics that one GPU (I assume device gpu:0) is used. Multi-GPU training still does not work.