Google Cloud ML exited with a non-zero status of 2

2019-08-03 09:42发布

问题:

I tried to train my model on Google Cloud ML using this sample code:

import keras
from keras import optimizers
from keras import losses
from keras import metrics
from keras.models import Model, Sequential
from keras.layers import Dense, Lambda, RepeatVector, TimeDistributed
import numpy as np

def test():
    model = Sequential()
    model.add(Dense(2, input_shape=(3,)))
    model.add(RepeatVector(3))
    model.add(TimeDistributed(Dense(3)))
    model.compile(loss=losses.MSE,
                  optimizer=optimizers.RMSprop(lr=0.0001),
                  metrics=[metrics.categorical_accuracy],
                  sample_weight_mode='temporal')
    x = np.random.random((1, 3))
    y = np.random.random((1, 3, 3))
    model.train_on_batch(x, y)

if __name__ == '__main__':
    test()

and i got this error:

The replica master 0 exited with a non-zero status of 245. Termination reason: Error.

Detailed error output is big, so i'm pasting it here in pastebin

回答1:

Note this output:

Module raised an exception for failing to call a subprocess Command '['python', '-m', u'trainer.test', '--job-dir', u'gs://my_test_bucket_keras/s_27_100630']' returned non-zero exit status -11.

And I guess the google cloud will run your code with an extra parameter called --job-dir. So perhaps you can try add the following code in your example code?

import ...
import argparse

def test():
model = Sequential()
model.add(Dense(2, input_shape=(3,)))
model.add(RepeatVector(3))
model.add(TimeDistributed(Dense(3)))
model.compile(loss=losses.MSE,
              optimizer=optimizers.RMSprop(lr=0.0001),
              metrics=[metrics.categorical_accuracy],
              sample_weight_mode='temporal')
x = np.random.random((1, 3))
y = np.random.random((1, 3, 3))
model.train_on_batch(x, y)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Input Arguments
    parser.add_argument(
      '--job-dir',
      help='GCS location to write checkpoints and export models',
      required=True
    )
    args = parser.parse_args()
    arguments = args.__dict__

    test()
    # test(**arguments) # or if you want to use this job_dir parameter in your code

Not 100% sure this will work but I think you can give it a try. Also I have a post here to do something similar, perhaps you can take a look there as well.



回答2:

Problem is resolved. All I had to do is use tensorflow 1.1.0 instead default 1.0.1