I'm trying the Keras Cloud ML sample (https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/keras) and I can't get the cloud training to run. Local training, both with python and with gcloud, seems to work fine.
I've looked for a solution on Stack Exchange and Google, and I've read https://cloud.google.com/ml-engine/docs/how-tos/troubleshooting, but I seem to be the only one with this problem (usually a strong indication that the fault is entirely mine!). In addition to the environment below, I've tried Python 3.6 and TensorFlow 1.3, with no success.
I'm a noob, so I'm probably erring in some basic way, but I cannot spot it.
Any and all help is appreciated,
:-)
yarc68000.
----------- environment --------
(env1) $ python --version
Python 2.7.13 :: Continuum Analytics, Inc.
(env1) $ conda list | grep 'h5py\|keras\|pandas\|numexpr\|tensorflow'
h5py 2.7.1 py27_1 conda-forge
keras 2.0.6 py27_0 conda-forge
numexpr 2.6.2 py27_1 conda-forge
pandas 0.20.3 py27_0 anaconda
tensorflow 1.2.1 <pip>
(env1) $ gcloud --version
Google Cloud SDK 172.0.1
alpha 2017.09.15
beta 2017.09.15
bq 2.0.26
core 2017.09.21
datalab 20170818
gcloud
gsutil 4.27
----------- job --------
(env1) $ export BUCKET=gs://j170922census1
(env1) $ gsutil mb $BUCKET
Creating gs://j170922census1/...
(env1) $ export TRAIN_FILE=gs://cloudml-public/census/data/adult.data.csv
(env1) $ export EVAL_FILE=gs://cloudml-public/census/data/adult.test.csv
(env1) $ export JOB_NAME="census_keras_$$"
(env1) $ export TRAIN_STEPS=200
(env1) $ gcloud ml-engine jobs submit training $JOB_NAME --stream-logs --runtime-version 1.2 --job-dir $BUCKET --package-path trainer --module-name trainer.task --region us-central1 -- --train-files $TRAIN_FILE --eval-files $EVAL_FILE --train-steps $TRAIN_STEPS
Job [census_keras_7639] submitted successfully.
INFO 2017-09-22 19:56:56 +0200 service Validating job requirements...
INFO 2017-09-22 19:56:57 +0200 service Job creation request has been successfully validated.
INFO 2017-09-22 19:56:57 +0200 service Job census_keras_7639 is queued.
INFO 2017-09-22 19:56:57 +0200 service Waiting for job to be provisioned.
INFO 2017-09-22 20:01:39 +0200 service Waiting for TensorFlow to start.
INFO 2017-09-22 20:02:55 +0200 master-replica-0 Running task with arguments: --cluster={"master": ["master-cc38d44a51-0:2222"]} --task={"type": "master", "index": 0} --job={
<..>
INFO 2017-09-22 20:04:00 +0200 master-replica-0 197/200 [============================>.] - ETA: 0s - loss: 0.6931 - acc: 0.7563
INFO 2017-09-22 20:04:00 +0200 master-replica-0 200/200 [==============================] - 1s - loss: 0.6931 - acc: 0.7600
INFO 2017-09-22 20:04:00 +0200 master-replica-0 Epoch 10/20
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 Traceback (most recent call last):
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 "__main__", fname, loader, pkg_name)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 exec code in run_globals
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 199, in <module>
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 dispatch(**parse_args.__dict__)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 121, in dispatch
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 callbacks=callbacks)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 return func(*args, **kwargs)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/models.py", line 1110, in fit_generator
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 initial_epoch=initial_epoch)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 return func(*args, **kwargs)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/engine/training.py", line 1849, in fit_generator
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 callbacks.on_epoch_begin(epoch)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/callbacks.py", line 63, in on_epoch_begin
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 callback.on_epoch_begin(epoch, logs)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 57, in on_epoch_begin
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 census_model = load_model(checkpoints[-1])
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 IndexError: list index out of range
<..>
INFO 2017-09-22 20:04:53 +0200 service Finished tearing down TensorFlow.
INFO 2017-09-22 20:05:49 +0200 service Job failed.
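
The last frame of the traceback fails on `checkpoints[-1]`, which suggests that `checkpoints` was an empty list when the callback ran at the start of epoch 10. The log does not show how that list is built, so the following is only a minimal sketch of the failure mode, assuming the list comes from a file-name pattern match. The helper `latest_checkpoint` and the `checkpoint.*` naming scheme are hypothetical and are not taken from the sample's trainer/task.py.

```python
import glob
import os
import tempfile


def latest_checkpoint(pattern):
    """Return the lexicographically last path matching pattern, or None.

    Hypothetical helper, not part of the sample's trainer/task.py.
    """
    checkpoints = sorted(glob.glob(pattern))
    if not checkpoints:
        # Indexing an empty list with [-1] would raise
        # "IndexError: list index out of range", as in the log above.
        return None
    return checkpoints[-1]


# Demonstration in a temporary local directory.
workdir = tempfile.mkdtemp()
pattern = os.path.join(workdir, "checkpoint.*")  # assumed naming scheme

# Nothing has been written yet, so the pattern matches no files.
no_match = latest_checkpoint(pattern)

# Create two empty placeholder files that match the pattern.
for name in ("checkpoint.01.hdf5", "checkpoint.02.hdf5"):
    open(os.path.join(workdir, name), "w").close()

newest = latest_checkpoint(pattern)
print(no_match, os.path.basename(newest))
```

Checking for an empty result before indexing turns the crash into a case the caller can handle, for example by skipping evaluation for that epoch. It does not explain why no checkpoint was found in the cloud run.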