Keras Google Cloud ML sample: IndexError


I'm trying the Keras Cloud ML sample (https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/keras) and I can't seem to get the cloud training to run. Local training, both with python and with gcloud, seems to go well.

I've searched Stack Exchange and Google and read https://cloud.google.com/ml-engine/docs/how-tos/troubleshooting, but I seem to be the only one with this problem (usually a strong indication that the fault is entirely mine!). Besides the environment below, I've also tried Python 3.6 and TensorFlow 1.3, with no success.

I'm a noob, so I'm probably erring in some basic way, but I cannot spot it.

Any and all help is appreciated,

:-)

yarc68000.

----------- environment --------

(env1) $ python --version
Python 2.7.13 :: Continuum Analytics, Inc.
(env1) $ conda list | grep 'h5py\|keras\|pandas\|numexpr\|tensorflow'
h5py                      2.7.1                    py27_1    conda-forge
keras                     2.0.6                    py27_0    conda-forge
numexpr                   2.6.2                    py27_1    conda-forge
pandas                    0.20.3                   py27_0    anaconda
tensorflow                1.2.1                     <pip>
(env1) $ gcloud --version
Google Cloud SDK 172.0.1
alpha 2017.09.15
beta 2017.09.15
bq 2.0.26
core 2017.09.21
datalab 20170818
gcloud 
gsutil 4.27

----------- job --------

(env1) $ export BUCKET=gs://j170922census1
(env1) $ gsutil mb $BUCKET
Creating gs://j170922census1/...
(env1) $ export TRAIN_FILE=gs://cloudml-public/census/data/adult.data.csv
(env1) $ export EVAL_FILE=gs://cloudml-public/census/data/adult.test.csv
(env1) $ export JOB_NAME="census_keras_$$"
(env1) $ export TRAIN_STEPS=200
(env1) $ gcloud ml-engine jobs submit training $JOB_NAME --stream-logs --runtime-version 1.2 --job-dir $BUCKET --package-path trainer --module-name trainer.task --region us-central1 -- --train-files $TRAIN_FILE --eval-files $EVAL_FILE --train-steps $TRAIN_STEPS
Job [census_keras_7639] submitted successfully.
INFO    2017-09-22 19:56:56 +0200   service     Validating job requirements...
INFO    2017-09-22 19:56:57 +0200   service     Job creation request has been successfully validated.
INFO    2017-09-22 19:56:57 +0200   service     Job census_keras_7639 is queued.
INFO    2017-09-22 19:56:57 +0200   service     Waiting for job to be provisioned.
INFO    2017-09-22 20:01:39 +0200   service     Waiting for TensorFlow to start.
INFO    2017-09-22 20:02:55 +0200   master-replica-0        Running task with arguments: --cluster={"master": ["master-cc38d44a51-0:2222"]} --task={"type": "master", "index": 0} --job={
<..>
INFO    2017-09-22 20:04:00 +0200   master-replica-0        197/200 [============================>.] - ETA: 0s - loss: 0.6931 - acc: 0.7563
INFO    2017-09-22 20:04:00 +0200   master-replica-0        200/200 [==============================] - 1s - loss: 0.6931 - acc: 0.7600     
INFO    2017-09-22 20:04:00 +0200   master-replica-0        Epoch 10/20
ERROR   2017-09-22 20:04:02 +0200   master-replica-0        Traceback (most recent call last):
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            "__main__", fname, loader, pkg_name)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            exec code in run_globals
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 199, in <module>
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            dispatch(**parse_args.__dict__)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 121, in dispatch
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            callbacks=callbacks)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            return func(*args, **kwargs)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/keras/models.py", line 1110, in fit_generator
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            initial_epoch=initial_epoch)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            return func(*args, **kwargs)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/keras/engine/training.py", line 1849, in fit_generator
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            callbacks.on_epoch_begin(epoch)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/keras/callbacks.py", line 63, in on_epoch_begin
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            callback.on_epoch_begin(epoch, logs)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 57, in on_epoch_begin
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            census_model = load_model(checkpoints[-1])
ERROR   2017-09-22 20:04:02 +0200   master-replica-0        IndexError: list index out of range
<..>
INFO    2017-09-22 20:04:53 +0200   service     Finished tearing down TensorFlow.
INFO    2017-09-22 20:05:49 +0200   service     Job failed.
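
From the traceback, the IndexError comes from the sample's checkpoint-reloading callback in trainer/task.py: on_epoch_begin calls load_model(checkpoints[-1]), and checkpoints is presumably the list of checkpoint files found in the job directory, which is empty because nothing was ever written to GCS. A minimal defensive sketch (my own guess, assuming a checkpoint.* file naming pattern; this is not the sample's actual fix) would guard the reload:

import glob
import os

from keras.models import load_model


def latest_checkpoint(checkpoint_dir, pattern='checkpoint.*'):
    """Return the newest matching checkpoint path, or None if none exist yet."""
    checkpoints = sorted(glob.glob(os.path.join(checkpoint_dir, pattern)))
    return checkpoints[-1] if checkpoints else None


# Inside an on_epoch_begin callback, only reload when something was saved:
# path = latest_checkpoint(job_dir)
# if path is not None:
#     census_model = load_model(path)

That would avoid the crash, but it wouldn't explain why no checkpoints show up in the bucket in the first place.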

1 Answer
小情绪 Triste *

There actually was a bug when running this on Cloud ML Engine: checkpointing is disabled for now when the job directory is on GCS, because Keras can't natively write checkpoints to GCS. See this PR for the immediate fix to the issue you are facing. Also take a look at the pending PR, which fixes the checkpoint issue and makes the checkpoint files available on GCS (a workaround for Keras not being able to write to GCS directly).
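
As a rough illustration of that workaround (my own sketch, not necessarily what either PR does; it assumes the checkpoint is first saved to the local filesystem, and copy_file_to_gcs is a hypothetical helper name): write the file locally with Keras, then stream it into the GCS job directory with TensorFlow's file_io module, which understands gs:// paths.

import os

from tensorflow.python.lib.io import file_io  # can read/write gs:// paths


def copy_file_to_gcs(job_dir, file_path):
    # Stream a locally saved checkpoint into the GCS job directory.
    with file_io.FileIO(file_path, mode='r') as input_f:
        gcs_path = os.path.join(job_dir, os.path.basename(file_path))
        with file_io.FileIO(gcs_path, mode='w+') as output_f:
            output_f.write(input_f.read())


# Example: save with Keras to local disk, then copy to the job bucket.
# model.save('checkpoint.01.hdf5')
# copy_file_to_gcs('gs://j170922census1', 'checkpoint.01.hdf5')

The extra copy step is needed precisely because model.save() itself can't target gs:// in this setup.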
