ERROR: Couldn't match files for checkpoint gs:

Posted 2019-08-10 03:21

Question:

I ran my detection model on Google Cloud ML and got this error while running the evaluation script. I found this link that mentions the issue, but it seems it still has not been solved. Does anyone know how to fix this? Any help would be appreciated. Thanks.

ERROR 2018-02-04 12:53:10 -0600 master-replica-0 Couldn't match files for checkpoint gs://obj-detection/train/model.ckpt-0

INFO 2018-02-04 12:53:10 -0600 master-replica-0 No model found in gs://obj-detection/train. Will try again in 300 seconds

INFO 2018-02-04 12:58:10 -0600 master-replica-0 Starting evaluation at 2018-02-04-18:58:10

ERROR 2018-02-04 12:58:10 -0600 master-replica-0 Couldn't match files for checkpoint gs://obj-detection/train/model.ckpt-0

INFO 2018-02-04 12:58:10 -0600 master-replica-0 No model found in gs://obj-detection/train. Will try again in 300 seconds

...

Meanwhile, the training job is logging normally, as shown below:

... after about 14 hours of running

INFO 2018-02-04 05:09:05 -0600 worker-replica-3 global step 185874: loss = 0.7012 (0.764 sec/step)

INFO 2018-02-04 05:09:05 -0600 worker-replica-4 global step 185873: loss = 0.7749 (0.797 sec/step)

INFO 2018-02-04 05:09:05 -0600 worker-replica-2 global step 185875: loss = 0.4939 (0.775 sec/step)

INFO 2018-02-04 05:09:05 -0600 master-replica-0 global step 185877: loss = 1.1430 (0.850 sec/step)

INFO 2018-02-04 05:09:05 -0600 worker-replica-1 global step 185878: loss = 0.8231 (0.777 sec/step)

INFO 2018-02-04 05:09:05 -0600 worker-replica-0 global step 185881: loss = 0.6470 (0.779 sec/step)

Answer 1:

A few things to check:

  1. Is the training code set up to actually save checkpoints? If you're using an Estimator, this generally works, assuming you run it with the standard methods (e.g., in TF >=1.4, tf.estimator.train_and_evaluate).
  2. Are you passing the correct output directory to the code that saves checkpoints? For instance, could the training code be writing checkpoints to a local (temporary?) directory instead of GCS? Could it be saving them to a different directory on GCS? A quick scan of the code plus some well-placed print/logging statements is useful here.
  3. How frequently does the training code save checkpoints? E.g., if it saves only every 10 minutes and the evaluator retries every 300 seconds (as in your log), you would expect about 1-2 "No model found" messages for every successful evaluation.
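For point 2, one way to confirm what is actually in the checkpoint directory is to list it (for example with `gsutil ls gs://obj-detection/train`) and check which steps have a usable set of files. The following is only a minimal sketch. It assumes the TensorFlow V2 checkpoint naming (`<prefix>.index` plus one or more `<prefix>.data-NNNNN-of-NNNNN` shards) and the `model.ckpt-<step>` prefix seen in the log. The function name and the sample listing are hypothetical and are not taken from the original job.

```python
import re
from collections import defaultdict

# Matches names such as ".../model.ckpt-1234.index" or
# ".../model.ckpt-1234.data-00000-of-00001" (assumed V2 checkpoint naming).
_CKPT_RE = re.compile(r"model\.ckpt-(\d+)\.(index|meta|data-\d{5}-of-\d{5})$")


def complete_checkpoint_steps(object_names):
    """Return the sorted steps that have both an index file and a data shard."""
    parts = defaultdict(set)
    for name in object_names:
        match = _CKPT_RE.search(name)
        if match is None:
            continue  # e.g. the "checkpoint" state file or event files
        step = int(match.group(1))
        suffix = match.group(2)
        kind = "data" if suffix.startswith("data-") else suffix
        parts[step].add(kind)
    return sorted(step for step, kinds in parts.items()
                  if {"index", "data"} <= kinds)


# Hypothetical listing, for illustration only.
listing = [
    "gs://obj-detection/train/checkpoint",
    "gs://obj-detection/train/model.ckpt-0.meta",
    "gs://obj-detection/train/model.ckpt-185877.index",
    "gs://obj-detection/train/model.ckpt-185877.data-00000-of-00001",
    "gs://obj-detection/train/model.ckpt-185877.meta",
]
print(complete_checkpoint_steps(listing))
```

If the step named in the error (here `model.ckpt-0`) is missing from the result, the directory's `checkpoint` state file may be pointing at a prefix whose files were never written or are stored elsewhere, which would be consistent with points 1 and 2 above.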