My goal is to test custom object-detection training on Google ML Engine, based on the pet training example from the Object Detection API.
After some successful training steps (possibly up to just before the first real checkpoint, since no checkpoint beyond model.ckpt-0 has been created) ...
15:46:56.784 global step 2257: loss = 0.7767 (1.70 sec/step)
15:46:56.821 global step 2258: loss = 1.3547 (1.13 sec/step)
... I received the following error on several object-detection training job attempts:
Error reported to Coordinator: , {"created":"@1502286418.246034567","description":"OS Error","errno":104,"file":"external/grpc/src/core/lib/iomgr/tcp_posix.c","file_line":229,"grpc_status":14,"os_error":"Connection reset by peer","syscall":"recvmsg"}
I received it on worker-replica-0, 3, and 4. Afterwards the job fails with:
Command '['python', '-m', u'object_detection.train', u'--train_dir=gs://cartrainingbucket/train', u'--pipeline_config_path=gs://cartrainingbucket/data/faster_rcnn_resnet101.config', '--job-dir', u'gs://cartrainingbucket/train']' returned non-zero exit status -9
I'm using an adaptation of faster_rcnn_resnet101.config with the following changes:
train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://cartrainingbucket/data/vehicle_train.record"
  }
  label_map_path: "gs://cartrainingbucket/data/vehicle_label_map.pbtxt"
}

eval_config: {
  num_examples: 2000
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://cartrainingbucket/data/vehicle_val.record"
  }
  label_map_path: "gs://cartrainingbucket/data/vehicle_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
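For reference, this is roughly how I can check locally that the adapted config still parses, before uploading it; a minimal sketch, assuming a local copy of the config file next to the script (the local filename is just a placeholder):
# Sketch: parse the adapted config into the TrainEvalPipelineConfig proto
# that the object_detection trainer loads, and print one field from it.
import tensorflow as tf
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
with tf.gfile.GFile('faster_rcnn_resnet101.config', 'r') as f:  # local copy, placeholder path
    text_format.Merge(f.read(), pipeline_config)
print(pipeline_config.eval_config.num_examples)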
My bucket looks like this:
cartrainingbucket (Regional US-CENTRAL1)
--data/
    --faster_rcnn_resnet101.config
    --vehicle_label_map.pbtxt
    --vehicle_train.record
    --vehicle_val.record
--train/
    --checkpoint
    --events.out.tfevents.1502259105.master-556a4f538e-0-tmt52
    --events.out.tfevents.1502264231.master-d3b4c71824-0-2733w
    --events.out.tfevents.1502267118.master-7f8d859ac5-0-r5h8s
    --events.out.tfevents.1502282824.master-acb4b4f78d-0-9d1mw
    --events.out.tfevents.1502285815.master-1ef3af1094-0-lh9dx
    --graph.pbtxt
    --model.ckpt-0.data-00000-of-00001
    --model.ckpt-0.index
    --model.ckpt-0.meta
--packages/
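In case it matters, the gs:// paths referenced in the config can also be probed from my local machine with tf.gfile; a minimal sketch, assuming the local TensorFlow build has GCS filesystem support and that application-default credentials are configured:
# Sketch: check that the files referenced in the pipeline config are
# visible under gs://cartrainingbucket/data/.
import tensorflow as tf

paths = [
    'gs://cartrainingbucket/data/faster_rcnn_resnet101.config',
    'gs://cartrainingbucket/data/vehicle_label_map.pbtxt',
    'gs://cartrainingbucket/data/vehicle_train.record',
    'gs://cartrainingbucket/data/vehicle_val.record',
]
for path in paths:
    print(path, tf.gfile.Exists(path))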
I run the job using the following command (from a Windows cmd shell, where ^ is the line-continuation character, i.e. the equivalent of \):
gcloud ml-engine jobs submit training stefan_object_detection_09_08_2017i ^
--job-dir=gs://cartrainingbucket/train ^
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz ^
--module-name object_detection.train ^
--region us-central1 ^
--config object_detection/samples/cloud/cloud.yml ^
-- ^
--train_dir=gs://cartrainingbucket/train ^
--pipeline_config_path=gs://cartrainingbucket/data/faster_rcnn_resnet101.config
The cloud.yml is the default one:
trainingInput:
  runtimeVersion: "1.0"  # I also tried 1.2; in that case the failure appeared earlier in training
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
I'm using the current master branch of the TensorFlow models repository (commit 36203f09dc257569be2fef3a950ddb2ac25dddeb). My locally installed TF version is 1.2 and I'm using Python 3.5.1.
My training and validation records both work locally for training.
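Independently of the error above, a quick way I can double-check the records is to read them back and count the examples; a minimal sketch, assuming local copies of the two record files:
# Sketch: count examples in the TFRecord files to confirm they are
# readable and non-empty.
import tensorflow as tf

for record_file in ['vehicle_train.record', 'vehicle_val.record']:
    num_examples = sum(1 for _ in tf.python_io.tf_record_iterator(record_file))
    print(record_file, num_examples, 'examples')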
As a newbie, I find it hard to pin down the source of the problem. I'd be grateful for any advice.