"Connection reset by peer on adapted standard ML-E

Published 2019-08-26 06:10

My goal is to test a custom object-detection training using the Google ML-Engine based on the pet-training example from the Object Detection API.

After a number of successful training steps (possibly without reaching the first checkpoint, since no checkpoint had been created) ...

15:46:56.784 global step 2257: loss = 0.7767 (1.70 sec/step)

15:46:56.821 global step 2258: loss = 1.3547 (1.13 sec/step)

... I received the following error on several object detection training job attempts:

Error reported to Coordinator: , {"created":"@1502286418.246034567","description":"OS Error","errno":104,"file":"external/grpc/src/core/lib/iomgr/tcp_posix.c","file_line":229,"grpc_status":14,"os_error":"Connection reset by peer","syscall":"recvmsg"}

I received it on worker-replica 0, 3, and 4. Afterwards the job fails:

Command '['python', '-m', u'object_detection.train', u'--train_dir=gs://cartrainingbucket/train', u'--pipeline_config_path=gs://cartrainingbucket/data/faster_rcnn_resnet101.config', '--job-dir', u'gs://cartrainingbucket/train']' returned non-zero exit status -9

I'm using an adaptation of the faster_rcnn_resnet101.config, with the following changes:

train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://cartrainingbucket/data/vehicle_train.record"
  }
  label_map_path: "gs://cartrainingbucket/data/vehicle_label_map.pbtxt"
}

eval_config: {
  num_examples: 2000
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://cartrainingbucket/data/vehicle_val.record"
  }
  label_map_path: "gs://cartrainingbucket/data/vehicle_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

My bucket looks like this:

cartrainingbucket (Regional US-CENTRAL1)
--data/
  --faster_rcnn_resnet101.config
  --vehicle_label_map.pbtxt
  --vehicle_train.record
  --vehicle_val.record
--train/ 
  --checkpoint
  --events.out.tfevents.1502259105.master-556a4f538e-0-tmt52
  --events.out.tfevents.1502264231.master-d3b4c71824-0-2733w
  --events.out.tfevents.1502267118.master-7f8d859ac5-0-r5h8s
  --events.out.tfevents.1502282824.master-acb4b4f78d-0-9d1mw
  --events.out.tfevents.1502285815.master-1ef3af1094-0-lh9dx
  --graph.pbtxt
  --model.ckpt-0.data-00000-of-00001
  --model.ckpt-0.index
  --model.ckpt-0.meta
  --packages/

I run the job using the following command from a Windows cmd prompt (^ is the cmd line-continuation character, the equivalent of \ in a Unix shell):

gcloud ml-engine jobs submit training stefan_object_detection_09_08_2017i ^
--job-dir=gs://cartrainingbucket/train ^
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz ^
--module-name object_detection.train ^
--region us-central1 ^
--config object_detection/samples/cloud/cloud.yml ^
-- ^
--train_dir=gs://cartrainingbucket/train ^
--pipeline_config_path=gs://cartrainingbucket/data/faster_rcnn_resnet101.config

The cloud.yml is the default one:

trainingInput:
  runtimeVersion: "1.0" # i also tried 1.2, in this case the failure appeared earlier in training
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

I'm using the latest TensorFlow models master branch (commit 36203f09dc257569be2fef3a950ddb2ac25dddeb). My locally installed TF version is 1.2 and I'm using Python 3.5.1.

My training and validation records both work locally for training.
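
For reference, the local runs invoke the same training module with local paths, roughly like this (the paths here are just placeholders):

python -m object_detection.train ^
--logtostderr ^
--train_dir=C:\local\train ^
--pipeline_config_path=C:\local\faster_rcnn_resnet101.config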

As a newbie, I find it hard to pin down the source of the problem. I'd be grateful for any advice.

2 Answers

神经病院院长 · 2019-08-26 06:26

Update: The job failed because it ran out of memory. Please try using a larger machine instead.

In addition to rhaertel80's answer, it would also be helpful if you could share the project number and job id with us via cloudml-feedback@google.com.
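
To double-check the failure reason on your side, you can also inspect the job state and logs from the command line, for example:

gcloud ml-engine jobs describe stefan_object_detection_09_08_2017i
gcloud ml-engine jobs stream-logs stefan_object_detection_09_08_2017i

The describe output should include the final error message of the failed job, and the streamed logs show the per-replica output where an out-of-memory kill would be visible.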

放荡不羁爱自由 · 2019-08-26 06:28

One possibility is that the TF processes are using too many resources (usually memory) and being killed by the OS. That would explain the connection reset by peer, and it fits the exit status -9, which indicates the process was terminated with SIGKILL (e.g. by the kernel's OOM killer).

So one thing to try would be a tier with machines that have more resources, as sketched below.
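
A rough sketch of what that could look like in the cloud.yml, using the larger legacy Cloud ML Engine machine types (complex_model_m_gpu workers and high-memory large_model parameter servers; adjust counts and types to your quota):

trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu     # more RAM (and GPUs) than standard_gpu
  workerCount: 5
  workerType: complex_model_m_gpu
  parameterServerCount: 3
  parameterServerType: large_model    # high-memory machine for the parameter servers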
