I'm trying to train with gcloud ml-engine jobs submit training, and the job gets stuck with the following output in the logs:
My config.yaml:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  parameterServerType: large_model
  workerCount: 1
  parameterServerCount: 1
Any hints on what "grpc epoll fd: 3" means and how to fix it? My input function is feeding a 16 GB TFRecord from gs://, but with batch size = 4 and shuffle buffer_size = 4. Each input sample is a single-channel 99 x 161 px image, shape (15939,), so not huge. A rough sketch of that kind of input function is below.
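For reference, a minimal sketch of the kind of input pipeline described above, assuming a TF 1.x tf.data pipeline; the feature names, TFRecord feature spec, and label handling are assumptions, not the actual code:

```python
import tensorflow as tf

def make_input_fn(tfrecord_path, batch_size=4, shuffle_buffer=4):
    """Builds an Estimator input_fn over a TFRecord file (e.g. a gs:// path)."""

    def _parse(example_proto):
        # Assumed feature spec: flattened 99 x 161 single-channel image + integer label.
        features = {
            "image": tf.FixedLenFeature([15939], tf.float32),
            "label": tf.FixedLenFeature([], tf.int64),
        }
        parsed = tf.parse_single_example(example_proto, features)
        return {"image": parsed["image"]}, parsed["label"]

    def input_fn():
        dataset = tf.data.TFRecordDataset(tfrecord_path)
        dataset = dataset.shuffle(shuffle_buffer)   # small shuffle buffer, as described
        dataset = dataset.map(_parse)
        dataset = dataset.batch(batch_size)         # batch size 4, as described
        dataset = dataset.repeat()
        return dataset

    return input_fn
```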
Thanks
Maybe this is a bug in the Estimator implementation, I'm not sure. The workaround for now is to use tf.estimator.train_and_evaluate, as suggested by @guoqing-xu.

Working sample
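The working sample linked above isn't reproduced here; the following is only a rough, generic sketch of the tf.estimator.train_and_evaluate pattern it presumably follows. The model_fn, bucket paths, step counts, and the make_input_fn helper (from the sketch in the question) are all placeholders:

```python
import tensorflow as tf

def model_fn(features, labels, mode):
    # Placeholder architecture: a single dense layer over the flattened image.
    logits = tf.layers.dense(features["image"], units=2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.AdamOptimizer().minimize(
            loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode, loss=loss)

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir="gs://my-bucket/model_dir",  # placeholder path
)

train_spec = tf.estimator.TrainSpec(
    input_fn=make_input_fn("gs://my-bucket/train.tfrecord"),  # placeholder path
    max_steps=100000,
)
eval_spec = tf.estimator.EvalSpec(
    input_fn=make_input_fn("gs://my-bucket/eval.tfrecord"),   # placeholder path
    steps=100,
)

# On Cloud ML Engine, train_and_evaluate reads the TF_CONFIG environment variable
# set by the service to coordinate the master, workers, and parameter servers.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```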