I'm trying to train with gcloud ml-engine jobs submit training, and the job gets stuck with the following output in the logs:
My config.yaml:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  parameterServerType: large_model
  workerCount: 1
  parameterServerCount: 1
Any hints on what "grpc epoll fd: 3" means and how to fix it? My input function is feeding a 16 GB TFRecord from gs://, but with batch size = 4 and shuffle buffer_size = 4. Each input sample is a single-channel 99 x 161 px image, shape (15939,), so not huge. A rough sketch of that kind of input function is below.
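For reference, a minimal sketch of the kind of input pipeline described above, assuming a TF 1.x tf.data pipeline; the feature names, TFRecord feature spec, and label handling are assumptions, not the actual code:

```python
import tensorflow as tf

def make_input_fn(tfrecord_path, batch_size=4, shuffle_buffer=4):
    """Builds an Estimator input_fn over a TFRecord file (e.g. a gs:// path)."""

    def _parse(example_proto):
        # Assumed feature spec: flattened 99 x 161 single-channel image + integer label.
        features = {
            "image": tf.FixedLenFeature([15939], tf.float32),
            "label": tf.FixedLenFeature([], tf.int64),
        }
        parsed = tf.parse_single_example(example_proto, features)
        return {"image": parsed["image"]}, parsed["label"]

    def input_fn():
        dataset = tf.data.TFRecordDataset(tfrecord_path)
        dataset = dataset.shuffle(shuffle_buffer)   # small shuffle buffer, as described
        dataset = dataset.map(_parse)
        dataset = dataset.batch(batch_size)         # batch size 4, as described
        dataset = dataset.repeat()
        return dataset

    return input_fn
```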
Thanks
Maybe this is a bug in the Estimator implementation, I'm not sure. The workaround for now is to use tf.estimator.train_and_evaluate, as suggested by @guoqing-xu.

Working sample
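The working sample linked above isn't reproduced here; the following is only a rough, generic sketch of the tf.estimator.train_and_evaluate pattern it presumably follows. The model_fn, bucket paths, step counts, and the make_input_fn helper (from the sketch in the question) are all placeholders:

```python
import tensorflow as tf

def model_fn(features, labels, mode):
    # Placeholder architecture: a single dense layer over the flattened image.
    logits = tf.layers.dense(features["image"], units=2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.AdamOptimizer().minimize(
            loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode, loss=loss)

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir="gs://my-bucket/model_dir",  # placeholder path
)

train_spec = tf.estimator.TrainSpec(
    input_fn=make_input_fn("gs://my-bucket/train.tfrecord"),  # placeholder path
    max_steps=100000,
)
eval_spec = tf.estimator.EvalSpec(
    input_fn=make_input_fn("gs://my-bucket/eval.tfrecord"),   # placeholder path
    steps=100,
)

# On Cloud ML Engine, train_and_evaluate reads the TF_CONFIG environment variable
# set by the service to coordinate the master, workers, and parameter servers.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```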