Distributed TensorFlow: Check failed: size >= 0

Posted 2019-05-08 04:46

Question:

I'm using Keras 2.0.6 with TensorFlow 1.3.0.

My code runs with the Theano backend, but fails with the TensorFlow backend:

F tensorflow/core/framework/tensor_shape.cc:241] Check failed: size >= 0 (-14428307456 vs. 0)

Can anyone think of a possible cause for this?

Thank you!

----UPDATE-----

I tested exactly the same code on my PC with TensorFlow. It runs perfectly.

However, it throws this error when I run it on a supercomputer.

Although the error looks like an overflow, I don't see how it could overflow on a supercomputer but not on my PC.

I suspect it comes from a bug in TensorFlow's distributed computation.

Answer 1:

I hit the same error, but it ran fine after I reduced the batch size.

I think the cause is that it was running out of GPU memory.
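
To get a rough feel for why a smaller batch size helps, here is a minimal sketch that estimates how much memory a single dense float32 tensor needs at different batch sizes. The helper `tensor_bytes` and the per-sample shape `(224, 224, 64)` are made up for illustration and are not taken from the original code; real usage also includes weights, gradients, and other activations, so treat the numbers as a lower bound only.

```python
from math import prod  # requires Python 3.8+


def tensor_bytes(shape, bytes_per_element=4):
    """Estimate the memory in bytes for one dense tensor (float32 by default)."""
    # A negative dimension is invalid; TensorFlow aborts on it with the
    # "Check failed: size >= 0" message, so reject it early here.
    if any(dim < 0 for dim in shape):
        raise ValueError(f"negative dimension in shape {shape}")
    return prod(shape) * bytes_per_element


# Hypothetical per-sample activation shape (an assumption, not from the question).
sample_shape = (224, 224, 64)

for batch_size in (256, 128, 64, 32):
    size = tensor_bytes((batch_size,) + sample_shape)
    print(f"batch_size={batch_size:4d}: {size / 2**30:.2f} GiB")
```

The estimate scales linearly with the batch size, so halving the batch halves the footprint of every batch-shaped tensor. If the numbers are close to the memory available on the GPUs of the cluster, reducing `batch_size` in the Keras `fit` call is a cheap first thing to try.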