While running a Kubeflow pipeline with code that uses TensorFlow 2.0, the error below is displayed at the end of each epoch:
W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
Also, after some epochs, the logs stop appearing and this error is shown:
This step is in Failed state with this message: The node was low on resource: memory. Container main was using 100213872Ki, which exceeds its request of 0. Container wait was using 25056Ki, which exceeds its request of 0.
In my case, I didn't match the batch_size and steps_per_epoch.
For example, the batch_size in datagen.flow must correspond to the steps_per_epoch in Test_model.fit_generator (actually, I had used the wrong value for steps_per_epoch). This is one of the cases that causes the error, I guess.
As a result, I think the problem arises when the batch size and the number of steps (iterations) do not correspond.
Floats may also be a problem when you get the number of steps by dividing...
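To illustrate the point about dividing, here is a minimal sketch of one common way to derive an integer steps_per_epoch from the dataset size and batch size. The helper name compute_steps_per_epoch and the values of num_samples and batch_size are hypothetical and only for illustration; this is not necessarily the fix for every occurrence of the warning.

```python
import math


def compute_steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Return the number of batches needed to cover num_samples once."""
    if num_samples <= 0 or batch_size <= 0:
        raise ValueError("num_samples and batch_size must be positive")
    # Plain division gives a float (1000 / 32 == 31.25); round up to an int
    # so the last, smaller batch is counted as one step.
    return math.ceil(num_samples / batch_size)


# Hypothetical values for illustration only.
num_samples = 1000
batch_size = 32
steps = compute_steps_per_epoch(num_samples, batch_size)
print(steps)  # 32
```

Assuming a setup like the one described above, the same batch_size would be passed to datagen.flow and the resulting steps value to the steps_per_epoch argument of fit_generator, so that the two stay consistent.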
Check your code about this issue.
Good luck :)
In my case, I installed tf-nightly and now it's working, though I am new to TensorFlow. I followed this link.
You can try it.
I have the same problem. People claimed that the warning is superfluous and that it has been removed in tf-nightly, see here. But the memory leak is still there for each epoch.
This was due to incompatible CUDA and TensorFlow versions. The versions below work well with each other.
Upgrading tensorflow from 2.1 to 2.2 fixed this issue for me. I didn't have to go to the tf-nightly version.