While running a Kubeflow pipeline whose code uses TensorFlow 2.0, the error below is displayed at the end of each epoch:
W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
Also, after some epochs, it stops showing logs and shows this error instead:
This step is in Failed state with this message: The node was low on resource: memory. Container main was using 100213872Ki, which exceeds its request of 0. Container wait was using 25056Ki, which exceeds its request of 0.
In my case, I didn't match the batch_size and steps_per_epoch.
For example,
his = Test_model.fit_generator(datagen.flow(trainrancrop_images, trainrancrop_labels, batch_size=batchsize), steps_per_epoch=len(trainrancrop_images)/batchsize, validation_data=(test_images, test_labels), epochs=1, callbacks=[callback])
The batch_size in datagen.flow must correspond to the steps_per_epoch in Test_model.fit_generator (actually, I used the wrong value for steps_per_epoch).
This is one possible cause of the error, I guess.
So I think the problem arises when the batch size and the number of steps (iterations) don't correspond.
Maybe floats are a problem when you get the number of steps by dividing...
Check your code for this issue.
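To make the point about floats concrete, here is a minimal sketch of computing the step count as an integer. The helper name steps_per_epoch is my own, not a Keras function, and it assumes the generator also yields the final partial batch, which I believe is how datagen.flow behaves:

```python
def steps_per_epoch(num_samples, batch_size):
    """Return the number of batches needed to cover num_samples once.

    Hypothetical helper (not part of Keras). Uses integer ceiling division
    so the result is always an int, never a float like 31.25.
    """
    if num_samples <= 0 or batch_size <= 0:
        raise ValueError("num_samples and batch_size must be positive")
    # Round up so the last, smaller batch is counted as one step.
    return (num_samples + batch_size - 1) // batch_size


if __name__ == "__main__":
    # 1000 samples in batches of 32: 31 full batches plus one partial batch.
    print(steps_per_epoch(1000, 32))
```

In the call above you would then pass steps_per_epoch=steps_per_epoch(len(trainrancrop_images), batchsize) instead of the plain division.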
Good luck :)
This was due to incompatible CUDA and TensorFlow versions.
The versions below work well with each other:
tensorflow-gpu==2.0.0
tensorflow-addons==0.6.0
nvidia/cuda:10.0-cudnn7-runtime
In my case, I installed tf-nightly.
Now it's working, though I am new to TensorFlow. I followed this link.
You can try it.
I have the same problem. People claimed that the warning is superfluous and that it has been removed in tf-nightly, see here. But the memory leak is still there for each epoch.
Upgrading tensorflow from 2.1 to 2.2 fixed this issue for me. I didn't have to go to the tf-nightly version.
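If you are not sure which version your pipeline image actually runs, a quick check like the one below may help. It is only a sketch: the helper names major_minor and is_at_least are made up for this example, and the 2.2 threshold is just the version that worked for me, not an official fix version.

```python
import re


def major_minor(version):
    """Extract (major, minor) from a string such as '2.2.0' or '2.3.0.dev20200515'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_at_least(version, minimum=(2, 2)):
    """Return True if version is at or above minimum (hypothetical helper)."""
    return major_minor(version) >= minimum


if __name__ == "__main__":
    try:
        import tensorflow as tf
        # tf.__version__ is the installed TensorFlow version string.
        print(tf.__version__, is_at_least(tf.__version__))
    except ImportError:
        print("tensorflow is not installed in this environment")
```

Running this inside the container of the failing step tells you whether the image really picked up the upgraded version.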