I am working on a project using TensorFlow. However, I can't seem to reproduce my results.
I have tried setting the graph-level seed, the numpy random seed and even operation-level seeds. However, it is still not reproducible.
From searching Google, most people point to the reduce_sum function as the culprit, since it is non-deterministic on the GPU even after setting the seeds. However, since I am working on a project for a paper, I need to reproduce the results. Is there any other efficient function that can work around this?
Another suggestion was to use the CPU. However, I'm working on big data, so the CPU is not an option. How do people working on complex projects using TensorFlow work around this issue? Or is it acceptable to reviewers to load the saved model checkpoint file for result verification?
Cool that you want to make your results reproducible! However, there are many things to note here:
I call a paper reproducible if one can obtain exactly the same numbers as found in the paper by executing exactly the same steps. This means that if one had access to the same environment, the same software, hardware and data, one would be able to get the same results. In contrast, a paper is called replicable if one can achieve the same results by following only the textual description in the paper. Hence replicability is harder to achieve, but also a more powerful indicator of the quality of the paper.
You want training to result in a bit-wise identical model. The holy grail would be to write your paper in such a way that people who ONLY have the paper can still confirm your results.
Please also note that in many important papers results are practically impossible to reproduce:
Whether that is a problem depends very much on the context. As a comparison, think of CERN / LHC: it is impossible to have completely identical experiments, and only very few institutions on earth have the instruments to check the results. Still, it is not a problem. So ask your advisor or people who have already published in that journal / conference.
Achieving Replicability
This is super hard. I think the following is helpful:
Getting a Bit-Wise Identical Model
It seems to me that you already do the important things:
Seeds: numpy, tensorflow, random, ...
Please note that there might be factors out of your control:
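One factor of this kind, and probably what is behind the reduce_sum reports mentioned in the question, is that floating-point addition is not associative. A parallel reduction on the GPU may add the same numbers in a different order from run to run, and the rounding then differs. The following is only a minimal sketch in plain Python, not TensorFlow code; the helper name `sequential_sum` and the input values are made up for illustration.

```python
def sequential_sum(values):
    """Add the values strictly left to right, one at a time."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same three numbers, summed in two different orders.
values = [1e16, 1.0, -1e16]
reordered = [1e16, -1e16, 1.0]

# 1e16 + 1.0 rounds back to 1e16, so the 1.0 is lost before the cancellation.
left_to_right = sequential_sum(values)

# Here the large terms cancel first, so the 1.0 survives.
other_order = sequential_sum(reordered)

print(left_to_right, other_order)  # 0.0 1.0
```

No seed can remove this effect, because it comes from the order of the additions and not from a random number generator.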