I have multiple 4 GB GPU nodes and I want to run a huge model across them in parallel. I hoped that just splitting the layers into several pieces with appropriate device scopes would enable model parallelism, but it turns out this doesn't reduce the memory footprint of the master node (task 0). (10-node configuration: master 20 GB, followers 2 GB each; 1-node configuration: master 6~7 GB.)
My suspicion is that the gradients are not being distributed because I didn't set up the right device scopes for them.
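Roughly, the split I mean looks like this (a simplified sketch, not the actual wavenet code; the device strings and layer shapes are placeholders for my cluster):

```python
import tensorflow as tf

NUM_TASKS = 10

def build_split_model(x, layer_sizes):
    """Pin each layer to a different task so its weights live on that node."""
    h = x
    for i, size in enumerate(layer_sizes):
        # Round-robin the layers over the tasks in the cluster.
        with tf.device("/job:worker/task:%d/gpu:0" % (i % NUM_TASKS)):
            w = tf.get_variable("w_%d" % i, [int(h.get_shape()[-1]), size])
            b = tf.get_variable("b_%d" % i, [size],
                                initializer=tf.constant_initializer(0.0))
            h = tf.nn.relu(tf.matmul(h, w) + b)
    return h
```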
My model is available on GitHub: https://github.com/nakosung/tensorflow-wavenet/tree/model_parallel_2
The device placement log is here: https://gist.github.com/nakosung/a38d4610fff09992f7e5569f19eefa57
So the good news is that you are using colocate_gradients_with_ops, which ensures that each gradient is computed on the same device as the op it belongs to. (https://github.com/nakosung/tensorflow-wavenet/blob/model_parallel_2/train.py#L242)
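For anyone else reading, the relevant call looks roughly like this (a minimal sketch with made-up ops, not the wavenet training code):

```python
import tensorflow as tf

# Two layers deliberately placed on different devices.
with tf.device("/gpu:0"):
    w0 = tf.get_variable("w0", [128, 128])
    h = tf.matmul(tf.random_normal([32, 128]), w0)
with tf.device("/gpu:1"):
    w1 = tf.get_variable("w1", [128, 1])
    loss = tf.reduce_mean(tf.matmul(h, w1))

opt = tf.train.AdamOptimizer(1e-3)
# With colocate_gradients_with_ops=True, the gradient for w0 is computed
# on /gpu:0 and the gradient for w1 on /gpu:1, instead of all gradients
# defaulting to one device.
train_op = opt.apply_gradients(
    opt.compute_gradients(loss, colocate_gradients_with_ops=True))
```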
Reading the device placement log is a little difficult, so I would suggest using TensorBoard to visualize the graph; it has an option to color nodes by the device they are placed on.
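For example (a minimal sketch; the exact writer API depends on your TensorFlow version, and the log directory is just a placeholder):

```python
import tensorflow as tf

# After building the graph, write it out so TensorBoard can render it.
# In the graph view, switch the coloring to "Device" to see placement.
with tf.Session() as sess:
    writer = tf.summary.FileWriter("/tmp/model_parallel_logs", sess.graph)
    writer.close()

# Then run: tensorboard --logdir /tmp/model_parallel_logs
```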
Secondly, look at how the sizes of your operations map onto devices: it is possible that the largest layers (largest activations, or largest weights) are disproportionately placed on some nodes rather than others. You might use https://github.com/tensorflow/tensorflow/blob/6b1d4fd8090d44d20fdadabf06f1a9b178c3d80c/tensorflow/python/tools/graph_metrics.py to analyze your graph and get a better picture of where resources are required.
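As a quick sanity check, you could also tally trainable-parameter counts per assigned device. This rough sketch only covers weights, not activations, so graph_metrics.py will still give you a fuller picture:

```python
import collections
import tensorflow as tf

def trainable_params_per_device():
    """Count trainable parameters grouped by the device they were placed on."""
    counts = collections.defaultdict(int)
    for v in tf.trainable_variables():
        n = 1
        for dim in v.get_shape().as_list():
            n *= dim
        counts[v.device or "<unplaced>"] += n
    return dict(counts)

# After building the graph:
# for device, n in sorted(trainable_params_per_device().items()):
#     print(device, n)
```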
Longer term we'd like to try to solve some of these placement problems automatically, but so far model parallelism requires a bit of care to place things precisely.