I have a server with multiple GPUs and want to make full use of them during model inference inside a Java app. By default, TensorFlow seizes all available GPUs but uses only the first one.
I can think of three options to overcome this issue:
1. Restrict device visibility at the process level, namely using the `CUDA_VISIBLE_DEVICES` environment variable. That would require me to run several instances of the Java app and distribute traffic among them. Not a tempting idea.
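For illustration, a minimal launcher sketch of this approach, assuming a hypothetical `PredictionServer` main class and one port per process (both invented for the example):

```java
import java.io.IOException;

public class PerGpuLauncher {
    public static void main(String[] args) throws IOException {
        int numDevices = 4; // assumption: 4 GPUs on the host
        for (int i = 0; i < numDevices; i++) {
            ProcessBuilder pb = new ProcessBuilder(
                    "java", "-cp", System.getProperty("java.class.path"),
                    "PredictionServer", "--port", String.valueOf(8500 + i));
            // Each child JVM sees exactly one GPU, which TensorFlow then
            // addresses as /gpu:0 inside that process.
            pb.environment().put("CUDA_VISIBLE_DEVICES", String.valueOf(i));
            pb.inheritIO().start();
        }
        // Some external load balancer still has to spread requests over
        // ports 8500..8503 — which is exactly the unappealing part.
    }
}
```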
2. Launch several sessions inside a single application and try to assign one device to each of them via `ConfigProto`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.framework.ConfigProto;
import org.tensorflow.framework.GPUOptions;

public class DistributedPredictor {

    private Predictor[] nested;
    private int[] counters;

    // ...

    public DistributedPredictor(String modelPath, int numDevices, int numThreadsPerDevice)
            throws IOException {
        nested = new Predictor[numDevices];
        counters = new int[numDevices];

        for (int i = 0; i < nested.length; i++) {
            nested[i] = new Predictor(modelPath, i, numDevices, numThreadsPerDevice);
        }
    }

    public Prediction predict(Data data) {
        int i = acquirePredictorIndex();
        try {
            return nested[i].predict(data);
        } finally {
            releasePredictorIndex(i); // release the slot even if predict() throws
        }
    }

    // Route the call to the currently least-loaded device.
    private synchronized int acquirePredictorIndex() {
        int i = argmin(counters);
        counters[i] += 1;
        return i;
    }

    private synchronized void releasePredictorIndex(int i) {
        counters[i] -= 1;
    }

    private static int argmin(int[] xs) {
        int best = 0;
        for (int i = 1; i < xs.length; i++) {
            if (xs[i] < xs[best]) best = i;
        }
        return best;
    }
}
```

```java
public class Predictor {

    private Session session;

    public Predictor(String modelPath, int deviceIdx, int numDevices, int numThreadsPerDevice)
            throws IOException {

        GPUOptions gpuOptions = GPUOptions.newBuilder()
                .setVisibleDeviceList("" + deviceIdx)
                .setAllowGrowth(true)
                .build();

        ConfigProto config = ConfigProto.newBuilder()
                .setGpuOptions(gpuOptions)
                .setInterOpParallelismThreads(numDevices * numThreadsPerDevice)
                .build();

        byte[] graphDef = Files.readAllBytes(Paths.get(modelPath));
        Graph graph = new Graph();
        graph.importGraphDef(graphDef);

        this.session = new Session(graph, config.toByteArray());
    }

    public Prediction predict(Data data) {
        // ...
    }
}
```
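For context, the intended use is a single shared dispatcher called from many request-handling threads (the model path and sizes below are placeholders; `Data` and `Prediction` are the app's own types):

```java
// Hypothetical wiring: 2 GPUs, 4 threads per device.
DistributedPredictor predictor =
        new DistributedPredictor("/path/to/graph.pb", 2, 4);

// Safe to call concurrently; the synchronized counters route each
// call to the least-loaded Predictor.
Prediction result = predictor.predict(data);
```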
This approach seems to work fine at a glance. However, sessions occasionally ignore the `setVisibleDeviceList` option and all go for the first device, causing an out-of-memory crash.

3. Build the model in a multi-tower fashion in Python using the `tf.device()` specification. On the Java side, give different `Predictor`s different towers inside a shared session. Feels cumbersome and idiomatically wrong to me.
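To make option 3 concrete, the Java side might look roughly like the sketch below. The tower naming scheme (`tower_0/input`, `tower_0/output`) is purely illustrative; it would have to match whatever names the Python export actually gives the towers:

```java
import org.tensorflow.Session;
import org.tensorflow.Tensor;

// One TowerPredictor per device, all sharing a single Session over the
// multi-tower graph.
public class TowerPredictor {

    private final Session session;
    private final String inputName;
    private final String outputName;

    public TowerPredictor(Session sharedSession, int deviceIdx) {
        this.session = sharedSession;
        this.inputName = String.format("tower_%d/input", deviceIdx);   // assumed name
        this.outputName = String.format("tower_%d/output", deviceIdx); // assumed name
    }

    public Tensor predict(Tensor input) {
        return session.runner()
                .feed(inputName, input)
                .fetch(outputName)
                .run()
                .get(0);
    }
}
```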
**UPDATE:** As @ash proposed, there's yet another option:

4. Assign an appropriate device to each operation of the existing graph by modifying its definition (the `graphDef`). To get it done, one could adapt the code from Method 2:
```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import com.google.protobuf.InvalidProtocolBufferException;

import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.framework.ConfigProto;
import org.tensorflow.framework.GraphDef;

public class Predictor {

    private Session session;

    public Predictor(String modelPath, int deviceIdx, int numDevices, int numThreadsPerDevice)
            throws IOException {

        byte[] graphDef = Files.readAllBytes(Paths.get(modelPath));
        graphDef = setGraphDefDevice(graphDef, deviceIdx);

        Graph graph = new Graph();
        graph.importGraphDef(graphDef);

        ConfigProto config = ConfigProto.newBuilder()
                .setAllowSoftPlacement(true)
                .build();

        this.session = new Session(graph, config.toByteArray());
    }

    // Rewrite the device field of every node in the serialized GraphDef.
    private static byte[] setGraphDefDevice(byte[] graphDef, int deviceIdx)
            throws InvalidProtocolBufferException {
        String deviceString = String.format("/gpu:%d", deviceIdx);

        GraphDef.Builder builder = GraphDef.parseFrom(graphDef).toBuilder();
        for (int i = 0; i < builder.getNodeCount(); i++) {
            builder.getNodeBuilder(i).setDevice(deviceString);
        }

        return builder.build().toByteArray();
    }

    public Prediction predict(Data data) {
        // ...
    }
}
```
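Note that `setAllowSoftPlacement(true)` is doing real work here: with every node hard-pinned to `/gpu:N`, soft placement lets TensorFlow fall back to the CPU for ops that have no GPU kernel instead of failing outright. To double-check that the rewrite actually took effect, the `ConfigProto` can also be built with device-placement logging enabled (a sketch; `setLogDevicePlacement` is the builder method for the standard `log_device_placement` field):

```java
ConfigProto config = ConfigProto.newBuilder()
        .setAllowSoftPlacement(true)   // CPU fallback for ops without GPU kernels
        .setLogDevicePlacement(true)   // print each op's final device on startup
        .build();
```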
Just like the other approaches mentioned, this one doesn't set me free from manually distributing data among devices. But at least it works stably and is comparatively easy to implement. Overall, this looks like an (almost) normal technique.
Is there an elegant way to do such a basic thing with the TensorFlow Java API? Any ideas would be appreciated.