This is on a Windows 10 machine with no monitor attached to the Nvidia card. I've included output from nvidia-smi showing that > 5.04G was available.
Here is the TensorFlow code asking it to allocate just slightly more than I had seen work previously (I want this to be as close to memory fraction = 1.0 as possible):
config = tf.ConfigProto()
#config.gpu_options.allow_growth=True
config.gpu_options.per_process_gpu_memory_fraction=0.84
config.log_device_placement=True
sess = tf.Session(config=config)
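For reference, the 0.84 figure comes from a quick back-of-the-envelope calculation. This is just my arithmetic, assuming TF sizes its single up-front allocation as fraction * total memory (which is consistent with the byte count in the log further down):
total_bytes = 6 * 1024**3                   # 6.00GiB total, as reported by TF
fraction = 0.84
request = fraction * total_bytes            # ~5.04GiB, essentially the 5411658752-byte request in the log below
print("requested: %.2f GiB" % (request / 1024**3))
# the 5.01GiB reported as free corresponds to a fraction of roughly 5.01 / 6.00 = 0.835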
Just before running the code above in a Jupyter notebook, I ran nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 376.51                 Driver Version: 376.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  WDDM | 0000:01:00.0     Off |                  N/A |
|  0%   27C    P8     5W / 120W |     43MiB /  6144MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
The output from TF, which does go on to successfully allocate 5.01GB, shows "failed to allocate 5.04G (5411658752 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY" (you need to scroll to the right to see it below):
2017-12-17 03:53:13.959871: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7845
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 5.01GiB
2017-12-17 03:53:13.960006: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
2017-12-17 03:53:13.961152: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_driver.cc:936] failed to allocate 5.04G (5411658752 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1
2017-12-17 03:53:14.151073: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1
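Putting the two reports side by side (my own arithmetic from the numbers above, nothing new measured here):
smi_free = (6144 - 43) / 1024.0     # nvidia-smi: ~5.96GiB not in use by any process
tf_free = 5.01                      # stream_exec/TF: only 5.01GiB reported free
print("unaccounted for: %.2f GiB" % (smi_free - tf_free))   # ~0.95 GiB
So roughly 0.95GiB of VRAM is unaccounted for between the two views, which is what the theory below is trying to explain.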
My best guess is that some policy in an Nvidia user-level DLL is preventing use of all of the memory (perhaps to leave headroom in case a monitor is attached?).
If that theory is correct, I'm looking for any user-accessible knob that turns that behavior off on Windows 10. If I'm on the wrong track, any help pointing me in the right direction is appreciated.
Edit #1:
I realized I did not include this bit of research: the following code in TensorFlow indicates that stream_exec is 'telling' TensorFlow only 5.01GB is free. This is the primary reason for my current theory that some Nvidia component is preventing the allocation. (However, I could be misunderstanding which component implements the instantiated stream_exec.)
auto stream_exec = executor.ValueOrDie();
int64 free_bytes;
int64 total_bytes;
if (!stream_exec->DeviceMemoryUsage(&free_bytes, &total_bytes)) {
  // Logs internally on failure.
  free_bytes = 0;
  total_bytes = 0;
}
const auto& description = stream_exec->GetDeviceDescription();
int cc_major;
int cc_minor;
if (!description.cuda_compute_capability(&cc_major, &cc_minor)) {
  // Logs internally on failure.
  cc_major = 0;
  cc_minor = 0;
}
LOG(INFO) << "Found device " << i << " with properties: "
          << "\nname: " << description.name() << " major: " << cc_major
          << " minor: " << cc_minor
          << " memoryClockRate(GHz): " << description.clock_rate_ghz()
          << "\npciBusID: " << description.pci_bus_id() << "\ntotalMemory: "
          << strings::HumanReadableNumBytes(total_bytes)
          << " freeMemory: " << strings::HumanReadableNumBytes(free_bytes);
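As far as I can tell, stream_exec's DeviceMemoryUsage ultimately gets its numbers from the driver's cuMemGetInfo. To check whether the 5.01GiB figure originates in the driver itself rather than anywhere inside TensorFlow, here is a small probe that asks cuMemGetInfo directly. This is my own sketch, not TF code, and it assumes the 64-bit *_v2 entry points exported by nvcuda.dll:
# query the CUDA driver directly for free/total memory, outside TensorFlow
import ctypes

cuda = ctypes.WinDLL("nvcuda.dll")
dev, ctx = ctypes.c_int(), ctypes.c_void_p()
free, total = ctypes.c_size_t(), ctypes.c_size_t()

assert cuda.cuInit(0) == 0
assert cuda.cuDeviceGet(ctypes.byref(dev), 0) == 0
assert cuda.cuCtxCreate_v2(ctypes.byref(ctx), 0, dev) == 0
assert cuda.cuMemGetInfo_v2(ctypes.byref(free), ctypes.byref(total)) == 0
print("free: %.2f GiB  total: %.2f GiB"
      % (free.value / 2.0**30, total.value / 2.0**30))
cuda.cuCtxDestroy_v2(ctx)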
Edit #2:
The thread below indicates that Windows 10 pervasively prevents full use of VRAM on secondary video cards used for compute, by grabbing a percentage of the VRAM: https://social.technet.microsoft.com/Forums/windows/en-US/15b9654e-5da7-45b7-93de-e8b63faef064/windows-10-does-not-let-cuda-applications-to-use-all-vram-on-especially-secondary-graphics-cards?forum=win10itprohardware
The claim in that thread seems implausible, given it would mean every Windows 10 box is inherently worse than Windows 7 for any workload where VRAM on a compute-dedicated graphics card could plausibly be the bottleneck.
Edit #3:
Updated the title to read more clearly as a question. Feedback indicates this may be better filed as a bug with Microsoft or Nvidia, and I am pursuing other avenues to get it addressed. However, I don't want to assume it cannot be resolved directly.
Further experiments indicate that the issue I am hitting is specific to a single large allocation from one process; all of the VRAM can be used when another process comes into play.
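For concreteness, this is the kind of two-process experiment I mean, run as two separate Python processes rather than from one notebook (the 0.45 fraction is just illustrative):
# run this same snippet in two separate Python processes; each asks TF for
# ~0.45 * 6GiB up front, so together they use more VRAM than the ~5.01GiB
# ceiling I hit from a single process
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.45
sess = tf.Session(config=config)    # the up-front allocation happens here
input("holding GPU memory; press Enter to exit")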
Edit #4:
The failure here is an allocation failure, and according to the nvidia-smi output above I have 43MiB in use (perhaps by the system?), but not by any identifiable process. The failure I'm seeing is for a single monolithic allocation, which under a typical allocation model requires a contiguous address space. So the pertinent questions may be: what is causing that 43MiB to be used, and is it placed in the address space such that the 5.01GB allocation is the maximum contiguous space available?
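One way I can think of to test that directly is to binary-search the largest single driver allocation that succeeds; if it comes back at roughly 5.01GB, the limit really is the largest contiguous block rather than something TF-specific. Again a sketch of my own, assuming the *_v2 entry points in nvcuda.dll:
# probe the largest single cuMemAlloc that succeeds on device 0
import ctypes

cuda = ctypes.WinDLL("nvcuda.dll")
dev, ctx = ctypes.c_int(), ctypes.c_void_p()
assert cuda.cuInit(0) == 0
assert cuda.cuDeviceGet(ctypes.byref(dev), 0) == 0
assert cuda.cuCtxCreate_v2(ctypes.byref(ctx), 0, dev) == 0

lo, hi = 0, 6 * 2**30              # search between 0 and 6GiB
while hi - lo > (1 << 20):         # stop at 1MiB resolution
    mid = (lo + hi) // 2
    ptr = ctypes.c_void_p()
    if cuda.cuMemAlloc_v2(ctypes.byref(ptr), ctypes.c_size_t(mid)) == 0:
        cuda.cuMemFree_v2(ptr)     # mid bytes fit in one block; try larger
        lo = mid
    else:
        hi = mid                   # mid bytes did not fit; try smaller
print("largest single allocation: %.2f GiB" % (lo / 2.0**30))
cuda.cuCtxDestroy_v2(ctx)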