I am running a convolutional neural network on an AWS g2.2xlarge instance. The model runs fine with 30000 images of size 64x64. However, when I try to run it with images of size 128x128, it gives a memory error (see below), even when I input only 1 image (which has 2 channels: real and imaginary).
Because the error mentions a tensor of shape [32768,16384], I assume it happens in the first (fully-connected) layer, which takes an input image with two channels (128*128*2 = 32768 values) and outputs a vector of 128*128 = 16384 values.
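As a rough check on that shape arithmetic, the sketch below computes the layer dimensions and the raw size of one tensor of that shape. It assumes 4 bytes per element (the log reports type float, which I take to mean 32-bit), and the helper name `tensor_bytes` is made up for illustration.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Raw storage for a dense tensor; 4 bytes per element assumes float32."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


height, width, channels = 128, 128, 2
fc1_inputs = height * width * channels   # 32768 input values
fc1_outputs = height * width             # 16384 output values

size = tensor_bytes([fc1_inputs, fc1_outputs])
print(size)                   # 2147483648
print(size / float(2**30))    # 2.0 (GiB)
```

Under that assumption, a single [32768,16384] tensor comes to 2147483648 bytes, which matches the 2.00GiB allocation reported in the log.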
I found recommendations to decrease the batch size; however, I already use only 1 input image.
I have read that, using cuDNN, one can go up to images of 700-900px on the same AWS instance that I use (although I do not know whether fully-connected layers were used there). I tried two different AMIs, both with cuDNN installed, but still got the memory error.
My questions are:
1. How do I calculate how much memory is needed for a [32768,16384] tensor? I am not a computer scientist, so I would appreciate a detailed reply.
2. I guess I am trying to understand whether the instance I use really has too little memory for my data (g2.2xlarge has 15 GiB) or I am just doing something wrong.
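To frame the second question, here is a back-of-the-envelope sketch comparing the allocator limit printed in the log (`Limit: 3833069568`) with the memory needed for the fc1 weights plus optimizer state. It assumes that RMSProp keeps two extra float32 tensors of the same shape per weight matrix. The traceback shows a "momentum" slot being created; the second "rms" slot is my assumption. It ignores gradients, activations, and every other layer, so it is only a lower bound.

```python
GIB = float(2**30)

# One float32 tensor of shape [32768, 16384].
weight_bytes = 32768 * 16384 * 4

# Assumption: the weights plus two optimizer slots of the same shape.
copies = 1 + 2
needed = weight_bytes * copies

# Allocator limit copied from the "Stats" section of the log.
gpu_limit = 3833069568

print(needed / GIB)        # 6.0
print(gpu_limit / GIB)     # roughly 3.57
print(needed > gpu_limit)  # True
```

If these assumptions hold, fc1 alone needs about 6 GiB against a limit of roughly 3.57 GiB. The log reports totalMemory: 3.94GiB for the GRID K520, so the GPU memory, rather than the 15 GiB of the instance, appears to be the relevant limit.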
Error:
2018-01-24 16:36:53.666427: I
tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports
instructions that this TensorFlow binary was not compiled to use: SSE4.1
SSE4.2 AVX
2018-01-24 16:36:55.069050: I
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node
read from SysFS had negative value (-1), but there must be at least one NUMA
node, so returning NUMA node zero
2018-01-24 16:36:55.069287: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1062] Found device 0 with
properties:
name: GRID K520 major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:00:03.0
totalMemory: 3.94GiB freeMemory: 3.90GiB
2018-01-24 16:36:55.069316: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1152] Creating TensorFlow
device (/device:GPU:0) -> (device: 0, name: GRID K520, pci bus id:
0000:00:03.0, compute capability: 3.0)
2018-01-24 16:37:59.766001: W
tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran
out of memory trying to allocate 2.00GiB. Current allocation summary follows.
2018-01-24 16:37:59.766054: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (256): Total
Chunks: 10, Chunks in use: 10. 2.5KiB allocated for chunks. 2.5KiB in use in
bin. 40B client-requested in use in bin.
2018-01-24 16:37:59.766070: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (512): Total
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B
client-requested in use in bin.
2018-01-24 16:37:59.766084: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (1024): Total
Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in
bin. 1.0KiB client-requested in use in bin.
2018-01-24 16:37:59.766094: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (2048): Total
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B
client-requested in use in bin.
2018-01-24 16:37:59.766108: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (4096): Total
Chunks: 2, Chunks in use: 2. 12.5KiB allocated for chunks. 12.5KiB in use in
bin. 12.5KiB client-requested in use in bin.
2018-01-24 16:37:59.766122: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (8192): Total
Chunks: 2, Chunks in use: 2. 24.5KiB allocated for chunks. 24.5KiB in use in
bin. 24.5KiB client-requested in use in bin.
2018-01-24 16:37:59.766134: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (16384): Total
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B
client-requested in use in bin.
2018-01-24 16:37:59.766143: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (32768): Total
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B
client-requested in use in bin.
2018-01-24 16:37:59.766155: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (65536): Total
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B
client-requested in use in bin.
2018-01-24 16:37:59.766163: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (131072): Total
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B
client-requested in use in bin.
2018-01-24 16:37:59.766177: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (262144): Total
Chunks: 2, Chunks in use: 2. 800.0KiB allocated for chunks. 800.0KiB in use in
bin. 800.0KiB client-requested in use in bin.
2018-01-24 16:37:59.766196: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (524288): Total
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B
client-requested in use in bin.
2018-01-24 16:37:59.766208: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (1048576): Total
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B
client-requested in use in bin.
2018-01-24 16:37:59.766221: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (2097152): Total
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B
client-requested in use in bin.
2018-01-24 16:37:59.766230: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (4194304): Total
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B
client-requested in use in bin.
2018-01-24 16:37:59.766241: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (8388608): Total
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B
client-requested in use in bin.
2018-01-24 16:37:59.766250: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (16777216): Total
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B
client-requested in use in bin.
2018-01-24 16:37:59.766262: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (33554432): Total
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B
client-requested in use in bin.
2018-01-24 16:37:59.766271: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (67108864): Total
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B
client-requested in use in bin.
2018-01-24 16:37:59.766282: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (134217728): Total
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B
client-requested in use in bin.
2018-01-24 16:37:59.766292: I
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (268435456): Total
Chunks: 2, Chunks in use: 1. 3.57GiB allocated for chunks. 2.00GiB in use in
bin. 2.00GiB client-requested in use in bin.
2018-01-24 16:37:59.766304: I
tensorflow/core/common_runtime/bfc_allocator.cc:644] Bin for 2.00GiB was
256.00MiB, Chunk State:
2018-01-24 16:37:59.766335: I
tensorflow/core/common_runtime/bfc_allocator.cc:650] Size: 1.57GiB |
Requested Size: 0B | in_use: 0, prev: Size: 2.00GiB | Requested Size:
2.00GiB | in_use: 1
2018-01-24 16:37:59.766358: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680000 of
size 1280
2018-01-24 16:37:59.766374: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680500 of
size 256
2018-01-24 16:37:59.766381: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680600 of
size 256
2018-01-24 16:37:59.766387: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680700 of
size 256
2018-01-24 16:37:59.766397: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680800 of
size 256
2018-01-24 16:37:59.766402: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680900 of
size 256
2018-01-24 16:37:59.766412: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680a00 of
size 256
2018-01-24 16:37:59.766422: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680b00 of
size 256
2018-01-24 16:37:59.766429: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680c00 of
size 256
2018-01-24 16:37:59.766435: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680d00 of
size 256
2018-01-24 16:37:59.766459: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680e00 of
size 256
2018-01-24 16:37:59.766471: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680f00 of
size 6400
2018-01-24 16:37:59.766477: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702682800 of
size 6400
2018-01-24 16:37:59.766482: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702684100 of
size 409600
2018-01-24 16:37:59.766492: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x7026e8100 of
size 409600
2018-01-24 16:37:59.766499: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x70274c100 of
size 12544
2018-01-24 16:37:59.766509: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x70274f200 of
size 12544
2018-01-24 16:37:59.766517: I
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702752300 of
size 2147483648
2018-01-24 16:37:59.766523: I
tensorflow/core/common_runtime/bfc_allocator.cc:671] Free at 0x782752300 of
size 1684724992
2018-01-24 16:37:59.766530: I
tensorflow/core/common_runtime/bfc_allocator.cc:677] Summary of in-use
Chunks by size:
2018-01-24 16:37:59.766543: I
tensorflow/core/common_runtime/bfc_allocator.cc:680] 10 Chunks of size 256
totalling 2.5KiB
2018-01-24 16:37:59.766557: I
tensorflow/core/common_runtime/bfc_allocator.cc:680] 1 Chunks of size 1280
totalling 1.2KiB
2018-01-24 16:37:59.766569: I
tensorflow/core/common_runtime/bfc_allocator.cc:680] 2 Chunks of size 6400
totalling 12.5KiB
2018-01-24 16:37:59.766577: I
tensorflow/core/common_runtime/bfc_allocator.cc:680] 2 Chunks of size 12544
totalling 24.5KiB
2018-01-24 16:37:59.766585: I
tensorflow/core/common_runtime/bfc_allocator.cc:680] 2 Chunks of size 409600
totalling 800.0KiB
2018-01-24 16:37:59.766596: I
tensorflow/core/common_runtime/bfc_allocator.cc:680] 1 Chunks of size
2147483648 totalling 2.00GiB
2018-01-24 16:37:59.766606: I
tensorflow/core/common_runtime/bfc_allocator.cc:684] Sum Total of in-use
chunks: 2.00GiB
2018-01-24 16:37:59.766620: I
tensorflow/core/common_runtime/bfc_allocator.cc:686] Stats:
Limit: 3833069568
InUse: 2148344576
MaxInUse: 2148344576
NumAllocs: 18
MaxAllocSize: 2147483648
2018-01-24 16:37:59.766635: W
tensorflow/core/common_runtime/bfc_allocator.cc:277]
2018-01-24 16:37:59.766660: W tensorflow/core/framework/op_kernel.cc:1188]
Resource exhausted: OOM when allocating tensor of shape [32768,16384] and type
float
2018-01-24 16:38:00.828932: E tensorflow/core/common_runtime/executor.cc:651]
Executor failed to create kernel. Resource exhausted: OOM when allocating
tensor of shape [32768,16384] and type float
[[Node: fc1/weights/RMSProp_1/Initializer/zeros = Const[_class=
["loc:@fc1/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape:
[32768,16384] values: [0 0 0]...>,
_device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Traceback (most recent call last):
File "myAutomap.py", line 278, in <module>
print_cost=True)
File "myAutomap.py", line 240, in model
sess.run(init)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py",
line 889, in run
run_metadata_ptr)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py",
line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py",
line 1317, in _do_run
options, run_metadata)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py",
line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when
allocating tensor of shape [32768,16384] and type float
[[Node: fc1/weights/RMSProp_1/Initializer/zeros = Const[_class=
["loc:@fc1/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape:
[32768,16384] values: [0 0 0]...>,
_device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op u'fc1/weights/RMSProp_1/Initializer/zeros', defined at:
File "myAutomap.py", line 278, in <module>
print_cost=True)
File "myAutomap.py", line 228, in model
optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(cost)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/optimizer.py", line 365, in minimize
name=name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/optimizer.py", line 516, in
apply_gradients
self._create_slots([_get_variable_for(v) for v in var_list])
File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/rmsprop.py",
line 113, in _create_slots
self._zeros_slot(v, "momentum", self._name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/optimizer.py", line 882, in _zeros_slot
named_slots[_var_key(var)] = slot_creator.create_zeros_slot(var, op_name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/slot_creator.py", line 174, in
create_zeros_slot
colocate_with_primary=colocate_with_primary)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/slot_creator.py", line 148, in
create_slot_with_initializer
dtype)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/slot_creator.py", line 67, in
_create_slot_var
validate_shape=validate_shape)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 1256, in get_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 1097, in get_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 435, in get_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 404, in _true_getter
use_resource=use_resource, constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 806, in
_get_single_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py",
line 229, in __init__
constraint=constraint)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py",
line 323, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 780, in <lambda>
shape.as_list(), dtype=dtype, partition_info=partition_info)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py",
line 93, in __call__
return array_ops.zeros(shape, dtype)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py",
line 1509, in zeros
output = constant(zero, shape=shape, dtype=dtype, name=name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/framework/constant_op.py", line 218, in constant
name=name).outputs[0]
File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py",
line 3069, in create_op
op_def=op_def)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py",
line 1579, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-
access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor
of shape [32768,16384] and type float
[[Node: fc1/weights/RMSProp_1/Initializer/zeros = Const[_class=
["loc:@fc1/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape:
[32768,16384] values: [0 0 0]...>,
_device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Segmentation fault