Memory error with larger images when running convolutional neural network

Posted 2019-08-18 13:29

Question:

I am running a convolutional neural network on an AWS g2.2xlarge instance. The model runs fine with 30000 images of size 64x64. However, when I try to run it with images of size 128x128, it gives a memory error (see below), even when I feed in only 1 image (which has 2 channels - real and imaginary).
Because the error mentions a tensor of shape [32768,16384], I assume it happens in the first (fully-connected) layer, which takes an input image with two channels (128*128*2 = 32768 values) and outputs a 128*128 = 16384 vector. I found recommendations to decrease the batch size; however, I already use only 1 input image.
Here it is written that using cuDNN one could get up to 700-900px images on the same AWS instance type that I use (although I do not know whether they use fully-connected layers). I tried two different AMIs (1 and 2), both with cuDNN installed, but still got the memory error.
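A quick sketch of where that [32768,16384] shape comes from (plain Python, just the layer-size arithmetic):

in_features  = 128 * 128 * 2   # flattened input, two channels (real and imaginary): 32768
out_features = 128 * 128       # flattened output vector: 16384
# so the fully-connected weight matrix reported in the error has shape [32768, 16384]
print(in_features, out_features)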

My questions are:
1. How do I calculate how much memory is needed for a [32768,16384] tensor? I am not a computer scientist, so I would appreciate a detailed reply.
2. I guess I am trying to understand whether the instance I use really has too little memory for my data (g2.2xlarge has 15 GiB) or whether I am just doing something wrong.

Error:

2018-01-24 16:36:53.666427: I 
tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports 
instructions that this TensorFlow binary was not compiled to use: SSE4.1 
SSE4.2 AVX
2018-01-24 16:36:55.069050: I 
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node 
read from SysFS had negative value (-1), but there must be at least one NUMA 
node, so returning NUMA node zero
2018-01-24 16:36:55.069287: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1062] Found device 0 with 
properties: 
name: GRID K520 major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:00:03.0
totalMemory: 3.94GiB freeMemory: 3.90GiB
2018-01-24 16:36:55.069316: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1152] Creating TensorFlow 
device (/device:GPU:0) -> (device: 0, name: GRID K520, pci bus id: 
0000:00:03.0, compute capability: 3.0)
2018-01-24 16:37:59.766001: W 
tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran 
out of memory trying to allocate 2.00GiB.  Current allocation summary follows.
2018-01-24 16:37:59.766054: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (256):     Total 
Chunks: 10, Chunks in use: 10. 2.5KiB allocated for chunks. 2.5KiB in use in 
bin. 40B client-requested in use in bin.
2018-01-24 16:37:59.766070: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (512):     Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766084: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (1024):    Total 
Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in 
bin. 1.0KiB client-requested in use in bin.
2018-01-24 16:37:59.766094: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (2048):    Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766108: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (4096):    Total 
Chunks: 2, Chunks in use: 2. 12.5KiB allocated for chunks. 12.5KiB in use in 
bin. 12.5KiB client-requested in use in bin.
2018-01-24 16:37:59.766122: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (8192):    Total 
Chunks: 2, Chunks in use: 2. 24.5KiB allocated for chunks. 24.5KiB in use in 
bin. 24.5KiB client-requested in use in bin.
2018-01-24 16:37:59.766134: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (16384):   Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766143: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (32768):   Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766155: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (65536):   Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766163: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (131072):  Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766177: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (262144):  Total 
Chunks: 2, Chunks in use: 2. 800.0KiB allocated for chunks. 800.0KiB in use in 
bin. 800.0KiB client-requested in use in bin.
2018-01-24 16:37:59.766196: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (524288):  Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766208: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (1048576):     Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766221: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (2097152):     Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766230: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (4194304):     Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766241: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (8388608):     Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766250: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (16777216):    Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766262: I         
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (33554432):    Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766271: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (67108864):    Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766282: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (134217728):   Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766292: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (268435456):   Total 
Chunks: 2, Chunks in use: 1. 3.57GiB allocated for chunks. 2.00GiB in use in 
bin. 2.00GiB client-requested in use in bin.
2018-01-24 16:37:59.766304: I 
tensorflow/core/common_runtime/bfc_allocator.cc:644] Bin for 2.00GiB was 
256.00MiB, Chunk State: 
2018-01-24 16:37:59.766335: I 
tensorflow/core/common_runtime/bfc_allocator.cc:650]   Size: 1.57GiB | 
Requested Size: 0B | in_use: 0, prev:   Size: 2.00GiB | Requested Size: 
2.00GiB | in_use: 1
2018-01-24 16:37:59.766358: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680000 of 
size 1280
2018-01-24 16:37:59.766374: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680500 of 
size 256
2018-01-24 16:37:59.766381: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680600 of 
size 256
2018-01-24 16:37:59.766387: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680700 of 
size 256
2018-01-24 16:37:59.766397: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680800 of 
size 256
2018-01-24 16:37:59.766402: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680900 of 
size 256
2018-01-24 16:37:59.766412: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680a00 of 
size 256
2018-01-24 16:37:59.766422: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680b00 of 
size 256
2018-01-24 16:37:59.766429: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680c00 of 
size 256
2018-01-24 16:37:59.766435: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680d00 of 
size 256
2018-01-24 16:37:59.766459: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680e00 of 
size 256
2018-01-24 16:37:59.766471: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680f00 of 
size 6400
2018-01-24 16:37:59.766477: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702682800 of 
size 6400
2018-01-24 16:37:59.766482: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702684100 of 
size 409600
2018-01-24 16:37:59.766492: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x7026e8100 of 
size 409600
2018-01-24 16:37:59.766499: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x70274c100 of 
size 12544
2018-01-24 16:37:59.766509: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x70274f200 of 
size 12544
2018-01-24 16:37:59.766517: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702752300 of 
size 2147483648
2018-01-24 16:37:59.766523: I 
tensorflow/core/common_runtime/bfc_allocator.cc:671] Free at 0x782752300 of 
size 1684724992
2018-01-24 16:37:59.766530: I 
tensorflow/core/common_runtime/bfc_allocator.cc:677]      Summary of in-use 
Chunks by size: 
2018-01-24 16:37:59.766543: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 10 Chunks of size 256 
totalling 2.5KiB
2018-01-24 16:37:59.766557: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 1 Chunks of size 1280 
totalling 1.2KiB
2018-01-24 16:37:59.766569: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 2 Chunks of size 6400 
totalling 12.5KiB
2018-01-24 16:37:59.766577: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 2 Chunks of size 12544 
totalling 24.5KiB
2018-01-24 16:37:59.766585: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 2 Chunks of size 409600 
totalling 800.0KiB
2018-01-24 16:37:59.766596: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 1 Chunks of size 
2147483648 totalling 2.00GiB
2018-01-24 16:37:59.766606: I 
tensorflow/core/common_runtime/bfc_allocator.cc:684] Sum Total of in-use 
chunks: 2.00GiB
2018-01-24 16:37:59.766620: I 
tensorflow/core/common_runtime/bfc_allocator.cc:686] Stats: 
Limit:                  3833069568
InUse:                  2148344576
MaxInUse:               2148344576
NumAllocs:                      18
MaxAllocSize:           2147483648

2018-01-24 16:37:59.766635: W 
tensorflow/core/common_runtime/bfc_allocator.cc:277] 

2018-01-24 16:37:59.766660: W tensorflow/core/framework/op_kernel.cc:1188] 
Resource exhausted: OOM when allocating tensor of shape [32768,16384] and type 
float
2018-01-24 16:38:00.828932: E tensorflow/core/common_runtime/executor.cc:651] 
Executor failed to create kernel. Resource exhausted: OOM when allocating 
tensor of shape [32768,16384] and type float
[[Node: fc1/weights/RMSProp_1/Initializer/zeros = Const[_class=
["loc:@fc1/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: 
[32768,16384] values: [0 0 0]...>, 
_device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Traceback (most recent call last):
File "myAutomap.py", line 278, in <module>
print_cost=True)
File "myAutomap.py", line 240, in model
sess.run(init)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", 
line 889, in run
run_metadata_ptr)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", 
line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", 
line 1317, in _do_run
options, run_metadata)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", 
line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when 
allocating tensor of shape [32768,16384] and type float
[[Node: fc1/weights/RMSProp_1/Initializer/zeros = Const[_class=
["loc:@fc1/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: 
[32768,16384] values: [0 0 0]...>, 
_device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op u'fc1/weights/RMSProp_1/Initializer/zeros', defined at:
File "myAutomap.py", line 278, in <module>
print_cost=True)
File "myAutomap.py", line 228, in model
optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(cost)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/optimizer.py", line 365, in minimize
name=name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/optimizer.py", line 516, in 
apply_gradients
self._create_slots([_get_variable_for(v) for v in var_list])
File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/rmsprop.py", 
line 113, in _create_slots
self._zeros_slot(v, "momentum", self._name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/optimizer.py", line 882, in _zeros_slot
named_slots[_var_key(var)] = slot_creator.create_zeros_slot(var, op_name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/slot_creator.py", line 174, in 
create_zeros_slot
colocate_with_primary=colocate_with_primary)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/slot_creator.py", line 148, in 
create_slot_with_initializer
dtype)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/slot_creator.py", line 67, in 
_create_slot_var
validate_shape=validate_shape)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 1256, in get_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 1097, in get_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 435, in get_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 404, in _true_getter
use_resource=use_resource, constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 806, in 
_get_single_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", 
line 229, in __init__
constraint=constraint)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", 
line 323, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 780, in <lambda>
shape.as_list(), dtype=dtype, partition_info=partition_info)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py", 
line 93, in __call__
return array_ops.zeros(shape, dtype)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", 
line 1509, in zeros
output = constant(zero, shape=shape, dtype=dtype, name=name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/framework/constant_op.py", line 218, in constant
name=name).outputs[0]
File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", 
line 3069, in create_op
op_def=op_def)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", 
line 1579, in __init__
self._traceback = self._graph._extract_stack()  # pylint: disable=protected-
access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor 
of shape [32768,16384] and type float
[[Node: fc1/weights/RMSProp_1/Initializer/zeros = Const[_class=
["loc:@fc1/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: 
[32768,16384] values: [0 0 0]...>, 
_device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Errore di segmentazione (Segmentation fault)

Answer 1:

The amount of memory you need depends largely on the size of the tensor, but ALSO on the data type you use (int32, int64, float16, float32, float64).

So, to question 1: your tensor needs 32768 x 16384 x (size of your data type) bytes. The memory footprint of float64 is 64 bits, as the name suggests, i.e. 8 bytes, so with float64 that tensor alone would need about 4.3e9 bytes (4 GiB). With float32 (the DT_FLOAT in your error) it is 4 bytes per element, i.e. 32768 x 16384 x 4 = 2,147,483,648 bytes, which is exactly the 2.00GiB allocation that fails in your log. Keep in mind that RMSProp also creates extra slot variables of the same [32768,16384] shape (that is what fc1/weights/RMSProp_1/Initializer/zeros is), so the fully-connected layer costs a multiple of that.

One easy way to reduce memory consumption is therefore to go from float64 to float32, or even float16 (1/2 and 1/4 of the memory, respectively), if the loss in precision doesn't hurt your accuracy too much.

To question 2: you also have to understand how the total memory of your AWS instance is made up. The 15 GiB of a g2.2xlarge is host RAM; the critical piece here is the RAM of the GPU backing the instance, and your log shows the GRID K520 exposes only totalMemory: 3.94GiB to TensorFlow.
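A minimal sketch of that calculation (plain Python, using NumPy only to look up the element size of each data type):

import numpy as np

shape = (32768, 16384)
n_elements = shape[0] * shape[1]                      # 536,870,912 weights
for dtype in ("float16", "float32", "float64"):
    n_bytes = n_elements * np.dtype(dtype).itemsize   # itemsize is 2, 4 or 8 bytes
    print(dtype, n_bytes, "bytes =", n_bytes / 2**30, "GiB")
# float16 -> 1.00 GiB, float32 -> 2.00 GiB, float64 -> 4.00 GiB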

Also, check out https://www.tensorflow.org/api_docs/python/tf/profiler/Profiler
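A minimal sketch of how the profiler could be used to get a per-scope time/memory breakdown (TF 1.x API; sess and train_op are placeholders for your own session and training op):

run_meta = tf.RunMetadata()
sess.run(train_op,
         options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
         run_metadata=run_meta)
tf.profiler.profile(sess.graph,
                    run_meta=run_meta,
                    cmd='scope',
                    options=tf.profiler.ProfileOptionBuilder.time_and_memory())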

Edit: You can pass a tf.ConfigProto() to your tf.Session(config=...) through which you can specify GPU usage.

In particular, look at the allow_growth, allow_soft_placement and per_process_gpu_memory_fraction options (the last one should help you the most); a minimal sketch follows below.
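For example (TF 1.x; replace the initializer call with whatever you run in your own model):

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True                     # allocate GPU memory on demand instead of all at once
config.gpu_options.per_process_gpu_memory_fraction = 0.9   # cap the fraction of GPU memory TF may use
config.allow_soft_placement = True                         # fall back to CPU when an op cannot be placed on GPU

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())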