Understanding the output of a Dense layer for higher-dimensional input

Posted 2020-07-10 12:17

Question:

I have no problem understanding the output shape of a Dense layer that follows a Flatten layer. The output shape matches my expectation, i.e. (batch size, units).

import keras

nn = keras.Sequential()
nn.add(keras.layers.Conv2D(8, kernel_size=(2, 2), input_shape=(4, 5, 1)))
nn.add(keras.layers.Conv2D(1, kernel_size=(2, 2)))
nn.add(keras.layers.Flatten())
nn.add(keras.layers.Dense(5))
nn.add(keras.layers.Dense(1))

nn.summary()

The output is:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 3, 4, 8)           40        
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 2, 3, 1)           33        
_________________________________________________________________
flatten_1 (Flatten)          (None, 6)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 35        
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 6         
=================================================================
Total params: 114
Trainable params: 114
Non-trainable params: 0
_________________________________________________________________

But I am having trouble understanding the output shape of a Dense layer for multidimensional input. So for the following code:

import keras

nn = keras.Sequential()
nn.add(keras.layers.Conv2D(8, kernel_size=(2, 2), input_shape=(4, 5, 1)))
nn.add(keras.layers.Conv2D(1, kernel_size=(2, 2)))
# nn.add(keras.layers.Flatten())
nn.add(keras.layers.Dense(5))
nn.add(keras.layers.Dense(1))

nn.summary()

the output is:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 3, 4, 8)           40        
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 2, 3, 1)           33        
_________________________________________________________________
dense_1 (Dense)              (None, 2, 3, 5)           10        
_________________________________________________________________
dense_2 (Dense)              (None, 2, 3, 1)           6         
=================================================================
Total params: 89
Trainable params: 89
Non-trainable params: 0
_________________________________________________________________
I am unable to develop an intuition for the output shapes of the dense_1 and dense_2 layers. Shouldn't the final output be a scalar, or (batch, units)? The following answer to a similar question tries to explain the intuition, but I cannot fully grasp the concept. From that answer:

That is, each output "pixel" (i, j) in the 640x959 grid is calculated as a dense combination of the 8 different convolution channels at point (i, j) from the previous layer.

Maybe an explanation with pictures would be useful.

Answer 1:

This is tricky, but it does fit with the Keras documentation on Dense layers:

Output shape

nD tensor with shape: (batch_size, ..., units). For instance, for a 2D input with shape (batch_size, input_dim), the output would have shape (batch_size, units)

Note it is not the clearest wording, but what they are saying with the ... is that the final dimension of the input shape gets replaced by the number of units. Basically, for each item along the final dimension, the layer creates a connection to each of the requested units in the coming Dense layer.

In your case you have something which is 2 x 3 x 1. So there is "one thing" (the 2 x 3 thing) to be connected to each of the 5 dense layer nodes, hence 2 x 3 x 5. You can think of it like channels of a CNN layer in this particular case. There is a distinct 2 x 3 sheet of outputs for each of the 5 output "nodes".
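Here is a minimal sketch of that claim (my own check, assuming TensorFlow 2's bundled Keras rather than standalone Keras): the Dense layer applies one and the same (last_dim -> units) weight matrix at every (i, j) position, so its output on a (2, 3, 1) input can be reproduced with a single broadcasted matrix product.

import numpy as np
from tensorflow import keras

dense = keras.layers.Dense(5)
x = np.random.rand(1, 2, 3, 1).astype("float32")  # (batch, 2, 3, 1)

y_keras = dense(x).numpy()                         # shape (1, 2, 3, 5)

W, b = dense.get_weights()                         # W: (1, 5), b: (5,)
y_manual = x @ W + b                               # broadcasts over the (2, 3) grid

print(y_keras.shape)                               # (1, 2, 3, 5)
print(np.allclose(y_keras, y_manual, atol=1e-6))   # True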

In the purely 2-D case, (batch_size, input_dim), each item you iterate along the final dimension is itself a scalar value, so you end up with an output whose size is exactly the number of dense units requested: (batch_size, units).

But in a higher-dimensional case, each item you iterate along the final dimension of the input is itself still a higher-dimensional thing, so the output consists of k distinct "clones" of those higher-dimensional things, where k is the number of dense units requested. By "clone" we mean that the output for a single dense connection has the same shape as the items indexed by the final dimension of the input.

Then the Dense-ness means that each specific element of that output has a connection to each element of the corresponding set of inputs. But be careful about this. Dense layers are defined by having "one" connection between each item of the output and each item of the input. So even though you have 5 "2 x 3 things" in your output, each of them has just one solitary weight describing how it is connected to the 2 x 3 thing that is the input. Keras also defaults to using a bias vector (not a bias tensor), so if the Dense layer has k units and the final dimension of the previous layer is n, you should expect (n + 1) * k trainable parameters. These are always applied with NumPy-like broadcasting to make the lower-dimensional weight matrix and bias vector conformable to the actual shapes of the input tensors.
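As a quick sanity check of that parameter count (again my own sketch, assuming tf.keras), you can build the layer against the 4-D input shape and inspect the kernel and bias directly; with n = 1 and k = 5 this reproduces the 10 parameters reported for dense_1.

from tensorflow import keras

layer = keras.layers.Dense(5)
layer.build(input_shape=(None, 2, 3, 1))  # last input dim n = 1, units k = 5

print(layer.kernel.shape)    # (1, 5) -> n * k = 5 weights
print(layer.bias.shape)      # (5,)   -> k = 5 biases
print(layer.count_params())  # 10     -> (n + 1) * k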

It is customary to use Flatten, as in your first example, if you want to enforce the exact size of the coming Dense layer. You would use a multidimensional Dense layer when you want a different "(n - 1)-D" group of connections to each Dense node. This is probably rare for higher-dimensional inputs, because you'd typically want a CNN-type operation instead, but I could imagine it being useful when a model predicts pixel-wise values, or when you are generating a full n-D output (say, from the decoder portion of an encoder-decoder network) and want a dense array of cells that matches the dimensions of some expected structured output such as an image or video.
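To make the "channels" analogy concrete, here is a small sketch (my own illustration, assuming tf.keras; not something from the original answer): a Dense layer applied to the channel axis of an image-shaped tensor computes exactly what a 1x1 Conv2D with the same weights computes, which is the usual way such pixel-wise predictions are written.

import numpy as np
from tensorflow import keras

x = np.random.rand(1, 2, 3, 8).astype("float32")  # (batch, H, W, channels)

dense = keras.layers.Dense(5)
conv1x1 = keras.layers.Conv2D(5, kernel_size=1)

y_dense = dense(x).numpy()                         # (1, 2, 3, 5)

# Copy the Dense weights into the 1x1 convolution: kernel (1, 1, 8, 5), bias (5,)
W, b = dense.get_weights()
conv1x1.build(x.shape)
conv1x1.set_weights([W.reshape(1, 1, 8, 5), b])

y_conv = conv1x1(x).numpy()                        # (1, 2, 3, 5)
print(np.allclose(y_dense, y_conv, atol=1e-6))     # True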