What is the advantage of using multiples of the same filter in convolutional networks in deep learning?
For example:
We use 6 filters of size [5,5] at the first layer to scan the image data, which is a matrix of size [28,28].
The question is: why do we not use only a single filter of size [5,5] instead of 6 or more of them? In the end, they will scan the exact same pixels. I can see that the random weights might be different, but the DL model will adjust to them anyway.
So, specifically, what is the main advantage and purpose of using multiple filters of the same shape in convnets?
Why is filter shape the same?
First, the kernel shape is the same merely to speed up computation. This allows the convolution to be applied in a batch, for example using an im2col transformation followed by a matrix multiplication. It also makes it convenient to store all the weights in one multidimensional array. Mathematically, though, one could imagine using several filters of different shapes.
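To make this concrete, here is a minimal NumPy sketch of the im2col idea (function and variable names are my own, and this computes cross-correlation, as CNN libraries typically do): every patch of the image is unrolled into a row of a matrix, so applying all 6 filters at once reduces to a single matrix multiplication.

```python
import numpy as np

def im2col(image, k):
    """Unroll every k x k patch of a 2D image into one row of a matrix."""
    h, w = image.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((out_h * out_w, k * k))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = image[i:i + k, j:j + k].ravel()
    return cols

# Six 5x5 filters applied to a 28x28 image in one matrix multiplication.
image = np.random.rand(28, 28)
filters = np.random.rand(6, 5, 5)

patches = im2col(image, 5)          # shape (576, 25): one row per patch
weights = filters.reshape(6, -1).T  # shape (25, 6): one column per filter
feature_maps = (patches @ weights).T.reshape(6, 24, 24)  # 6 feature maps
```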
Some architectures, such as the Inception network, use this idea and apply different convolutional layers (with different kernel sizes) in parallel, stacking up the resulting feature maps at the end. This turned out to be very useful.
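Here is a rough PyTorch sketch of that idea, not the actual Inception architecture: three convolutions with different kernel sizes run in parallel on the same input, and their feature maps are concatenated along the channel dimension (the class name, branch names, and channel counts are arbitrary choices of mine).

```python
import torch
import torch.nn as nn

class ParallelConvBlock(nn.Module):
    """Inception-style block: convolutions with different kernel sizes
    run in parallel on the same input; their feature maps are stacked
    along the channel dimension."""
    def __init__(self, in_channels):
        super().__init__()
        # Padding keeps spatial sizes equal so the maps can be concatenated.
        self.branch1 = nn.Conv2d(in_channels, 8, kernel_size=1)
        self.branch3 = nn.Conv2d(in_channels, 8, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, 8, kernel_size=5, padding=2)

    def forward(self, x):
        return torch.cat([self.branch1(x),
                          self.branch3(x),
                          self.branch5(x)], dim=1)

x = torch.randn(1, 1, 28, 28)            # one greyscale 28x28 image
print(ParallelConvBlock(1)(x).shape)     # torch.Size([1, 24, 28, 28])
```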
Why isn't one filter enough?
Because each filter is going to learn exactly one pattern that excites it, e.g., a Gabor-like vertical line. A single filter can't be equally excited by both a horizontal and a vertical line, so to recognize an object, one such filter is not enough.
For example, in order to recognize a cat, a neural network might need to recognize the eyes, the tail, ..., all of which are composed of different lines and edges. The network can be confident about the object in the image only if it can recognize a whole variety of different shapes and patterns. This holds true even for a simple dataset like MNIST.
Why do filters learn different patterns?
An analogy: imagine a simple feed-forward regression network with one hidden layer. Each neuron in the hidden layer is connected to every input feature, so they all start out symmetric. Yet after some training, different neurons will learn different high-level features that are useful for making a correct prediction.
There's a catch: if the network is initialized with zeros, it will suffer from symmetry issues and in general won't converge to the target distribution. So it's essential to create asymmetry between the neurons from the very beginning and let different neurons get excited differently by the same input data. This in turn leads to different gradients being applied to the weights, usually increasing the asymmetry even more. That's why different neurons end up trained differently.
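Here is a minimal NumPy sketch of that symmetry problem, using a toy two-neuron hidden layer and a squared-error loss of my own choosing: if two neurons start with identical weights, they receive identical gradients, so training can never pull them apart.

```python
import numpy as np

# Toy demo: two hidden neurons with identical starting weights
# receive identical gradients and therefore remain clones forever.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))            # 4 samples, 3 input features
y = rng.normal(size=(4, 1))

w = rng.normal(size=(3, 1))
W1 = np.hstack([w, w])                 # two hidden neurons, IDENTICAL weights
W2 = np.ones((2, 1))                   # identical output weights too

h = np.tanh(x @ W1)                    # both hidden columns are identical
pred = h @ W2
grad_pred = 2 * (pred - y) / len(y)    # dL/dpred for a squared-error loss
grad_W1 = x.T @ ((grad_pred @ W2.T) * (1 - h**2))  # backprop through tanh

print(np.allclose(grad_W1[:, 0], grad_W1[:, 1]))   # True: identical gradients
```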
It's important to mention another issue that is still possible even with random initialization, called co-adaptation: different neurons learn to adapt to and depend on each other. This problem has been addressed by the dropout technique and later by batch normalization, which in various ways essentially add noise to the training process. Putting it all together, neurons are much more likely to learn different latent representations of the data.
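As a quick illustration, here is a hypothetical PyTorch layer stack combining these ideas (the channel counts and dropout rate are arbitrary): during training, dropout zeroes random feature maps and batch norm's statistics vary from batch to batch, both injecting noise that discourages co-adaptation.

```python
import torch
import torch.nn as nn

# Dropout randomly zeroes activations during training, so no neuron can
# rely on a particular partner being present; batch norm adds further
# batch-dependent noise through its normalization statistics.
layer = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),
    nn.BatchNorm2d(6),
    nn.ReLU(),
    nn.Dropout2d(p=0.25),   # drops entire feature maps at random
)

layer.train()               # noise is active only in training mode
x = torch.randn(8, 1, 28, 28)
out = layer(x)              # different feature maps are zeroed on each call
```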
Further links
I highly recommend reading Stanford's CS231n tutorial to gain better intuition about convolutional neural networks.
Zeiler and Fergus (https://arxiv.org/pdf/1311.2901.pdf) have a good picture showing how kernels respond to different parts of an image.
Each kernel convolves over the image, so all the kernels (potentially) see all the pixels. Each of your 6 filters will "learn" a different feature. In the first layer, some will typically learn line features (horizontal, vertical, diagonal) and some will learn colour blobs. In the next layer, these get combined: pixels into edges into shapes.
It might help to look up Prewitt filters: https://en.m.wikipedia.org/wiki/Prewitt_operator
In this case, it is a single 3x3 kernel that convolves over the whole image and gives a feature map showing horizontal (or vertical) edges. You need one filter for horizontal edges and a different filter for vertical ones, but you can combine their outputs to get both. In a neural network, the kernel values are learned from data, but the feature maps at each layer are still produced by convolving the kernels over the input.
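A small SciPy sketch of this (the toy image is my own): a vertical stripe excites the vertical-edge Prewitt kernel but leaves the horizontal-edge one silent, and the two responses can then be combined into a single edge-magnitude map.

```python
import numpy as np
from scipy.signal import convolve2d

# Prewitt kernels: one responds to horizontal edges, the other to vertical.
prewitt_h = np.array([[ 1,  1,  1],
                      [ 0,  0,  0],
                      [-1, -1, -1]])
prewitt_v = prewitt_h.T

# A toy image: a bright vertical stripe on a dark background.
image = np.zeros((28, 28))
image[:, 10:14] = 1.0

edges_h = convolve2d(image, prewitt_h, mode='valid')
edges_v = convolve2d(image, prewitt_v, mode='valid')

print(np.abs(edges_h).max())   # 0.0: the horizontal-edge filter stays silent
print(np.abs(edges_v).max())   # 3.0: the vertical-edge filter fires strongly

# Combining both responses gives an overall edge-magnitude map.
magnitude = np.hypot(edges_h, edges_v)
```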