I am new to the field of neural networks and I would like to know the difference between Deep Belief Networks and Convolutional Networks.
Also, is there a Deep Convolutional Network which is the combination of Deep Belief and Convolutional Neural Nets?
This is what I have gathered till now. Please correct me if I am wrong.
For an image classification problem, Deep Belief networks have many layers, each of which is trained using a greedy layer-wise strategy.
For example, if my image size is 50 x 50, and I want a deep network with 4 layers, namely:
- Input Layer
- Hidden Layer 1 (HL1)
- Hidden Layer 2 (HL2)
- Output Layer
My input layer will have 50 x 50 = 2500 neurons, HL1 = 1000 neurons (say), HL2 = 100 neurons (say), and the output layer = 10 neurons.
To train the weights (W1) between the input layer and HL1, I use an autoencoder (2500 - 1000 - 2500) and learn W1 of size 2500 x 1000 (this is unsupervised learning). Then I feed all images forward through the first hidden layer to obtain a set of features, use another autoencoder (1000 - 100 - 1000) to get the next set of features, and finally use a softmax layer (100 - 10) for classification. Only learning the weights of the last layer (HL2 - Output, which is the softmax layer) is supervised learning.
(I could use an RBM instead of an autoencoder.)
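To make the layer sizes concrete, the greedy layer-wise procedure described above can be sketched roughly as follows. This is only an illustration under several assumptions that are not part of the question: the images are random stand-ins, the autoencoder uses a sigmoid encoder with a linear decoder, and the helper name `train_autoencoder`, the learning rate and the epoch count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, epochs=5, lr=0.001, seed=0):
    """Fit an (n_in - n_hidden - n_in) autoencoder by gradient descent on the
    squared reconstruction error; return only the encoder (W, b)."""
    rng = np.random.default_rng(seed)
    n, n_in = X.shape
    W = rng.normal(0.0, 0.01, (n_in, n_hidden))   # encoder weights (kept)
    b = np.zeros(n_hidden)
    V = rng.normal(0.0, 0.01, (n_hidden, n_in))   # decoder weights (discarded)
    c = np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W + b)            # hidden features
        R = H @ V + c                     # linear reconstruction of the input
        dR = (R - X) / n                  # gradient of the error w.r.t. R
        dZ = (dR @ V.T) * H * (1.0 - H)   # backpropagate through the sigmoid
        V -= lr * (H.T @ dR)
        c -= lr * dR.sum(axis=0)
        W -= lr * (X.T @ dZ)
        b -= lr * dZ.sum(axis=0)
    return W, b

rng = np.random.default_rng(0)
X = rng.random((20, 2500))            # 20 stand-in images, 50 x 50 flattened

W1, b1 = train_autoencoder(X, 1000)   # stage 1: 2500 - 1000 - 2500
F1 = sigmoid(X @ W1 + b1)             # features from HL1
W2, b2 = train_autoencoder(F1, 100)   # stage 2: 1000 - 100 - 1000
F2 = sigmoid(F1 @ W2 + b2)            # features from HL2, input to the softmax layer

print(W1.shape, W2.shape, F2.shape)   # (2500, 1000) (1000, 100) (20, 100)
```

Each stage is trained without labels on the output of the previous one; only the final 100 - 10 softmax layer would need labels.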
If the same problem were solved using Convolutional Neural Networks, then for 50 x 50 input images, I would develop a network using only 7 x 7 patches (say). My layers would be:
- Input Layer (7 x 7 = 49 neurons)
- HL1 (25 neurons for 25 different features) - (convolution layer)
- Pooling Layer
- Output Layer (Softmax)
And for learning the weights, I take 7 x 7 patches from images of size 50 x 50 and feed them forward through the convolutional layer, so I will have 25 different feature maps, each of size (50 - 7 + 1) x (50 - 7 + 1) = 44 x 44.
I then use a window of, say, 11 x 11 for pooling and hence get 25 feature maps of size 4 x 4 as the output of the pooling layer. I use these feature maps for classification.
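The size arithmetic above can be checked with a small NumPy sketch. The filters here are random placeholders (in a real network they would be learned), the image is a random stand-in, and max pooling over non-overlapping windows is an assumption, since the pooling type is not specified above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
image = rng.random((50, 50))            # stand-in for one 50 x 50 image
filters = rng.normal(size=(25, 7, 7))   # 25 filters of size 7 x 7 (random here)

# Every 7 x 7 patch of the image: shape (44, 44, 7, 7), since 50 - 7 + 1 = 44.
patches = sliding_window_view(image, (7, 7))

# Apply all 25 filters to every patch: 25 feature maps of size 44 x 44.
feature_maps = np.einsum("ijkl,fkl->fij", patches, filters)

# Non-overlapping 11 x 11 max pooling: 44 / 11 = 4 windows along each axis.
pooled = feature_maps.reshape(25, 4, 11, 4, 11).max(axis=(2, 4))

features = pooled.reshape(-1)           # 25 * 4 * 4 = 400 values for the classifier
print(feature_maps.shape, pooled.shape, features.shape)
# (25, 44, 44) (25, 4, 4) (400,)
```

Note that the same 25 x 49 filter weights are reused at all 44 x 44 positions, which is why the layer has far fewer parameters than a fully connected one.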
While learning the weights, I don't use the layer-wise strategy as in Deep Belief Networks (unsupervised learning), but instead use supervised learning and learn the weights of all the layers simultaneously. Is this correct, or is there any other way to learn the weights?
Is what I have understood correct?
So if I want to use DBNs for image classification, I should resize all my images to a particular size (say 200 x 200) and have that many neurons in the input layer, whereas in the case of CNNs, I train only on a smaller patch of the input (say 10 x 10 for an image of size 200 x 200) and convolve the learnt weights over the entire image?
Do DBNs provide better results than CNNs or is it purely dependent on the dataset?
Thank You.
Generally speaking, DBNs are generative neural networks that stack Restricted Boltzmann Machines (RBMs). You can think of RBMs as being generative autoencoders. If you want a deep belief net, you should be stacking RBMs and not plain autoencoders, as Hinton and his student Teh proved that stacking RBMs results in sigmoid belief nets.
Convolutional neural networks have performed better than DBNs by themselves in the current literature on benchmark computer vision datasets such as MNIST. If the dataset is not a computer vision one, then DBNs can most definitely perform better. In theory, DBNs should be the best models, but it is very hard to estimate joint probabilities accurately at the moment. You may be interested in Lee et al.'s (2009) work on Convolutional Deep Belief Networks, which looks to combine the two.
I will try to explain the situation using the example of learning to recognize shoes.
If you use a DBN to learn those images, here is the bad thing that will happen in your learning algorithm:
- There will be shoes in different places.
- All the neurons will try to learn not only the shoes but also the position of the shoes in the images, because the network has no concept of a 'local image patch' in its weights.
A DBN makes sense if all your images are aligned in terms of size, translation and rotation.
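As a toy illustration of that position problem, here is a sketch with a single fully connected unit. The 7 x 7 "shoe" is just a random positive pattern, and the helper `image_with_patch` and the chosen positions are made up for this example.

```python
import numpy as np

def image_with_patch(patch, row, col, size=50):
    """Return a blank size x size image with `patch` pasted at (row, col)."""
    img = np.zeros((size, size))
    h, w = patch.shape
    img[row:row + h, col:col + w] = patch
    return img

rng = np.random.default_rng(0)
shoe = rng.random((7, 7)) + 0.5    # hypothetical 7 x 7 "shoe" pattern

# One fully connected unit: 2500 weights that match the shoe at position (5, 5).
weights = image_with_patch(shoe, 5, 5).reshape(-1)

same_place = image_with_patch(shoe, 5, 5).reshape(-1)
moved = image_with_patch(shoe, 30, 30).reshape(-1)

print(weights @ same_place)   # large: the shoe sits where the weights expect it
print(weights @ moved)        # 0.0: same shoe, but this unit does not respond
```

To respond to the shoe at every position, a fully connected layer would need separate units (and separate training examples) for each location.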
The idea of convolutional networks is built on a concept called weight sharing. Let me apply this 'weight sharing' concept to your example.
First, you looked at 7x7 patches. In your example, three of the neurons in the first layer might have learned the 'front', 'back-bottom' and 'back-upper' parts of shoes, as these would look alike in a 7x7 patch across all shoes.
Normally the idea is to have multiple convolution layers one after another to learn:
- lines/edges in the first layer,
- arcs and corners in the second layer,
- higher-level concepts in higher layers, like the front of a shoe, an eye in a face, a wheel in a car, or rectangles, cones and triangles, which are still primitive yet are combinations of the previous layers' outputs.
You can think of the 3 shoe parts I mentioned as 3 different neurons. Such neurons will fire when there are shoes in some part of the image.
Pooling preserves your strongest activations while sub-sampling your feature maps, creating a lower-dimensional space that makes things computationally easier and feasible.
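A small sketch of why shared weights plus pooling help with position: the same filter is slid over the whole image and only the strongest response is kept. The "shoe" pattern is again random, the filter is simply set equal to that pattern instead of being learned, and a single global max is used in place of the 11 x 11 windows to keep the example short.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def shared_filter_response(img, kernel):
    """Slide one kernel over the image (shared weights), then keep the max."""
    patches = sliding_window_view(img, kernel.shape)
    feature_map = np.einsum("ijkl,kl->ij", patches, kernel)
    return feature_map.max()

rng = np.random.default_rng(0)
shoe = rng.random((7, 7)) + 0.5    # hypothetical 7 x 7 "shoe" pattern

img_a = np.zeros((50, 50))
img_a[5:12, 5:12] = shoe           # shoe near the top-left
img_b = np.zeros((50, 50))
img_b[30:37, 30:37] = shoe         # same shoe, different place
blank = np.zeros((50, 50))         # no shoe at all

r_a = shared_filter_response(img_a, shoe)
r_b = shared_filter_response(img_b, shoe)
r_blank = shared_filter_response(blank, shoe)

print(np.isclose(r_a, r_b), r_blank)   # True 0.0
```

The 'shoe neuron' gives the same pooled response wherever the shoe is, and stays at zero when there is no shoe.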
So at the last layer, when you look at your 25 x 4 x 4 output, in other words a 400-dimensional vector, if there is a shoe somewhere in the picture your 'shoe neuron(s)' will be active, whereas the non-shoe neurons will be close to zero.
And to understand which neurons are for shoes and which ones are not, you feed that 400-dimensional vector to another supervised classifier (this can be anything, such as a multi-class SVM or, as you said, a softmax layer).
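For the softmax option, that final step might look roughly like this. The weights are random placeholders rather than trained values, the feature vector is a stand-in, and the 10 classes are borrowed from the output layer size in the question.

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.random(400)               # stand-in for the 25 x 4 x 4 pooled output
W = rng.normal(0.0, 0.01, (400, 10))     # 400 inputs -> 10 classes (untrained)
b = np.zeros(10)

probs = softmax(features @ W + b)        # one probability per class
predicted = int(np.argmax(probs))        # index of the most likely class

print(probs.shape, predicted)
```

Training this layer with labels is what tells the network which of the 400 pooled activations indicate a shoe.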
I would advise you to have a glance at Fukushima's 1980 paper to understand what I am trying to say about translation invariance and the line -> arc -> semicircle -> shoe front -> shoe idea (http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf). Even just looking at the images in the paper will give you some idea.