According to Wikipedia (which is a bad source, I know), a neural network is composed of:
- An input layer of A neurons
- Multiple (B) hidden layers, each composed of C neurons
- An output layer of "D" neurons
I understand what the input and output layers mean.
My question is: how do I determine the optimal number of layers and neurons per layer?
- What is the advantage/disadvantage of increasing "B"?
- What is the advantage/disadvantage of increasing "C"?
- What is the difference between increasing "B" vs. "C"?
Is it only a matter of time (limits of processing power), or could making the network deeper limit the quality of the results? And should I focus more on depth (more layers) or on breadth (more neurons per layer)?
The number of layers/nodes depends on the classification task and what you expect of the NN. Theoretically, if you have a linearly separable function/decision (e.g. the boolean AND function), 1 layer (i.e. just the input feeding the output, with no hidden layer) will be able to form a hyperplane and will be enough. If your function isn't linearly separable (e.g. the boolean XOR), then you need hidden layers.
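To make that concrete, here is a minimal sketch in plain numpy (the `train_perceptron` helper is mine, introduced only for illustration): the classic perceptron learning rule learns AND, which a single hyperplane can separate, but can never get all of XOR right, because no single hyperplane can.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=0.1):
    """Single-layer perceptron: weight vector + bias, step activation."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            w += lr * (yi - pred) * xi   # classic perceptron update
            b += lr * (yi - pred)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {"AND": np.array([0, 0, 0, 1]), "XOR": np.array([0, 1, 1, 0])}

for name, y in targets.items():
    w, b = train_perceptron(X, y)
    preds = (X @ w + b > 0).astype(int)
    print(name, "learned exactly:", np.array_equal(preds, y))
# AND -> True:  one hyperplane separates the classes.
# XOR -> False: no hyperplane can, so a hidden layer is needed.
```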
With 1 hidden layer, you can form any (possibly unbounded) convex region. Any bounded continuous function with a finite mapping can be represented. More on that here.
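As an illustration of that convex-region claim, here is a sketch with hand-picked weights (my own choices, not taken from any reference): one hidden layer of two step units carves out the convex band 0.5 < x1 + x2 < 1.5, which is exactly the region where XOR is 1.

```python
import numpy as np

step = lambda z: (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: two half-planes (weights chosen by hand for clarity).
#   h1 fires when x1 + x2 > 0.5  (above the lower line)
#   h2 fires when x1 + x2 < 1.5  (below the upper line)
W1 = np.array([[1.0, -1.0],
               [1.0, -1.0]])
b1 = np.array([-0.5, 1.5])
H = step(X @ W1 + b1)

# Output layer: AND of the two half-planes, i.e. their intersection.
w2 = np.array([1.0, 1.0])
b2 = -1.5
print(step(H @ w2 + b2))   # -> [0 1 1 0], the XOR truth table
```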
2 hidden layers, on the other hand, are capable of representing arbitrarily complex decision boundaries; the only limitation is the number of nodes. In a typical 2-hidden-layer network, the first layer computes the half-planes that bound the regions, the second layer computes an AND operation (one for each hypercube), and lastly the output layer computes an OR operation.
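Here is a hedged sketch of that region/AND/OR construction, again with weights set by hand (the two boxes are my own example): layer 1 computes eight half-planes, layer 2 ANDs four of them into each box, and the output ORs the boxes, yielding a non-convex decision region that a single convex intersection cannot form.

```python
import numpy as np

step = lambda z: (z > 0).astype(float)

def in_region(p):
    """1.0 if p lies in box A = (0,1)x(0,1) or box B = (2,3)x(2,3)."""
    p = np.asarray(p, dtype=float)
    # Layer 1: four half-planes per box, e.g. x > 0, x < 1, y > 0, y < 1.
    W1 = np.array([[1, -1, 0,  0, 1, -1, 0,  0],
                   [0,  0, 1, -1, 0,  0, 1, -1]], dtype=float)
    b1 = np.array([0, 1, 0, 1, -2, 3, -2, 3], dtype=float)
    h1 = step(p @ W1 + b1)
    # Layer 2: AND of each box's four half-planes (fires iff all 4 fire).
    W2 = np.array([[1, 0], [1, 0], [1, 0], [1, 0],
                   [0, 1], [0, 1], [0, 1], [0, 1]], dtype=float)
    h2 = step(h1 @ W2 - 3.5)
    # Output: OR of the two boxes (fires iff at least one fires).
    return step(h2.sum() - 0.5)

print(in_region([0.5, 0.5]))  # 1.0: inside box A
print(in_region([2.5, 2.5]))  # 1.0: inside box B
print(in_region([1.5, 1.5]))  # 0.0: in neither box
```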
According to Kolmogorov's theorem, all functions can be learned by a 2-hidden-layer network, so you never need more than 2 hidden layers. In practice, however, a single hidden layer almost always does the job.
In summary, fix B=0 for linearly separable functions and B=1 for everything else.
As for C, and the relationship between B and C, have a look at The Number of Hidden Layers. It provides general information and discusses underfitting and overfitting.
The author suggests one of the following as a rule of thumb:
Answer 1. One hidden layer will model most problems; at most, two layers can be used.
Answer 2. If an inadequate number of neurons is used, the network will be unable to model complex data, and the resulting fit will be poor. If too many neurons are used, the training time may become excessively long and, worse, the network may overfit the data. When overfitting* occurs, the network begins to model the random noise in the data. The result is a model that fits the training data extremely well but generalizes poorly to new, unseen data. Validation must be used to test for this (a small demonstration follows this list).
* What is overfitting?
In statistics, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model which has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.

The concept of overfitting is important in machine learning. Usually a learning algorithm is trained using some set of training examples, i.e. exemplary situations for which the desired output is known. The learner is assumed to reach a state where it will also be able to predict the correct output for other examples, thus generalizing to situations not presented during training (based on its inductive bias).

However, especially in cases where learning was performed too long or where training examples are rare, the learner may adjust to very specific random features of the training data that have no causal relation to the target function. In this process of overfitting, the performance on the training examples still increases while the performance on unseen data becomes worse.
Answer 3. Read Answer 1 & 2.
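To see Answer 2's overfitting in action without training an actual network, here is a small sketch that uses polynomial degree as a stand-in for the number of neurons (my analogy, not the author's): the high-capacity model nails the noisy training points but does worse on held-out validation data.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)                    # the true underlying relationship

x_train = np.linspace(0, np.pi, 12)
y_train = f(x_train) + rng.normal(0, 0.15, x_train.size)  # noisy samples
x_val = np.linspace(0.1, np.pi - 0.1, 50)                 # held-out points
y_val = f(x_val)

for degree in (3, 11):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.4f}, val MSE {val_err:.4f}")
# The degree-11 model drives the training error toward zero by fitting
# the noise, while its validation error is typically much higher than
# the degree-3 model's. Exactly the train-well/generalize-poorly pattern
# the quoted passage describes, which is why you validate.
```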
The Supervised Learning article on Wikipedia (http://en.wikipedia.org/wiki/Supervised_learning) will give you more insight into the factors that are really important for any supervised learning system, including Neural Networks. The article talks about the dimensionality of the input space, the amount of training data, noise, etc.