Using R and the package neuralnet, I am trying to model data that looks like this:
These are temperature readings at 10-minute intervals over several days (above is a two-day cutout). Using the code below, I fit a neural network to the data. There are probably simpler ways to model this exact data, but in the future the data might look quite different. Using a single hidden layer with 2 neurons gives me satisfactory results:
This also works most of the time with more layers and neurons. However, with one hidden layer containing a single neuron, and occasionally with two layers (in my case 3 and 2 neurons, respectively), I get rather poor results, always with the same shape:
The only random element is the initialization of the start weights, so I assume it is related to that. However, I must admit that I have not fully grasped the theory of neural networks yet. What I would like to know is whether the poor results are due to a local minimum ('neuralnet' uses resilient backpropagation with weight backtracking by default) and I am simply out of luck, or whether I can avoid such a scenario. I am under the impression that there is an optimal number of hidden nodes for fitting, e.g., polynomials of degree 2, 5, or 10. If not, what is my best course of action? A larger learning rate? A smaller error threshold? Thanks in advance.
I have not tried tuning the rprop parameters yet, so the solution might lie there.
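For reference, this is roughly what tuning them would look like (an untried sketch; the values are just guesses, and model/trainingData are the objects defined in the code below):
# Untried sketch: the rprop-related knobs of neuralnet() with guessed values
net <- neuralnet::neuralnet(model,
                            trainingData,
                            hidden = 2,
                            threshold = 0.005,                       # stricter stopping criterion
                            algorithm = "rprop+",                    # resilient backprop with backtracking (the default)
                            learningrate.factor = list(minus = 0.5,  # step-size shrink / growth factors
                                                       plus = 1.2),
                            learningrate.limit = list(min = 1e-10,   # bounds on the adaptive step sizes
                                                      max = 0.1),
                            stepmax = 100000)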
Code:
# DATA ----------------------
minute <- seq(0, 6*24 - 1)                    # one day in 10-minute steps (6 per hour * 24 hours)
temp <- rep.int(17, 6*24)                     # baseline temperature
temp[(6*7):(6*20)] <- 20                      # higher temperature during the day (roughly 7:00 to 20:00)
n <- 10                                       # number of (identical) days
dta <- data.frame(Zeit = minute, Status = temp)
dta <- dta[rep(seq_len(nrow(dta)), n), ]      # repeat the daily pattern n times
# Scale everything
maxs <- apply(dta, 2, max)
mins <- apply(dta, 2, min)
nnInput <- data.frame(Zeit = dta$Zeit, Status = dta$Status)
nnInput <- as.data.frame(scale(nnInput, center = mins, scale = maxs - mins))
trainingData <- nnInput[seq(1, nrow(nnInput), 2), ]   # odd rows for training
testData <- nnInput[seq(2, nrow(nnInput), 2), ]       # even rows for testing
# MODEL ---------------------
model <- as.formula("Status ~ Zeit")
net <- neuralnet::neuralnet(model,
trainingData,
hidden = 2,
threshold = 0.01,
linear.output = TRUE,
lifesign = "full",
stepmax = 100000,
rep = 1)
# Predict on the test inputs and rescale everything back to the original units
net.results <- neuralnet::compute(net, testData$Zeit)
results <- net.results$net.result * (maxs["Status"] - mins["Status"]) + mins["Status"]
testData <- as.data.frame(t(t(testData) * (maxs - mins) + mins))
cleanOutput <- data.frame(Actual = testData$Status,
Prediction = results,
diff = abs(results - testData$Status))
summary(cleanOutput)
plot(cleanOutput$Actual[1:144], main = "Zeittabelle", xlab = paste("Min. seit 0:00 *", n), ylab = "Temperatur")
lines(cleanOutput$Prediction[1:144], col = "red", lwd = 3)
Basically, initialization is really important. If you do not initialize the weights randomly, you might end up with a network that does not work at all (e.g. by setting all the weights to 0). It has also been shown that for sigmoid and ReLU units a certain kind of initialization can help in training your network.
But in your case, I think the differences are mostly caused by the complexity of your problem: a model whose complexity matches the complexity of the problem performs nicely. The other models may suffer for the following reasons:
- Too little complexity: with one node you may simply be unable to learn the proper function.
- Too much complexity: with a two-layer network you might get stuck in a local minimum. Increasing the number of parameters of your network also increases the size of the parameter space. On the one hand you might get a better model; on the other hand you may land in a region of the parameter space that leads to a poor solution. Training the same model with several different initializations and choosing the best one might overcome this issue (see the sketch after this list).
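As a sketch of that last suggestion, neuralnet's rep argument trains the same architecture from several random start-weight sets, and you can then keep the repetition with the lowest training error (the other arguments simply mirror your code):
# Sketch: several random initializations, keep the best repetition
net <- neuralnet::neuralnet(model, trainingData,
                            hidden = 2,
                            threshold = 0.01,
                            linear.output = TRUE,
                            stepmax = 100000,
                            rep = 10)                               # 10 random start-weight sets
best <- which.min(net$result.matrix["error", ])                     # repetition with the lowest error
net.results <- neuralnet::compute(net, testData$Zeit, rep = best)   # predict with the best one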
UPDATE:
With small network sizes it is quite common to get stuck in a local minimum. Depending on how much time you can spend training your network, you may use the following techniques to overcome that:
- Dropout / batch normalization / randomized batch learning: if you are able to train your network a little longer, you can exploit the randomization introduced by dropout or batch normalization. These random fluctuations make it possible to escape poor local minima (which are usually believed to be relatively shallow).
- Cross-validation / multiple runs: when you restart training multiple times, the probability of ending up in a poor minimum decreases significantly (this is what the multi-repetition sketch above does).
Regarding the connection between layer size and polynomial degree: I think the question is not clearly stated. You would have to specify more details, e.g. the activation function. I also think that polynomials and the functions that can be modelled by a classic neural network differ quite a lot in nature. In a polynomial, a small change in parameter values usually leads to a much larger change in the output than in the neural network case. Usually the derivative of a neural network is a bounded function, whereas the derivative of a polynomial is unbounded once the degree is greater than one. For these reasons I think that looking for a dependency between polynomial degree and hidden layer size is not worth serious consideration.
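To illustrate the bounded-derivative point, here is a toy comparison with made-up weights (not related to your network): the derivative of a one-hidden-layer sigmoid network with linear output can never exceed sum(abs(v * w)) / 4, while a polynomial's derivative grows without bound:
# Toy illustration: network derivative stays bounded, polynomial derivative does not
sigmoid  <- function(z) 1 / (1 + exp(-z))
dsigmoid <- function(z) sigmoid(z) * (1 - sigmoid(z))            # never exceeds 1/4
w <- c(1.5, -2); b <- c(0.2, -0.1); v <- c(3, 4)                 # made-up hidden weights, biases, output weights
dnet  <- function(x) sapply(x, function(xi) sum(v * w * dsigmoid(w * xi + b)))
dpoly <- function(x) 5 * x^4                                     # derivative of x^5
x <- c(-100, -10, 0, 10, 100)
max(abs(dnet(x)))    # stays below sum(abs(v * w)) / 4 = 3.125
max(abs(dpoly(x)))   # 5e+08, and it keeps growing with |x|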
All you need is a good init (2016): this paper proposes a simple method of weight initialization for deep network training (http://arxiv.org/abs/1511.06422).
Watch this 6-minute video by Andrew Ng (Machine Learning, Coursera -> Week 5 -> Random Initialization); it explains the danger of setting all initial weights to zero in backpropagation (https://www.coursera.org/learn/machine-learning/lecture/ND5G5/random-initialization).
Suppose we initialize all weights to the same value (e.g. zero or one). In that case, each hidden unit gets exactly the same signal. For example, if all weights are initialized to 1, each unit gets a signal equal to the sum of the inputs (and outputs sigmoid(sum(inputs))). If all weights are zero, which is even worse, every hidden unit gets zero signal. No matter what the input is, if all weights are the same, all units in the hidden layer will be the same too. This is why one should initialize the weights randomly.
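A tiny sketch of that symmetry problem (toy numbers, unrelated to the question's network): two hidden sigmoid units fed the same inputs either collapse into one unit or can specialize, depending only on how they are initialized:
# Sketch: identical start weights make both hidden units carry the same signal
sigmoid <- function(z) 1 / (1 + exp(-z))
x <- c(0.1, 0.5, 0.9)                                    # a few scaled inputs
hidden <- function(W) sigmoid(W %*% rbind(1, x))         # rows = units, columns = inputs; first column of W is the bias
W_same <- matrix(1, nrow = 2, ncol = 2)                  # all biases and weights set to 1
W_rand <- matrix(rnorm(4, sd = 0.5), nrow = 2, ncol = 2) # small random start weights
hidden(W_same)   # both rows identical: the units (and their gradients) stay identical
hidden(W_rand)   # rows differ: the units can learn different features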