Question:
Can we use different activation functions for the hidden layers and the output layer of a neural network? Is there any explicit advantage to such a scheme?
Answer 1:
In short, yes, you can. A common approach is to use a sigmoid function as the hidden-layer activation to introduce nonlinearity, and to choose the output activation for the particular task (depending on what you are trying to model and what cost function you use).
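As a quick illustration (a minimal sketch assuming PyTorch; the layer sizes are arbitrary), the hidden activation and the output layer can be chosen independently:

```python
# Minimal sketch (assuming PyTorch): sigmoid hidden activation,
# output layer chosen for the task. Layer sizes are arbitrary.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.Sigmoid(),       # nonlinear hidden activation
    nn.Linear(32, 3),   # raw class scores; nn.CrossEntropyLoss applies softmax internally
)
# For regression, the same hidden layer could instead feed nn.Linear(32, 1)
# with no output activation, trained with nn.MSELoss.
```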
Answer 2:
For the last layer of the network, the activation also depends on the task.
- Classification: you want exactly one of the outputs to fire as the predicted label, but there is no differentiable way to achieve that exactly, so a softmax is used to approximate it.
- Regression: if the target can take any real value, a linear (identity) output is the usual choice. If the target is bounded, a sigmoid (range (0, 1)) or tanh (range (-1, 1)) can be used after scaling the targets to that range, which keeps optimization straightforward.
For the intermediate (hidden) layers, ReLU is the most common choice nowadays because it is cheap to compute and its gradient does not vanish as quickly during back-propagation; a sketch combining ReLU hidden layers with task-specific outputs follows below.
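A minimal sketch of this (assuming PyTorch; the sizes and the batch are made up), with ReLU hidden layers and a separate output head per task:

```python
import torch
import torch.nn as nn

# ReLU hidden layers: cheap to evaluate, gradient is 1 for positive inputs.
hidden = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
)

# Classification head: softmax turns the class scores into probabilities.
clf_head = nn.Sequential(nn.Linear(64, 5), nn.Softmax(dim=1))

# Regression head: linear (identity) output, so the range is unbounded.
reg_head = nn.Linear(64, 1)

x = torch.randn(8, 20)        # a batch of 8 examples
probs = clf_head(hidden(x))   # each row sums to 1
preds = reg_head(hidden(x))   # arbitrary real values
```

In practice the explicit Softmax layer is often dropped and the raw scores are passed to nn.CrossEntropyLoss, which applies log-softmax internally for numerical stability.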
Answer 3:
If you are solving a prediction (regression) task rather than classification, you may use a linear combination (identity activation) in the output layer, since the sigmoid function restricts the output range to (0, 1), which suits threshold-based classification problems.
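A tiny sketch of the difference (assuming PyTorch; the pre-activation values are made up):

```python
import torch

z = torch.tensor([-5.0, 0.0, 3.0, 50.0])  # hypothetical pre-activations of the output layer
print(torch.sigmoid(z))  # ~[0.007, 0.500, 0.953, 1.000] -- squashed into (0, 1), handy for thresholding
print(z)                 # linear (identity) output: values pass through unchanged, so unbounded targets are reachable
```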