Gradient Descent vs Stochastic Gradient Descent al

2020-06-19 08:32发布

问题:

I tried to train a FeedForward Neural Network on the MNIST Handwritten Digits dataset (includes 60K training samples).

I each time iterated over all the training samples, performing Backpropagation for each such sample on every epoch. The runtime is of course too long.

  • Is the algorithm I ran named Gradient Descent?

I read that for large datasets, using Stochastic Gradient Descent can improve the runtime dramatically.

  • What should I do in order to use Stochastic Gradient Descent? Should I just pick the training samples randomly, performing Backpropagation on each randomly picked sample, instead of the epochs I currently use?

回答1:

The new scenario you describe (performing Backpropagation on each randomly picked sample), is one common "flavor" of Stochastic Gradient Descent, as described here: https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent

The 3 most common flavors according to this document are (Your flavor is C):

A)

randomly shuffle samples in the training set
for one or more epochs, or until approx. cost minimum is reached:
    for training sample i:
        compute gradients and perform weight updates

B)

for one or more epochs, or until approx. cost minimum is reached:
    randomly shuffle samples in the training set
    for training sample i:
        compute gradients and perform weight updates

C)

for iterations t, or until approx. cost minimum is reached:
    draw random sample from the training set
    compute gradients and perform weight updates


回答2:

I'll try to give you some intuition over the problem...

Initially, updates were made in what you (correctly) call (Batch) Gradient Descent. This assures that each update in the weights is done in the "right" direction (Fig. 1): the one that minimizes the cost function.

With the growth of datasets size, and complexier computations in each step, Stochastic Gradient Descent came to be preferred in these cases. Here, updates to the weights are done as each sample is processed and, as such, subsequent calculations already use "improved" weights. Nonetheless, this very reason leads to it incurring in some misdirection in minimizing the error function (Fig. 2).

As such, in many situations it is preferred to use Mini-batch Gradient Descent, combining the best of both worlds: each update to the weights is done using a small batch of the data. This way, the direction of the updates is somewhat rectified in comparison with the stochastic updates, but is updated much more regularly than in the case of the (original) Gradient Descent.

[UPDATE] As requested, I present below the pseudocode for batch gradient descent in binary classification:

error = 0

for sample in data:
    prediction = neural_network.predict(sample)
    sample_error = evaluate_error(prediction, sample["label"]) # may be as simple as 
                                                # module(prediction - sample["label"])
    error += sample_error

neural_network.backpropagate_and_update(error)

(In the case of multi-class labeling, error represents an array of the error for each label.)

This code is run for a given number of iterations, or while the error is above a threshold. For stochastic gradient descent, the call to neural_network.backpropagate_and_update() is called inside the for cycle, with the sample error as argument.