Why do different batch sizes give different accuracies?

Posted 2020-08-09 04:32

Question:

I was using a Keras CNN to classify the MNIST dataset. I found that using different batch sizes gave different accuracies. Why is that?

Using Batch-size 1000 (Acc = 0.97600)

Using Batch-size 10 (Acc = 0.97599)

Although the difference is very small, why is there even a difference? EDIT: I have found that the difference is only due to precision issues and the two results are in fact equal.
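For reference, here is a minimal sketch of the kind of setup described above (the architecture, optimizer and number of epochs are assumptions for illustration, not the original code), training the same Keras CNN on MNIST with the two batch sizes:

    # Minimal sketch (assumed architecture, not the original code): train the
    # same Keras CNN on MNIST with two different batch sizes and compare accuracy.
    import tensorflow as tf
    from tensorflow.keras import layers, models

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train = x_train[..., None].astype("float32") / 255.0
    x_test = x_test[..., None].astype("float32") / 255.0

    def build_model():
        return models.Sequential([
            layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
            layers.MaxPooling2D(),
            layers.Flatten(),
            layers.Dense(128, activation="relu"),
            layers.Dense(10, activation="softmax"),
        ])

    for batch_size in (1000, 10):
        model = build_model()
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(x_train, y_train, epochs=5, batch_size=batch_size, verbose=0)
        _, acc = model.evaluate(x_test, y_test, verbose=0)
        print(f"batch_size={batch_size}: test accuracy = {acc:.5f}")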

Answer 1:

That is because of the effect of mini-batch gradient descent during training. You can find a good explanation here; I quote some notes from that link below:

Batch size is a slider on the learning process.

  1. Small values give a learning process that converges quickly at the cost of noise in the training process.
  2. Large values give a learning process that converges slowly with accurate estimates of the error gradient.

Another important note from that link is:

The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller.

These results come from this paper.
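To make the noise argument concrete, here is a toy sketch (a small NumPy linear-regression problem, not the MNIST model) showing that smaller mini-batches give noisier estimates of the full-batch gradient:

    # Toy illustration: compare mini-batch gradient estimates of different sizes
    # against the exact full-batch gradient of a mean-squared-error loss.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3000, 5))
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=3000)
    w = np.zeros(5)  # current parameters

    def grad(X_b, y_b, w):
        # Gradient of the MSE loss with respect to w on a batch.
        return 2 * X_b.T @ (X_b @ w - y_b) / len(y_b)

    full_grad = grad(X, y, w)
    for batch_size in (10, 100, 1000):
        errors = []
        for _ in range(200):
            idx = rng.choice(len(y), size=batch_size, replace=False)
            errors.append(np.linalg.norm(grad(X[idx], y[idx], w) - full_grad))
        print(f"batch_size={batch_size}: mean gradient error = {np.mean(errors):.4f}")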

EDIT

I should mention two more points here:

  1. Because of the inherent randomness in machine learning algorithms, you generally should not expect machine learning algorithms (such as deep learning algorithms) to produce the same results on different runs. You can find more details here. (A seed-setting sketch follows after these points.)
  2. On the other hand, your two results are very close and practically equal, so in your case we can say that, based on the reported results, the batch size has no effect on your network's results.
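For point 1, a minimal seed-setting sketch for Keras/TensorFlow; this reduces run-to-run variation, though full determinism may require extra settings (deterministic ops, single-threading, fixed hardware):

    # Fix the random seeds that commonly affect Keras/TensorFlow training runs.
    import random
    import numpy as np
    import tensorflow as tf

    SEED = 42
    random.seed(SEED)         # Python's built-in RNG
    np.random.seed(SEED)      # NumPy RNG (e.g. data shuffling)
    tf.random.set_seed(SEED)  # TensorFlow RNG (weight initialization, dropout, ...)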


Answer 2:

For the bigger batch size (1000):

The data is divided into chunks of 1000 samples each.

Suppose you have a dataset of 3000 samples; then 3 batches will be formed.

The optimizer then updates the network once per batch rather than once per sample, so 3 optimization steps take place per epoch.

For the smaller batch size (10):

Considering the same example, 300 batches will be formed, so 300 optimization steps take place per epoch.

A smaller batch size performs more optimization steps, and hence the model often generalises better and accuracy increases.
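The arithmetic above as a short sketch; Keras runs ceil(n_samples / batch_size) optimizer steps per epoch:

    # Number of weight updates per epoch for a dataset of 3000 samples.
    import math

    n_samples = 3000
    for batch_size in (1000, 10):
        steps_per_epoch = math.ceil(n_samples / batch_size)
        print(f"batch_size={batch_size}: {steps_per_epoch} optimizer updates per epoch")
    # batch_size=1000: 3 optimizer updates per epoch
    # batch_size=10: 300 optimizer updates per epoch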



Answer 3:

This is not specific to Keras. The batch size, together with the learning rate, is a critical hyper-parameter for training neural networks with mini-batch stochastic gradient descent (SGD); together they strongly affect the learning dynamics and thus the accuracy, the learning speed, etc.

In a nutshell, SGD optimizes the weights of a neural network by iteratively updating them in the (negative) direction of the gradient of the loss. In mini-batch SGD, the gradient is estimated at each iteration on a subset of the training data. It is a noisy estimate, which helps regularize the model, and therefore the batch size matters a lot. Besides, the learning rate determines how much the weights are updated at each iteration. Finally, although this may not be obvious, the learning rate and the batch size are related to each other. [paper]
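As a bare-bones illustration of that update rule (a toy NumPy example with assumed values, not anything Keras-specific): at each step a random mini-batch gives a noisy gradient estimate, and the weights take a learning-rate-sized step against it.

    # Mini-batch SGD on a toy linear-regression loss.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3000, 5))
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=3000)

    w = np.zeros(5)
    learning_rate = 0.05  # how far each update moves the weights
    batch_size = 32       # controls how noisy each gradient estimate is

    for step in range(1000):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size  # noisy gradient
        w -= learning_rate * grad  # step in the negative gradient direction

    print("final MSE:", np.mean((X @ w - y) ** 2))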



Answer 4:

I want to add two points:

1) With special treatment, it is possible to achieve similar performance with a very large batch size while speeding up the training process tremendously, for example Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (see the learning-rate scaling sketch after these points).

2) Regarding your MNIST example, I really don't suggest over-reading these numbers, because the difference is so subtle that it could be caused by noise. I bet that if you try models saved at a different epoch, you will see a different result.
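For point 1, the key recipe in that paper is the linear learning-rate scaling rule: when the mini-batch size is multiplied by k, multiply the learning rate by k. A tiny sketch (the base values are illustrative, not the paper's exact ResNet settings):

    # Linear scaling rule: scale the learning rate proportionally to the batch size.
    base_batch_size = 256
    base_lr = 0.1

    for batch_size in (256, 1024, 8192):
        scaled_lr = base_lr * batch_size / base_batch_size
        print(f"batch_size={batch_size}: learning rate = {scaled_lr}")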