I read from somewhere that if you choose a batch size that is a power 2, training will be faster. What is this rule? Is this applicable to other applications? Can you provide a reference paper?
I've heard this, too. Here's a white paper about training on CIFAR-10 where some Intel researchers make the claim:
(See: https://software.intel.com/en-us/articles/cifar-10-classification-using-intel-optimization-for-tensorflow.)
However, it's unclear just how big the advantage may be because the authors don't provide any training duration data :/
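If you want to check the claim on your own hardware, a rough way is to time a few epochs on synthetic data with batch sizes just below, at, and just above a power of 2. This is only a sketch; the model, data shapes, and batch sizes here are arbitrary choices of mine, not taken from the Intel paper.

```python
# Rough timing sketch (not from the Intel paper): compare per-epoch wall-clock
# time for batch sizes around a power of 2 on synthetic CIFAR-10-shaped data.
import time
import numpy as np
import tensorflow as tf

x = np.random.rand(10_000, 32, 32, 3).astype("float32")
y = np.random.randint(0, 10, size=(10_000,))

def make_model():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 3)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

for batch_size in (96, 128, 160):  # below / at / above a power of 2
    model = make_model()
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)  # warm-up epoch
    start = time.perf_counter()
    model.fit(x, y, batch_size=batch_size, epochs=3, verbose=0)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {elapsed / 3:.2f} s per epoch")
```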
Since the number of PP (parallel processors) is often a power of 2, using a number of C (computations) different from a power of 2 leads to poor performance. You can see the mapping of the C onto the PP as a pile of slices of size the number of PP. Say you've got 16 PP. You can map 16 C onto them: 1 C is mapped onto 1 PP. You can map 32 C onto them: 2 slices of 16 C, and 1 PP will be responsible for 2 C. This is due to the SIMD paradigm used by GPUs. This is often called data parallelism: all the PP do the same thing at the same time, but on different data.
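To make the slice picture concrete, here is a small back-of-the-envelope sketch (my own illustration, with PP = 16 chosen arbitrarily) that computes how many slices are needed and how many processors sit idle in the last slice for a few values of C:

```python
# Illustration of the "pile of slices" mapping described above (my own sketch).
# PP = number of parallel processors, C = number of computations to map.
import math

PP = 16  # assumed number of parallel processors

for C in (16, 24, 32, 48, 100, 128):
    slices = math.ceil(C / PP)       # how many passes over the PP are needed
    idle_in_last = slices * PP - C   # processors with nothing to do in the last slice
    utilization = C / (slices * PP)
    print(f"C={C:>3}: {slices} slice(s), {idle_in_last:>2} idle PP in last slice, "
          f"utilization={utilization:.0%}")
```

Any C that is not a multiple of PP leaves part of the last slice idle, which is the poor utilization the paragraph above describes.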
Algorithmically speaking, using larger mini-batches allows you to reduce the variance of your stochastic gradient updates (by taking the average of the gradients in the mini-batch), and this in turn allows you to take bigger step sizes, which means the optimization algorithm will make progress faster.
However, the amount of work done (in terms of the number of gradient computations) to reach a certain accuracy in the objective will be the same: with a mini-batch size of n, the variance of the update direction is reduced by a factor of n, so the theory allows you to take step sizes that are n times larger, so that a single step will take you roughly to the same accuracy as n steps of SGD with a mini-batch size of 1.
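A quick way to see the variance argument numerically (a toy illustration of mine with synthetic noisy "gradients", not anything from the question): averaging n noisy gradient samples shrinks the variance of the update direction by roughly a factor of n.

```python
# Toy illustration of the variance-reduction argument with synthetic gradients.
import numpy as np

rng = np.random.default_rng(0)
true_grad = 1.0
noise_std = 2.0
num_trials = 100_000

for n in (1, 8, 64):
    # Each "mini-batch gradient" is the mean of n noisy per-example gradients.
    samples = true_grad + noise_std * rng.standard_normal((num_trials, n))
    minibatch_grads = samples.mean(axis=1)
    print(f"mini-batch size {n:>2}: variance of update ~ {minibatch_grads.var():.4f} "
          f"(theory: {noise_std**2 / n:.4f})")
```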
As for TensorFlow, I found no evidence for this claim, and it's a question that has been closed on GitHub: https://github.com/tensorflow/tensorflow/issues/4132
Note that resizing images to a power of two makes sense (because pooling is generally done in 2×2 windows), but that's a different thing altogether.
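On that last point, here is a small sketch (my own example, not from the thread) showing why power-of-two spatial sizes divide cleanly under repeated 2×2 pooling, while other sizes eventually hit odd dimensions that a stride-2 pool truncates:

```python
# Repeated 2x2, stride-2 pooling halves the spatial size; powers of two halve
# cleanly, other sizes eventually hit odd dimensions and lose a row/column.
def pool_sizes(size, steps=4):
    sizes = [size]
    for _ in range(steps):
        size = size // 2  # output size of a 2x2, stride-2 pooling layer
        sizes.append(size)
    return sizes

for s in (64, 96, 100):
    print(f"{s}: {' -> '.join(map(str, pool_sizes(s)))}")
```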