I am working on a web app using Symfony 2.7 and Doctrine. A Symfony command is used to perform an update of a large number of entities.
I followed the Doctrine guidelines and use $entityManager->flush()
not for every single entity.
This is die Doctrine example code:
<?php
$batchSize = 20;
for ($i = 1; $i <= 10000; ++$i) {
$user = new CmsUser;
$user->setStatus('user');
$user->setUsername('user' . $i);
$user->setName('Mr.Smith-' . $i);
$em->persist($user);
if (($i % $batchSize) === 0) {
$em->flush();
}
}
$em->flush(); //Persist objects that did not make up an entire batch
The guidelines say:
You may need to experiment with the batch size to find the size that works best for you. Larger batch sizes mean more prepared statement reuse internally but also mean more work during flush.
So I did try different batch size. The larger the batch size, the faster the command completes its task.
Thus the question is: What are the downsides of large batch sizes? Why not use $entityManager->flush()
only once, after all entities have been updated
The docu just says, that larger batch sizes "mean more work during flush". But why/when could this be a problem?
The only downside I can see are Exceptions
during the update: If the script stops before the saved changed where flushed, the changes are lost. Is this the only limitation?
Large batch sizes may use a lot of memory if you create for examples 10,000 entities. If you don't save the entities in batchs, they will accumulate in memory and if the program reach the memory limit it may crash the whole script.
It's possible, but storing 10,000 entities in the memory before calling
flush()
once will use more memory than saving entities 100 by 100. It may also take more time.If you don't have any performance issue with biggest batch sizes, it's probably because your data is not big enough to fill the memory or disrupt PHP's memory management.
So the size of the batch depend of multiple factors, mostly memory usage vs. time. If the script consume too much RAM, the size of the batch has to be lowered. But using really small batches may take more time than bigger batches. So you have to run multiple tests in order to adjust this size so that it uses most of the available memory but not more.
I don't have any proofs but I remember having worked with thousands of entities. When I used only one
flush()
, I saw that the progress bar was getting slower, it looked like my program was getting slower as I added more and more entities in the memory.If the flush takes too much time, you might exceed the maximum execution time of the server, and lose the connection.
From my experience, 100 entities per batch worked great. Depending on the Entity, 200 was too much and other Entity, I could do 1000.
To properly insert in batch, you will need the command :
after each of your flushes. The reason is the Doctrine does not free the objects it's flushing into the DB. This means that if you don't "clear" them, the memory consumption will keep on increasing until you bust your PHP Memory Limit and crash your operation.
I would also recommend against increasing PHP Memory Limit to higher values. If you do, you risk creating huge lag on your server which could increase the number of connections to your server and then crash it.
It is also recommended to process batch operations outside of the Web Server upload form page. So save the data in a Blob and then process it later with a Cronjob task that will process your batch processing at the desired time (outside of Web Server's peak usage time).
As suggested in Doctrine documentation, ORM is not the best tool to use with batches.
Unless your entity needs some specific logic (like listeners), avoid ORM and use DBAL directly.