Choosing a Solr/Lucene commit strategy

Posted 2019-04-09 21:46

Question:

I have 120k db records to commit into a Solr index.

My question is: should I commit after submitting every 10k records, or commit only once, after submitting all 120k records?

Is there any difference between these two options?

Answer 1:

Use Solr's default auto-commit values, which I believe are quite reasonable. If not, you can adjust them to suit your needs:

<!-- autocommit pending docs if certain criteria are met.  Future versions may expand the available
 criteria -->
<autoCommit>
  <maxDocs>10000</maxDocs> <!-- maximum uncommitted docs before an autocommit is triggered -->
  <maxTime>50000</maxTime> <!-- maximum time (in ms) after adding a doc before an autocommit is triggered -->
</autoCommit>

This means that Solr will commit when more than 10,000 documents are waiting to be committed, or when 50 seconds have passed since a document was added, whichever comes first.
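
To make the two triggers concrete, here is a minimal Python sketch of the rule described above. It is a simplified model, not Solr's actual implementation: the function name `should_autocommit` is hypothetical, and it assumes the timer is measured from the oldest uncommitted document.

```python
def should_autocommit(pending_docs, ms_since_oldest_pending,
                      max_docs=10000, max_time_ms=50000):
    """Simplified model of the <autoCommit> rule: either threshold triggers a commit.

    Assumption: the time limit is measured from the oldest uncommitted document.
    """
    if pending_docs == 0:
        return False  # nothing is waiting, so there is nothing to commit
    too_many_docs = pending_docs > max_docs            # "more than 10000 docs waiting"
    too_old = ms_since_oldest_pending >= max_time_ms   # "50s have passed"
    return too_many_docs or too_old
```

With the values from the config, a backlog of 10,001 documents triggers a commit immediately, while a single document triggers one only after 50,000 ms.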



Answer 2:

According to the Lucene 2.9.3 documentation, commit() makes the added documents visible to readers and writes all pending added and deleted documents to the index on disk. It is a costly operation.

So if you want some documents to be searchable while others are still being added, or want assurance that a failure will cost you no more than 10,000 added documents, you need to commit every 10,000 records.

On the other hand, if you would rather save the time spent on extra commits, and are not afraid to lose documents if the machine fails, commit only once, after all of the documents have been added.
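
The trade-off can be counted out for the 120k records in the question. The sketch below does not talk to Solr at all; `commit_points` is a hypothetical helper that only lists the document counts at which an explicit commit would be issued under each strategy, assuming records are submitted in batches of 10,000.

```python
def commit_points(total_docs, batch_size, commit_every_batch):
    """Return the running document counts at which an explicit commit is issued."""
    points = []
    sent = 0
    while sent < total_docs:
        sent = min(sent + batch_size, total_docs)  # submit the next batch
        # Commit after every batch, or only once the final batch has been sent.
        if commit_every_batch or sent == total_docs:
            points.append(sent)
    return points


per_batch = commit_points(120_000, 10_000, commit_every_batch=True)
single = commit_points(120_000, 10_000, commit_every_batch=False)
print(len(per_batch), len(single))  # prints "12 1"
```

Committing per batch pays for 12 commits but never leaves more than 10,000 documents uncommitted; committing once pays for a single commit but leaves up to 120,000 documents at risk until the end.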



Answer 3:

The recommended way is to use commitWithin instead of <autoCommit>.

If you are using SolrJ, almost all update methods accept a commitWithin parameter for this feature.
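
Outside SolrJ, commitWithin can also be passed as a request parameter on an HTTP update. The following is a minimal sketch using only the Python standard library; it builds the request without sending it. The host, the core name `mycore`, the example document, and the helper name `build_update_request` are all assumptions, and it presumes a Solr version whose `/update` handler accepts JSON.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_update_request(base_url, docs, commit_within_ms):
    """Build (but do not send) a JSON update request carrying commitWithin."""
    query = urlencode({"commitWithin": commit_within_ms})  # value in milliseconds
    body = json.dumps(docs).encode("utf-8")                # a JSON array of documents
    return Request(
        f"{base_url}/update?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical core URL and document, for illustration only.
req = build_update_request(
    "http://localhost:8983/solr/mycore",
    [{"id": "1", "title": "example"}],
    10000,
)
print(req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen`; Solr would then be asked to commit the submitted documents within 10 seconds, without the client issuing an explicit commit.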