Solr indexing issue (out of memory) - looking for a solution

Posted 2019-08-03 08:12

Question:

I have a large index of 50 million docs, all running on the same machine (no sharding). I don't have an ID that would let me update only the documents I want, so for each update I must delete the whole index, index everything from scratch, and commit only at the end when I'm done indexing.

My problem is that every few indexing runs, my Solr crashes with an out-of-memory exception. I am running with 12.5 GB of memory. From what I understand, everything is kept in memory until the commit, so I'm holding 100M docs in memory instead of 50M. Am I right? But I cannot commit while I'm indexing, because I deleted all the docs at the beginning, so I would be serving a partial index, which is bad.
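For reference, here is a minimal SolrJ sketch of the flow described above (the URL, core name, field names, and batch size are illustrative placeholders, not my actual code):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.util.ArrayList;
import java.util.List;

public class FullReindex {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and core name.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build()) {

            // Wipe the existing index; nothing becomes visible to searchers
            // until the final hard commit below.
            solr.deleteByQuery("*:*");

            List<SolrInputDocument> batch = new ArrayList<>();
            for (long i = 0; i < 50_000_000L; i++) {        // ~50M docs
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Long.toString(i));       // illustrative fields
                doc.addField("text_t", "document body " + i);
                batch.add(doc);

                if (batch.size() == 10_000) {               // send in batches
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }

            // Single hard commit at the very end of the run.
            solr.commit();
        }
    }
}
```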

Are there any known solutions for this? Can sharding solve it, or will I still have the same problem? Is there a flag that would let me make soft commits without changing the visible index until the hard commit?

Answer 1:

You can use master-slave replication. Dedicate one machine to do your indexing (the master Solr), and once it's finished, tell the slave to replicate the index from the master machine. The slave will download the new index and will delete the old index only if the download is successful, so it's quite safe.

http://wiki.apache.org/solr/SolrReplication
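A sketch of what that setup looks like in solrconfig.xml on each node, based on the legacy master/slave replication described on that wiki page (hostnames, core name, poll interval, and file list are placeholders):

```xml
<!-- master (indexing) node: solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- publish a new index version only after a hard commit -->
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- slave (search) node: solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- point at the master's /replication handler; hostname and core are placeholders -->
    <str name="masterUrl">http://indexing-host:8983/solr/mycore/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```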

Another solution that avoids the replication setup is to use a reverse proxy: put nginx or something similar in front of your Solr. Use one machine for indexing the new data and the other for searching, and have the reverse proxy always point at the one that is not currently doing any indexing.
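A minimal nginx sketch of that idea, assuming two Solr hosts with placeholder names; you swap the active upstream entry whenever the roles flip:

```nginx
# /etc/nginx/conf.d/solr.conf -- hostnames are placeholders
upstream solr_search {
    server solr-a:8983;      # currently serving searches
    # server solr-b:8983;    # currently re-indexing; swap when it finishes
}

server {
    listen 80;

    location /solr/ {
        proxy_pass http://solr_search;
        proxy_set_header Host $host;
    }
}
```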

If you do one of them, then you can just commit as often as you want.

And because it's generally a bad idea to do indexing and searching on the same machine, I would prefer the master-slave solution (not to mention you have 50M docs).



Answer 2:

An out-of-memory error can be solved by giving the JVM of your container more memory; it has nothing to do with your cache. Use better garbage-collection options, because the source of the error is your JVM memory filling up. Increase the number of threads, because if the thread limit for a process is reached, a new process is spawned (which has the same number of threads and the same memory allocation as the prior one).
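For example, heap size and GC options are usually adjusted in solr.in.sh; the values below are purely illustrative, not recommendations:

```sh
# solr.in.sh -- illustrative values only
SOLR_HEAP="14g"                                     # give the Solr JVM a larger heap

# Override the default GC settings, e.g. to use G1 with a pause-time target
GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250"
```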

Please also write about any CPU spikes, and about any other type of caching mechanism you are using.

One other thing you can try is setting all cache autowarm counts to 0; it would speed up commit time.
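In solrconfig.xml that looks something like this (cache classes and sizes are placeholders; only autowarmCount="0" is the relevant part):

```xml
<!-- solrconfig.xml, <query> section: disable autowarming -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
```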

Regards,

Rajat