Lucene documentation states that single instances of IndexSearcher and IndexWriter should be used for each index in the whole application, and across all threads. Also, writes to an index will not be visible until the index is re-opened.
So I'm trying to follow these guidelines in a multi-threaded setup (a few threads writing, multiple user threads searching). I don't want to re-open the index on every change; rather, I want to keep the searcher instance no older than a certain amount of time (say, 20 seconds).
A central component is responsible for opening index readers and writers, keeping the single instances, and synchronizing the threads. I keep track of the last time the IndexSearcher was accessed by any user thread, and the time it became dirty. If anyone needs to access it more than 20 seconds after the change, I want to close the searcher and re-open it.
The problem is that I can't be sure whether previous requests for the searcher (made by other threads) have finished yet, so I don't know when it is safe to close the IndexSearcher. If I close and re-open the single IndexSearcher instance that is shared among all threads, there might be a search going on concurrently in some other thread.
To make matters worse, here's what can theoretically happen: there can be multiple searches running at the same time, all the time (suppose you have thousands of users running searches on the same index). The single IndexSearcher instance may never become free, so it can never be closed. Ideally, I want to create another IndexSearcher and direct new requests to it, while the old one stays open and finishes the searches that were already requested. When the searches running on the old instance are complete, I want to close it.
What is the best way to synchronize multiple users of the IndexSearcher (or IndexWriter) when calling the close() method? Does Lucene provide any features or facilities for this, or must it be done entirely in user code (for example, by counting the threads using a searcher and incrementing / decrementing the count each time it is used)?
Are there any recommendations or ideas about the design described above?
Thankfully, recent versions (3.x or late 2.x) added a method that tells you whether there has been any writing since the searcher was opened: IndexReader.isCurrent() reports whether any changes have occurred since that reader was opened. So you will probably want to create a simple wrapper class that encapsulates both reading and writing; with some simple synchronization, that one class can manage all of this across all of the threads.
Here is roughly what I do:
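import java.io.IOException;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

public class ArchiveIndex {
    private IndexSearcher searcher;
    private final AtomicInteger activeSearches = new AtomicInteger(0);
    private IndexWriter writer;
    private final AtomicInteger activeWrites = new AtomicInteger(0);

    public List<Document> search( ... ) throws IOException {
        IndexSearcher current;
        synchronized( this ) {
            // Reopen only when the index has changed and nobody is using the old searcher.
            if( searcher != null && !searcher.getIndexReader().isCurrent() && activeSearches.get() == 0 ) {
                searcher.close();
                searcher = null;
            }
            if( searcher == null ) {
                searcher = new IndexSearcher(...);
            }
            // Count ourselves in while still holding the lock, so no other
            // thread can see zero active searches and close this searcher.
            current = searcher;
            activeSearches.incrementAndGet();
        }
        try {
            // do your searching with 'current' and return the results
        } finally {
            activeSearches.decrementAndGet();
        }
    }

    public void addDocuments( List<Document> docs ) throws IOException {
        synchronized( this ) {
            if( writer == null ) {
                writer = new IndexWriter(...);
            }
            // Register inside the lock so the writer cannot be closed under us.
            activeWrites.incrementAndGet();
        }
        try {
            // do your writes here.
        } finally {
            synchronized( this ) {
                // Last writer out closes the writer, which commits the changes.
                if( activeWrites.decrementAndGet() == 0 ) {
                    writer.close();
                    writer = null;
                }
            }
        }
    }
}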
So I have a single class that I use for both readers and writers. Notice that this class allows writing and reading at the same time, and multiple readers can search at the same time. The only synchronization is the quick check to see whether you need to reopen the searcher/writer. I didn't synchronize at the method level, which would allow only one reader/writer at a time and would be bad performance-wise. If there are active searches out there, you can't drop the searcher, so if lots of readers keep coming in, they simply search without seeing the changes. Once traffic thins out, the next lone search will reopen the dirty searcher. This might be fine for lower-volume sites where there will be pauses in traffic, but it can still cause starvation (i.e. you keep reading older and older results). You could add logic to stop and reinitialize if the time since the searcher was first noticed dirty is more than X, and otherwise stay lazy as it is now. That way you are guaranteed that searches will never be older than X.
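The "never older than X" rule mentioned above could be sketched as a small decision helper. This is only an illustration in plain Java: the class name StalenessPolicy, its methods, and the three decision values are hypothetical and not part of Lucene. The caller is assumed to pass in the current time and the active search count, and to call markDirty when isCurrent() first returns false.

```java
/** Hypothetical helper (not a Lucene class): decides when a shared searcher should be reopened. */
final class StalenessPolicy {
    enum Decision { KEEP, REOPEN_WHEN_IDLE, FORCE_REOPEN }

    private final long maxStaleMillis;
    private long dirtySince = -1; // -1 means "not known to be dirty"

    StalenessPolicy(long maxStaleMillis) {
        this.maxStaleMillis = maxStaleMillis;
    }

    /** Records the first moment the searcher was noticed to be out of date. */
    synchronized void markDirty(long nowMillis) {
        if (dirtySince < 0) {
            dirtySince = nowMillis;
        }
    }

    /** Call after a fresh searcher has been opened. */
    synchronized void markReopened() {
        dirtySince = -1;
    }

    /**
     * Lazy path: reopen when dirty and nobody is searching.
     * Forced path: reopen once the searcher has been dirty for maxStaleMillis,
     * even though searches are still active.
     */
    synchronized Decision decide(long nowMillis, int activeSearches) {
        if (dirtySince < 0) {
            return Decision.KEEP;
        }
        if (activeSearches == 0) {
            return Decision.REOPEN_WHEN_IDLE;
        }
        if (nowMillis - dirtySince >= maxStaleMillis) {
            return Decision.FORCE_REOPEN;
        }
        return Decision.KEEP;
    }
}
```

On FORCE_REOPEN the caller still has to deal with searches that are in flight, either by making new requests wait until the count drops to zero or by swapping in a new searcher and letting the old one drain.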
Writers can be handled in much the same way. I seem to remember closing the writer periodically so that the reader notices the index has changed (closing commits it). I didn't describe that very well, but it works much like searching: if there are active writers out there, you can't close the writer; if you're the last writer out the door, close the writer. You get the idea.
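The counting approach above can also be arranged so that a stale searcher is replaced immediately and closed by whichever thread finishes with it last, which avoids the starvation case. What follows is a minimal sketch in plain Java rather than Lucene code: RefCounted and SwappingManager are hypothetical names, and any java.io.Closeable stands in for the searcher.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical wrapper: a Closeable resource plus a reference count. */
final class RefCounted<T extends Closeable> {
    private final T resource;
    // Starts at 1: the reference held by the manager itself.
    private final AtomicInteger refs = new AtomicInteger(1);

    RefCounted(T resource) { this.resource = resource; }

    T get() { return resource; }

    /** Takes a reference unless the count has already reached zero (resource closed). */
    boolean tryAcquire() {
        while (true) {
            int n = refs.get();
            if (n == 0) return false;
            if (refs.compareAndSet(n, n + 1)) return true;
        }
    }

    /** Drops a reference; whoever drops the last one closes the resource. */
    void release() {
        if (refs.decrementAndGet() == 0) {
            try {
                resource.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}

/** Hypothetical manager: hands out the current resource and swaps in new ones. */
final class SwappingManager<T extends Closeable> {
    private volatile RefCounted<T> current;

    SwappingManager(T initial) { current = new RefCounted<T>(initial); }

    /** Returns the current resource with its count incremented. */
    RefCounted<T> acquire() {
        while (true) {
            RefCounted<T> c = current;
            if (c.tryAcquire()) return c;
            // c was swapped out and fully released in between; re-read and retry.
        }
    }

    /** Must be called exactly once for every acquire(), typically in a finally block. */
    void release(RefCounted<T> ref) { ref.release(); }

    /** New requests see 'fresh' at once; the old resource closes when its last user releases it. */
    synchronized void swap(T fresh) {
        RefCounted<T> old = current;
        current = new RefCounted<T>(fresh);
        old.release(); // drop the manager's own reference
    }
}
```

A search thread would call acquire(), run its query against get(), and call release() in a finally block, while the component that notices the index is dirty calls swap() with a newly opened searcher.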
There is a relatively new SearcherManager class which takes care of this problem and can hide the IndexReader from your code entirely. Though the API is possibly subject to change, I see this as greatly simplifying things.
A basic tutorial from Mike McCandless, a Lucene project committer: http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html
You would only want to create a new reader if the actual index has changed. What I did was keep a reference to the IndexReader and drop it after I've reindexed stuff. That's because I want to be able to search during indexing, and I believe that you can't open an IndexReader while writing (correct me if I'm wrong).
I let the application create a new reader if none is available, so it's a sort of cache that gets disposed of after each index commit.
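That create-on-demand, drop-after-commit cache could look roughly like the following. It is a sketch in plain Java with a hypothetical LazyHolder class; the opener function stands in for whatever code opens the reader, and deciding when to close a dropped reader that may still be in use is left to the caller.

```java
import java.util.function.Supplier;

/** Hypothetical cache: opens the value on first use and forgets it on invalidate(). */
final class LazyHolder<T> {
    private final Supplier<T> opener;
    private T cached;

    LazyHolder(Supplier<T> opener) {
        this.opener = opener;
    }

    /** Returns the cached value, opening a new one if none is available. */
    synchronized T get() {
        if (cached == null) {
            cached = opener.get();
        }
        return cached;
    }

    /** Call after each index commit; returns the dropped value (or null) so the caller can close it. */
    synchronized T invalidate() {
        T old = cached;
        cached = null;
        return old;
    }
}
```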
If you need real-time indexing capabilities (searching among the entities indexed so far during an indexing operation), you can grab an IndexReader from the current IndexWriter using the getReader() method.