Lucene.NET - checking if document exists in index

Posted 2019-08-20 02:09

I have the following code, using Lucene.NET V4, to check if a file exists in my index.

bool exists = false;
IndexReader reader = IndexReader.Open(Lucene.Net.Store.FSDirectory.Open(lucenePath), false);
Term term = new Term("filepath", "\\myFile.PDF");
TermDocs docs = reader.TermDocs(term);
if (docs.Next())
{
   exists = true;
}

The file myFile.PDF definitely exists in the index, but exists always comes back as false. When I inspect docs in the debugger, its Doc and Freq properties state that they "threw an exception of type 'System.NullReferenceException'".

2 answers
叛逆
#2 · 2019-08-20 02:38

You may have analyzed the field "filepath" during indexing with an analyzer that tokenizes or otherwise changes the content. For example, the StandardAnalyzer tokenizes, lowercases, and removes stop words if any are specified.

If you only need to query with the exact filepath, as in your example, use the KeywordAnalyzer for this field during indexing.
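A minimal sketch of that indexing setup, assuming the Lucene.NET 3.0.x API that the question's code appears to use. The class name ExactPathIndex, its method names, the in-memory RAMDirectory, and the choice of StandardAnalyzer as the default for other fields are illustrative assumptions, not part of the original code.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

// Hypothetical helper: indexes "filepath" as one untouched term per document.
public static class ExactPathIndex
{
    public static Directory BuildIndex(params string[] paths)
    {
        var dir = new RAMDirectory(); // assumption: in-memory index for the demo

        // StandardAnalyzer for every other field, KeywordAnalyzer for "filepath",
        // so the whole path is kept as a single, case-preserved term.
        var analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
        analyzer.AddAnalyzer("filepath", new KeywordAnalyzer());

        using (var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (var path in paths)
            {
                var doc = new Document();
                doc.Add(new Field("filepath", path, Field.Store.YES, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }
        }
        return dir;
    }

    public static bool Exists(Directory dir, string path)
    {
        using (var reader = IndexReader.Open(dir, true)) // read-only reader
        {
            // DocFreq counts documents containing the exact term; it may still
            // include deleted documents until their segments are merged.
            return reader.DocFreq(new Term("filepath", path)) > 0;
        }
    }
}
```

With the field indexed this way, the term lookup from the question should match only when the path is identical, including case. Indexing the field with Field.Index.NOT_ANALYZED is an alternative that bypasses the analyzer for that field altogether.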

If you can't re-index at the moment, you need to find out which analyzer was used during indexing and use it to create your query. You have two options:

  1. Use a query parser with the right analyzer and parse the query filepath:\\myFile.PDF. If the resulting query is a TermQuery, you can use its term as you did in your example. Otherwise, perform a search with the query.
  2. Use the Analyzer directly to create the terms from the TokenStream object. Again, if there is only one term, proceed as you did; if there are multiple terms, create a phrase query.
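The second option could look roughly like the sketch below, again assuming the Lucene.NET 3.0.x API. AnalyzedLookup and its method names are hypothetical, and the analyzer passed in must be whichever one was actually used when the index was built.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Hypothetical helper: runs the query text through the index-time analyzer.
public static class AnalyzedLookup
{
    // Returns the terms the analyzer produces for the given text.
    public static List<string> Analyze(Analyzer analyzer, string field, string text)
    {
        var terms = new List<string>();
        TokenStream stream = analyzer.TokenStream(field, new StringReader(text));
        ITermAttribute termAttr = stream.AddAttribute<ITermAttribute>();
        stream.Reset();
        while (stream.IncrementToken())
        {
            terms.Add(termAttr.Term);
        }
        stream.End();
        stream.Dispose();
        return terms;
    }

    // One term: a TermQuery (its term can also be used for a direct lookup).
    // Several terms: a PhraseQuery over the terms in order. Position gaps
    // left by removed stop words are ignored in this sketch.
    public static Query BuildQuery(Analyzer analyzer, string field, string text)
    {
        List<string> terms = Analyze(analyzer, field, text);
        if (terms.Count == 1)
        {
            return new TermQuery(new Term(field, terms[0]));
        }
        var phrase = new PhraseQuery();
        foreach (string t in terms)
        {
            phrase.Add(new Term(field, t));
        }
        return phrase;
    }
}
```

Printing the result of Analyze for the stored path is also a quick way to see which terms actually ended up in the index for a given analyzer.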
老娘就宠你
#3 · 2019-08-20 02:41

First of all, it is good practice to reuse the same IndexReader instance if you don't need to account for deleted documents: it performs better, and it is thread-safe, so you can keep it in a static read-only field. (I can see that you're passing false for the readOnly parameter, so if that is intentional, just ignore this paragraph.)

As for your case, are you tokenizing the filepath field values? If you are (e.g. by using StandardAnalyzer when indexing/searching), you will probably have trouble finding values such as \myFile.PDF: with the default tokenizer, the value is going to be split into myFile and PDF (I'm not sure what happens to the leading backslash).

Hope this helps.
