Lucene 4.0 statistics [duplicate]

Posted 2019-09-21 08:53

Question:

This question already has an answer here:

  • document length in lucene 4.0 1 answer

Although this is the second time I'm posting this question (the first post is here, but it got no answer, or only a partial one), I've been struggling with this issue and am lost in the Lucene API...

What I'm interested in is getting the document length from Lucene. When I use searcher.explain (with BM25), I can see that this value exists; I only need to fetch it.

I would highly appreciate an example. As I'm new to Lucene, just a pointer to the API won't help.

One naive way to do it is to store the length in a separate field, computed with string.length() in Java, and retrieve it at query time. However, this feature already exists (otherwise BM25 wouldn't work), so I don't want to store something redundantly.

I would highly appreciate a more detailed explanation of how to achieve this with Lucene 4.0. If you're not able to provide an answer, please do not reply just for the sake of replying (others then skip my post, thinking it is solved), and please don't just send me a pointer to the API (e.g. "See Similarity.computeNorm" by Robert Muir), because that won't help me. I need more details: how do I use FieldInvertState or Similarity.computeNorm? At query time or at index time? A small fragment of code would be helpful. Please consider that I'm not an expert here, otherwise I wouldn't be asking.

Thanks in advance.

Answer 1:

Yes, the newer the Lucene version you look at, the more daunting its complexity. Sometimes it helps to read the docs on an earlier version to see the basic principles more clearly.

Now to your case... Similarity is a Strategy-style object that you assign to the whole indexing process (IndexWriterConfig.setSimilarity). Its methods are called at index time to compute various pieces of information about each Document, and each of its Fields, as it is added to the index. So what Robert is suggesting here is to write your own Similarity subclass (take the docs' advice and don't subclass Similarity directly, but rather one of the existing implementations, like DefaultSimilarity). Override the computeNorm method to produce the number that you want for the passed-in field. By default Lucene already computes that norm so that it tones down long fields, so I guess you have something more specific than that in mind.
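To make the "tones down long fields" part concrete, here is a minimal plain-Java sketch of the arithmetic behind the default length norm, which is the field boost divided by the square root of the number of terms in the field. It deliberately does not use the Lucene API: LengthNormSketch, lengthNorm and estimateLength are hypothetical names made up for illustration, and the sketch assumes the boost is known (1.0 here). In a real index the norm is typically compressed to a single byte, so a length recovered this way would only be approximate.

```java
// Plain-Java illustration only; none of these names are Lucene classes.
final class LengthNormSketch {

    // Default-style length norm: boost / sqrt(number of terms in the field).
    // Longer fields get a smaller norm, which is how they are "toned down".
    static float lengthNorm(int numTerms, float boost) {
        if (numTerms <= 0) {
            throw new IllegalArgumentException("numTerms must be positive");
        }
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }

    // Inverts the formula above to estimate the term count from a norm.
    // Assumes the boost used at index time is known.
    static int estimateLength(float norm, float boost) {
        double ratio = boost / norm;
        return (int) Math.round(ratio * ratio);
    }

    public static void main(String[] args) {
        for (int numTerms : new int[] {1, 4, 100, 2500}) {
            float norm = lengthNorm(numTerms, 1.0f);
            System.out.println(numTerms + " terms, norm " + norm
                    + ", estimated length " + estimateLength(norm, 1.0f));
        }
    }
}
```

If you want the exact length rather than an estimate, the idea in the answer above is for your own Similarity subclass to store the term count itself as the norm instead of this inverse square root.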

I would warmly suggest getting a hold of Lucene In Action if you want to get serious about leveraging Lucene.



Tags: java lucene