Lucene fieldNorm discrepancy between Similarity calculation and query explain

Posted 2019-01-23 21:22

Question:

I'm trying to understand how fieldNorm is calculated (at index time) and then used (and apparently re-calculated) at query time.

In all the examples I'm using the StandardAnalyzer with no stop words.

Debugging the DefaultSimilarity's computeNorm method while indexing stuff, I've noticed that for two particular documents it returns:

  • 0.5 for document A (which has 4 tokens in its field)
  • 0.70710677 for document B (which has 2 tokens in its field)

It does this by using the formula:

state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

where boost is always 1
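With boost fixed at 1, this matches the values above: 1/sqrt(4) = 0.5 for document A's 4 tokens, and 1/sqrt(2) ≈ 0.70710677 for document B's 2 tokens.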

Afterwards, when I query for these documents I see that in the query explain I get

  • 0.5 = fieldNorm(field=titre, doc=0) for document A
  • 0.625 = fieldNorm(field=titre, doc=1) for document B

This is already strange (to me, I'm sure it's me who's missing something). Why don't I get the same values for field norm as those calculated at index time? Is this the "query normalization" thing in action? If so, how does it work?

This, however, is more or less OK, since the two query-time fieldNorms preserve the ordering of those calculated at index time (the shorter field, i.e. the one with fewer tokens, has the higher fieldNorm in both cases).

I've then made my own Similarity class in which I've implemented the computeNorm method like so:

@Override
public float computeNorm(String pField, FieldInvertState state) {
    // note: boost + 1/sqrt(length), rather than DefaultSimilarity's boost * 1/sqrt(length)
    float norm = (float) (state.getBoost() + (1.0d / Math.sqrt(state.getLength())));
    return norm;
}
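A minimal sketch of how such a class might be registered so that indexing actually picks it up, assuming the Lucene 3.x API implied by the computeNorm(String, FieldInvertState) signature (the class name MySimilarity is hypothetical):

// Hypothetical: make the custom similarity (extending org.apache.lucene.search.Similarity)
// the default used by IndexWriter and IndexSearcher alike (Lucene 3.x).
Similarity.setDefault(new MySimilarity());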

At index time I now get:

  • 1.5 for document A (which has 4 tokens in its field)
  • 1.7071068 for document B (which has 2 tokens in its field)

Now, however, when I query for these documents, I can see that they both have the same fieldNorm as reported by the explain function:

  • 1.5 = fieldNorm(field=titre, doc=0) for document A
  • 1.5 = fieldNorm(field=titre, doc=1) for document B

To me this is now really strange: if I use an apparently good similarity to calculate the fieldNorm at index time, one that gives me distinct values reflecting the number of tokens, how come at query time all of this is lost and the explain says both documents have the same field norm?

So my questions are:

  • why does the index-time fieldNorm, as reported by the Similarity's computeNorm method, not remain the same as the fieldNorm reported by query explain?
  • why do two different fieldNorm values obtained at index time (via the Similarity's computeNorm) come back as identical fieldNorm values at query time?

== UPDATE

Ok, I've found something in Lucene's docs which clarifies some of my question, but not all of it:

However the resulted norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.

How much precision loss is there? Is there a minimum gap we should put between different values so that they remain different even after the precision-loss re-calculations?

Answer 1:

The documentation of encodeNormValue describes the encoding step (which is where the precision is lost), and particularly the final representation of the value:

The encoding uses a three-bit mantissa, a five-bit exponent, and the zero-exponent point at 15, thus representing values from around 7x10^9 to 2x10^-9 with about one significant decimal digit of accuracy. Zero is also represented. Negative numbers are rounded up to zero. Values too large to represent are rounded down to the largest representable value. Positive values too small to represent are rounded up to the smallest positive representable value.

The most relevant piece to understand is that the mantissa is only 3 bits, which means the precision is around one significant decimal digit.
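To see what this means for the exact values from the question, the norms can be run through the encode/decode pair. Below is a minimal sketch, assuming a Lucene 3.x/4.x-era version in which the single-byte norm codec is exposed as org.apache.lucene.util.SmallFloat.floatToByte315 / byte315ToFloat (the methods that encodeNormValue/decodeNormValue delegate to); NormRoundTrip is just an illustrative class name:

import org.apache.lucene.util.SmallFloat;

// Sketch: round-trip the index-time norms from the question through the
// single-byte norm encoding (assumes a Lucene version where SmallFloat's
// floatToByte315 / byte315ToFloat implement the 3-bit-mantissa codec).
public class NormRoundTrip {
    public static void main(String[] args) {
        float[] norms = { 0.5f, 0.70710677f, 1.5f, 1.7071068f };
        for (float norm : norms) {
            byte encoded = SmallFloat.floatToByte315(norm);
            float decoded = SmallFloat.byte315ToFloat(encoded);
            System.out.println(norm + " -> " + decoded);
        }
        // Decoded values match the explain output from the question:
        // 0.5        -> 0.5
        // 0.70710677 -> 0.625
        // 1.5        -> 1.5
        // 1.7071068  -> 1.5   (same byte as 1.5, hence the identical fieldNorms)
    }
}

Because only the top three bits of the significand survive (the implicit leading 1 plus two stored bits), there are just four representable norm values between consecutive powers of two (0.5, 0.625, 0.75, 0.875, then 1.0, 1.25, 1.5, 1.75, and so on), and the encoding effectively truncates downward. So, as to the minimum-gap question: roughly speaking, two norms need to differ by more than one such step, i.e. about 12-25% of their magnitude, to be sure of staying distinct. That is why 1.5 and 1.7071068 (difference ≈ 0.21, step 0.25 in that range) collapse to the same byte, while 0.5 and 0.70710677 (step 0.125 in that range) do not.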

An important note on the rationale comes a few sentences after where your quote ended, where the Lucene docs say:

The rationale supporting such lossy compression of norm values is that given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.