Solr date field tdate vs date?

2019-03-09 16:43发布

问题:

So I have a question about Solr's field date types which is pretty straight forward: what's the difference between a 'date' field and a 'tdate' one?

The schema .xml claims that 'For faster range queries, consider the tdate type' and 'A Trie based date field for faster date range queries and date faceting. ' Fair enough... but what's the precisionStep="6" all about? should i change this? does it change the way i would create the query in case I use the tdate? What's the real advantage or what does Solr do that makes it better?

P.S went through google, Solr manual, solr wiki and the java docs without any luck so I'd appreciate a kind and explanatory answer :)... Also checked: http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/ http://web.archiveorange.com/archive/v/AAfXfqRYyLnDFtskmLRi

回答1:

Good question :-) ! I read a good answer somewhere, sadly cannot find this again.

Basically trie ranges are faster. Here is one explanation. With precisionStep you configure how much your index can grow to get the performance benefits. To quote from the link you are referring:

"More importantly, it is not dependent on the index size, but instead the precision chosen."

and

"the only drawbacks of TrieRange are a little bit larger index sizes, because of the additional terms indexed"



回答2:

Trie fields make range queries faster by precomputing certain range results and storing them as a single record in the index. For clarity, my example will use integers in base ten. The same concept applies to all trie types. This includes dates, since a date can be represented as the number of seconds since, say, 1970.

Let's say we index the number 12345678. We can tokenize this into the following tokens.

12345678
123456xx
1234xxxx
12xxxxxx

The 12345678 token represents the actual integer value. The tokens with the x digits represent ranges. 123456xx represents the range 12345600 to 12345699, and matches all the documents that contain a token in that range.

Notice how in each token on the list has successively more x digits. This is controlled by the precision step. In my example, you could say that I was using a precision step of 2, since I trim 2 digits to create each extra token. If I were to use a precision step of 3, I would get these tokens.

12345678
12345xxx
12xxxxxx

A precision step of 4:

12345678
1234xxxx

A precision step of 1:

12345678
1234567x
123456xx
12345xxx
1234xxxx
123xxxxx
12xxxxxx
1xxxxxxx

It's easy to see how a smaller precision step results in more tokens and increases the size of the index. However, it also speeds up range queries.

Without the trie field, if I wanted to query a range from 1250 to 1275, Lucene would have to fetch 25 entries (1250, 1251, 1252, ..., 1275) and combine search results. With a trie field (and precision step of 1), we could get away with fetching 8 entries (125x, 126x, 1270, 1271, 1272, 1273, 1274, 1275), because 125x is a precomputed aggregation of 1250 - 1259. If I were to use a precision step larger than 1, the query would go back to fetching all 25 individual entries.

Note: In reality, the precision step refers to the number of bits trimmed for each token. If you were to write your numbers in hexadecimal, a precision step of 4 would trim one hex digit for each token. A precision step of 8 would trim two hex digits.



回答3:

Your best bet is to just look at the source code. Some of the things for Solr aren't well documented and the fastest way to get a trustworthy answer is to simply look at the code. If you haven't been in the code yet, that too is to your benefit. At least in the long run.

Here's a link to the TrieTokenizerFactory.

http://www.jarvana.com/jarvana/view/org/apache/solr/solr-core/1.4.1/solr-core-1.4.1-sources.jar!/org/apache/solr/analysis/TrieTokenizerFactory.java?format=ok

The javadoc in the class at least hints at the purpose of the precisionStep. You could dig futher.

EDIT: I dug a bit further for you. It's passed off directly to Lucene's NumericTokenStream class, which will used the value during parsing the token stream. Probably worth closer examination. It seems to deal with granularity and is probably a tradeoff between size in the index and speed.



标签: date solr