Is there a known formula I can use to estimate the size of a new Lucene index? I know how many fields I want to index, the size of each field, and how many items will be indexed. Once Lucene has processed these, how does that translate into bytes?
I think it also has to do with the frequency of each term (i.e. an index of 10,000 copies of the same terms should be much smaller than an index of 10,000 wholly unique terms).
Also, there's probably a small dependency on whether you're using Term Vectors or not, and certainly whether you're storing fields or not. Can you provide more details? Can you analyze the term frequency of your source data?
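To make the term-frequency point concrete, here is a toy sketch in Python. It is not Lucene and does not produce byte counts; it only builds a simple in-memory inverted index and counts distinct terms (a rough stand-in for the term dictionary) and term/document pairs (a rough stand-in for postings). The function name `index_stats` and both sample corpora are made up for illustration.

```python
from collections import defaultdict

def index_stats(docs):
    """Return (unique_terms, postings) for a toy inverted index.

    unique_terms approximates term-dictionary entries; postings counts
    (term, document) pairs. Real Lucene compresses both, so these are
    relative indicators only, not sizes in bytes.
    """
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Whitespace split stands in for a real Analyzer (an assumption).
        for term in text.lower().split():
            postings[term].add(doc_id)
    unique_terms = len(postings)
    total_postings = sum(len(ids) for ids in postings.values())
    return unique_terms, total_postings

# 10,000 copies of the same three terms.
repeated = ["alpha beta gamma"] * 10_000
# 10,000 documents whose terms never repeat across documents.
unique = [f"t{3 * i} t{3 * i + 1} t{3 * i + 2}" for i in range(10_000)]

for name, corpus in (("repeated", repeated), ("unique", unique)):
    terms, posts = index_stats(corpus)
    print(f"{name}: {terms} distinct terms, {posts} postings")
```

Both corpora have the same number of postings, but the second has ten thousand times as many distinct terms, which is where the extra dictionary size would come from. Repeated postings also tend to compress better in a real index, which this toy does not model.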
The index stores each token of every indexed text field, so its size depends on the nature of the material being indexed. Add to that whatever fields are being stored as well. One good approach is to take a sample, index it, and extrapolate from that to the complete source collection. However, the ratio of index size to source size decreases as the collection grows, because many of the words are already in the index, so you might want to make the sample a decent percentage of the original.
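The sample-and-extrapolate idea can be sketched roughly as follows. This assumes you have indexed a few samples of increasing size and measured the on-disk size of each resulting index directory yourself. Because the index-to-source ratio shrinks as the collection grows, the sketch fits a power law, size ≈ a · n^b, rather than a straight line; that model is an assumption, not something Lucene guarantees. The function names and the numbers in the example are placeholders, not real measurements.

```python
import math

def fit_power_law(samples):
    """Least-squares fit of log(size) = log(a) + b * log(n).

    samples: list of (doc_count, index_bytes) pairs from trial indexes.
    Returns (a, b) such that index_bytes is approximately a * doc_count ** b.
    """
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(size) for _, size in samples]
    k = len(samples)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two different sample sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

def estimate_index_bytes(samples, total_docs):
    """Extrapolate the fitted curve to the full collection size."""
    a, b = fit_power_law(samples)
    return a * total_docs ** b

# Placeholder numbers only: replace with your own measured
# (documents indexed, index size in bytes) pairs.
measured = [(1_000, 2_400_000), (5_000, 10_500_000), (20_000, 37_000_000)]
estimate = estimate_index_bytes(measured, total_docs=1_000_000)
print(f"estimated index size: {estimate / 1e6:.0f} MB")
```

An exponent b below 1 corresponds to the shrinking ratio described above. The further the full collection is from the largest sample, the less trustworthy the extrapolation, which is another reason to make the sample a decent fraction of the whole.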
See the Lucene index format documentation. The major file is the compound index (.cfs file). If you have term statistics, you can probably get an estimate for the .cfs file size. Note that this varies greatly based on the Analyzer you use and on the field types you define.