How do I estimate the size of a Lucene index?

Posted 2019-04-24 01:09

Is there a known formula for estimating the size of a new Lucene index? I know how many fields I want to index, the size of each field, and how many items will be indexed. Once Lucene has processed them, how does that translate into bytes?

Tags: lucene

3 answers

走好不送

#2 · 2019-04-24 01:49

I think it also depends on the frequency of each term (i.e., an index of 10,000 copies of the same terms should be much smaller than an index of 10,000 wholly unique terms).

Also, there's probably a small dependency on whether you're using Term Vectors or not, and certainly whether you're storing fields or not. Can you provide more details? Can you analyze the term frequency of your source data?
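
To make the term-frequency point concrete, here is a minimal Python sketch that counts total tokens, unique terms, and distinct (term, document) pairs for a toy corpus. The function name `term_stats` and the naive word-split tokenizer are assumptions made for illustration; a real Lucene Analyzer will tokenize differently, and these counts only hint at relative index size rather than giving bytes.

```python
import re

def term_stats(docs):
    """Return (total_tokens, unique_terms, postings) for a list of strings.

    Tokenization is a naive lowercase word split (an assumption, not what
    a Lucene Analyzer does). "postings" counts distinct (term, document) pairs.
    """
    total_tokens = 0
    vocabulary = set()
    postings = 0
    for text in docs:
        tokens = re.findall(r"\w+", text.lower())
        total_tokens += len(tokens)
        doc_terms = set(tokens)
        vocabulary |= doc_terms      # each distinct term is counted once overall
        postings += len(doc_terms)   # one entry per term per document
    return total_tokens, len(vocabulary), postings

# Two toy corpora with the same number of tokens but different vocabularies.
repeated = ["alpha beta gamma"] * 4
unique = ["a1 b1 c1", "a2 b2 c2", "a3 b3 c3", "a4 b4 c4"]
print(term_stats(repeated))
print(term_stats(unique))
```

Both corpora contain the same number of tokens, but the second has four times as many unique terms, so its term dictionary would need more entries.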


SAY GOODBYE

#3 · 2019-04-24 02:13

The index stores each distinct token only once, so the size depends on the nature of the material being indexed, plus whatever fields you choose to store. One good approach is to index a sample and extrapolate from it to the complete source collection. However, the ratio of index size to source size decreases as more material is indexed, because many of the words are already in the index, so make the sample a decent percentage of the original.
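
The sample-and-extrapolate idea might be sketched as follows in Python. Two estimators are shown: a plain linear scale-up from one sample, which tends to overestimate if the size ratio really does shrink as the index grows, and a power-law fit over several sample sizes. The power-law model, the function names, and the numbers in the usage lines are all assumptions chosen for illustration, not something Lucene documents or measured results.

```python
import math

def linear_estimate(sample_docs, sample_bytes, total_docs):
    """Scale one sample measurement linearly to the full collection."""
    return sample_bytes / sample_docs * total_docs

def power_law_estimate(measurements, total_docs):
    """Fit index_bytes = a * docs**b in log-log space and extrapolate.

    measurements: list of (docs_indexed, index_bytes) pairs with at least
    two different document counts. The power-law shape is an assumption.
    """
    xs = [math.log(docs) for docs, _ in measurements]
    ys = [math.log(size) for _, size in measurements]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # least-squares slope in log-log space
    a = math.exp(mean_y - b * mean_x)    # intercept converted back to bytes
    return a * total_docs ** b

# Made-up sample measurements: (documents indexed, index size in bytes).
samples = [(1_000, 4_000_000), (10_000, 30_000_000), (100_000, 220_000_000)]
print(linear_estimate(1_000, 4_000_000, 1_000_000))
print(power_law_estimate(samples, 1_000_000))
```

Indexing samples of several sizes and comparing the two estimates gives a rough range; if they differ a lot, the sample is probably still too small.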


看我几分像从前

#4 · 2019-04-24 02:15

See the Lucene index format documentation. The major file is the compound index (.cfs file). If you have term statistics, you can probably get an estimate for the .cfs file size. Note that this varies greatly depending on the Analyzer you use and on the field types you define.
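
To see how much of a sample index the .cfs file actually accounts for, one option is to total the on-disk bytes per file extension. This is a minimal Python sketch using only the standard library; the function name `size_by_extension` and the example path in the comment are hypothetical, and it inspects files on disk rather than using any Lucene API.

```python
import os
from collections import defaultdict

def size_by_extension(index_dir):
    """Sum file sizes under index_dir, grouped by file extension.

    Files without an extension (for example a segments file) are
    grouped under the key "(none)".
    """
    sizes = defaultdict(int)
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            ext = os.path.splitext(name)[1] or "(none)"
            sizes[ext] += os.path.getsize(os.path.join(root, name))
    return dict(sizes)

# Example usage (the path is hypothetical):
# for ext, size in sorted(size_by_extension("/path/to/sample-index").items()):
#     print(ext, size)
```

Running this on indexes built with different Analyzers or field types shows how strongly those choices affect each file.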

