How to reduce the size of a generated Lucene/Solr index

Published 2019-02-18 23:00

Question:

I am working on a prototype of a search system.

I have a table in Oracle with some fields. I generated realistic-looking data, around 300,000 rows. For example:

PaymentNo|Datetime        |AmountEuro|PayersName            |PayersPhoneNo|ReceiversLegal|ReceiversAcc
2314     |2015-07-21T15:14|15.63     |Clinton, Barack Anjela|1.918.0060657|Nasa          |5555569778664190000
230338   |2015-08-01T15:14|34.87     |Merkel, George Donald |1.653.0060658|PepsiCo       |7777828443194736000

(Actually, there are more columns.)

The size of the table in Oracle is 62 MB (as reported by Toad).

I imported the table into Solr 5.2.1 (on Windows). The size of the index with stored data is 88 MB (on disk). The size of the index without stored data is 67 MB.

My question is: can I decrease the size of the index?

I have already tested these options: reducing the number of indexed table columns, switching off data storage in Solr, and excluding some of the rows from the index.

I need another way to decrease the size of the index. Do you know of any?

Answer 1:

You can use all the insights provided here. I wanted to share a few additional points.

Solr duplicates data across several structures in order to provide fast search over the indexed data. One important thing about Solr is that it stores all of this data in immutable data structures: index segments are written once and never modified in place.

  • Term dictionary: a dictionary of the indexed terms, along with their frequencies and offsets into the posting lists.
  • Term vectors: when enabled for a field, Solr stores a term vector for each indexed document. This is essentially a separate inverted index per document, and it is usually storage-heavy.
  • Stored docs: stores each document with its fields in sequential order.
  • Doc values: stores a field's values for all documents together, similar to columnar storage.
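
To make the difference between these structures concrete, here is a toy sketch in Python. It is only a conceptual illustration, not Lucene's actual on-disk format: the function names are hypothetical, the first two rows are taken from the question's sample, and the third row is made up so that one term appears in two documents.

```python
from collections import defaultdict

# Row-oriented "stored docs": each document keeps its own fields together.
# The third row is invented for illustration only.
stored_docs = [
    {"PaymentNo": 2314, "AmountEuro": 15.63, "ReceiversLegal": "Nasa"},
    {"PaymentNo": 230338, "AmountEuro": 34.87, "ReceiversLegal": "PepsiCo"},
    {"PaymentNo": 2315, "AmountEuro": 9.99, "ReceiversLegal": "Nasa"},
]


def build_inverted_index(docs, field):
    """Map each lowercased term of `field` to the sorted list of doc ids containing it."""
    postings = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        term = str(doc[field]).lower()
        postings[term].append(doc_id)
    return dict(postings)


def build_doc_values(docs, field):
    """Column-oriented view: the values of one field for all documents, indexed by doc id."""
    return [doc[field] for doc in docs]


inverted = build_inverted_index(stored_docs, "ReceiversLegal")
amounts = build_doc_values(stored_docs, "AmountEuro")

print(inverted)   # term -> posting list
print(amounts)    # one column, all documents
```

The same field value ends up in up to three places (stored docs, postings, doc values), which is why switching off the structures you do not need reduces the index size.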

If you are not using Solr's highlighting feature, you can disable document-level term vector storage (termVectors="false" on the field definition).
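
As a rough sketch of how this could be done on a managed schema, the snippet below builds a JSON body for the Schema API's `replace-field` command. It assumes a Solr version and configuration in which that command is available; with a hand-edited schema.xml you would change the field definition there instead. The field name `PayersName` comes from the question, while the `text_general` type and the `indexed`/`stored` values are assumptions. The code only constructs and prints the request body; it does not contact a Solr server.

```python
import json


def build_replace_field_command(name, field_type, indexed=True, stored=True):
    """Build a 'replace-field' body that redefines a field with term vectors off.

    replace-field replaces the whole definition, so the type and the other
    properties must be restated. All argument values are the caller's assumptions.
    """
    return {
        "replace-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,
            "stored": stored,
            "termVectors": False,    # no per-document term vectors
            "termPositions": False,  # term vector positions off as well
            "termOffsets": False,    # term vector offsets off as well
        }
    }


# Hypothetical field type; adjust to whatever the real schema uses.
command = build_replace_field_command("PayersName", "text_general")
body = json.dumps(command, indent=2)
print(body)
```

The body would be sent as a POST to the collection's `/schema` endpoint. Existing documents keep their old data until they are reindexed, so the size reduction only shows up after a full reindex.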

Additionally, Solr uses different compression techniques for different types of data. It uses bit packing and VInt (variable-length integer) compression for posting lists and numerical values, and LZ4 compression for stored fields and term vectors. It stores the term dictionary in an FST (finite state transducer), which is a special implementation of the trie data structure.
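
To give a feel for why posting lists compress well, here is a minimal sketch of delta encoding followed by a VInt-style encoding (7 data bits per byte, low-order bits first, high bit set when more bytes follow). It is a simplified illustration with hypothetical function names and made-up doc ids, not Lucene's actual postings format.

```python
def encode_vint(n):
    """Encode a non-negative int using 7 data bits per byte, low-order bits first."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)  # last byte has the high bit clear
    return bytes(out)


def decode_vints(data):
    """Decode a byte string of concatenated vints back into a list of ints."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def encode_postings(doc_ids):
    """Delta-encode a sorted list of doc ids, then vint-encode each gap."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += encode_vint(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decode_postings(data):
    """Rebuild the doc ids by summing the decoded gaps."""
    doc_ids, running = [], 0
    for gap in decode_vints(data):
        running += gap
        doc_ids.append(running)
    return doc_ids


doc_ids = [1000, 1003, 1010, 1200, 50000]  # made-up sorted doc ids
encoded = encode_postings(doc_ids)
print(len(encoded), "bytes as delta + vint, versus",
      4 * len(doc_ids), "bytes as fixed 32-bit ints")
```

Because the gaps between neighbouring doc ids are usually much smaller than the ids themselves, most of them fit in one or two bytes.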



Tags: solr lucene