I want to purge the Solr index whenever it occupies more than 10% of the total disk space. The purge should delete the oldest documents until the index is back under 10% of the total space. How can I go about finding these oldest documents?
I thought of finding the size of a single document and using that as the basis for determining how many docs to delete (sort by date asc and rows = N). Is there another way to go about it? Thanks.
When you are indexing your documents, you can enable a timestamp field that will record the date and time when the document is added to the index. Then you can query against the timestamp field to determine the oldest documents. Here is an example that used to be included in the Solr example schema.xml, but was dropped in more recent versions.
<!-- Uncommenting the following will create a "timestamp" field using
a default value of "NOW" to indicate when each document was indexed.
-->
<!--
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
-->
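As a rough sketch of querying against that timestamp field, the snippet below builds a select URL that sorts by timestamp ascending and asks for the first n documents. The base URL, the core name "mycore", and the "id" field name are assumptions for illustration; swap in whatever your setup uses.

```python
from urllib.parse import urlencode


def oldest_docs_url(base_url, n, ts_field="timestamp", id_field="id"):
    """Build a Solr select URL that returns the n oldest documents.

    base_url is the core URL (hypothetical here), e.g.
    http://localhost:8983/solr/mycore
    """
    params = {
        "q": "*:*",                      # match every document
        "sort": f"{ts_field} asc",       # oldest first
        "rows": n,                       # only the first n hits
        "fl": f"{id_field},{ts_field}",  # return just the id and the timestamp
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Assumed host and core name, for illustration only.
print(oldest_docs_url("http://localhost:8983/solr/mycore", 100))
```

Returning only the id keeps the response small, which helps if n turns out to be large.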
Your strategy for determining the average size of a document and removing a set number based on that sounds like a valid option.
I think you can try this:
- Get an average document size, using (averageDocSize = indexSize/totalDocuments).
- Calculate the size to delete, i.e. how far the index exceeds 10% of the total disk space (sizeToDelete = indexSize - 0.1 * totalDiskSpace).
- Calculate the number of documents to delete (n = sizeToDelete/averageDocSize).
- Use your previous query to get the oldest n documents.
- Delete the documents.
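The arithmetic in these steps could be sketched roughly like this. Note that it takes the size to delete as the amount by which the index exceeds 10% of the disk, which is my reading of the goal in the question. The function names and sample numbers are made up for illustration, and the delete body assumes you delete by id through Solr's JSON update handler.

```python
import json
import math


def docs_to_delete(index_bytes, total_docs, disk_bytes, max_fraction=0.10):
    """Estimate how many of the oldest documents to delete so the index
    falls to at most max_fraction of the disk (assumes uniform doc size)."""
    if total_docs <= 0:
        return 0
    excess = index_bytes - max_fraction * disk_bytes
    if excess <= 0:
        return 0  # already under the limit, nothing to purge
    avg_doc_size = index_bytes / total_docs
    # Round up so the estimate does not leave the index just over the limit.
    return min(total_docs, math.ceil(excess / avg_doc_size))


def delete_payload(ids):
    """JSON body for a delete-by-id request to the update handler."""
    return json.dumps({"delete": list(ids)})


# Made-up numbers: 12 GB index, 1M docs, 100 GB disk.
n = docs_to_delete(index_bytes=12 * 10**9, total_docs=1_000_000,
                   disk_bytes=100 * 10**9)
print(n)
print(delete_payload(["doc-1", "doc-2"]))
```

Keep in mind that the average is only an estimate, and that deleted documents stop taking up disk space only once their segments are merged, so it is worth re-checking the index size afterwards.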
The two inputs needed for these steps are the index size and the total number of documents.
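For the index size, one option (an assumption on my part, not the only way) is to measure the index directory on disk and compare it with the size of the filesystem it lives on. The helper names below are hypothetical, and the index directory path is something you would have to supply.

```python
import os
import shutil


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):  # skip broken symlinks and the like
                total += os.path.getsize(fp)
    return total


def index_fraction(index_dir):
    """Fraction of the containing filesystem occupied by index_dir."""
    disk_total = shutil.disk_usage(index_dir).total
    return dir_size_bytes(index_dir) / disk_total
```

The total number of documents can be read from numFound in the response to a q=*:* query with rows=0, so no documents need to be fetched for it.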