Suppose you have a large file made up of a bunch of fixed size blocks. Each of these blocks contains some number of variable sized records. Each record must fit completely within a single block and then such records by definition are never larger than a full block. Over time, records are added to and deleted from these blocks as records come and go from this "database".
At some point, especially after perhaps many records are added to the database and several are removed - many of the blocks may end up only partially filled.
What is a good algorithm to shuffle the records around in this database to compact out unnecessary blocks at the end of the file by better filling up the partially filled blocks?
Requirements of the algorithm:
- The compaction must happen in place of the original file without temporarily extending the file by more than a few blocks at most from its starting size
- The algorithm should not unnecessarily disturb blocks that are already mainly full
- Entire blocks must be read or written from/to the file at one time and one should assume the write operation is relatively expensive
- If records are moved from one block to another they must be added at their new location before being removed from their starting position so that in case the operation is interrupted no records are lost as a result of the "failed" compaction. (Assume that this temporary duplication of such records can be detected at recovery).
- The memory that can be used for this operation can only be on the order of perhaps several blocks which is a very small percentage of the overall file size
- Assume that records are on the order of 10 bytes to 1K bytes with an average size of maybe 100 bytes. The fixed sized blocks are on the order of 4K or 8K and that the file is on the order of 1000's of blocks.
This sounds like a variation of the bin packing problem, but where you already have an inferior allocation that you want to improve. So I suggest looking at variations of the approaches which are successful for the bin packing problem.
First of all, you probably want to parameterize your problem by defining what you consider "full enough" (where a block is full enough that you don't want to touch it), and what is "too empty" (where a block has so much empty space that it has to have more records added to it). Then, you can classify all the blocks as full enough, too empty, or partially full (those that are neither full enough nor too empty). You then redefine the problem as how to eliminate all the too empty blocks by creating as many full enough blocks as possible while minimising the number of partially full blocks.
You'll also want to work out what's more important: getting the records into the fewest blocks possible, or packing them adequately but with the least amount of blocks read and written.
My approach would be to make an initial pass over all the blocks, to classify them all into one of the three classes defined above. For each block, you also want to keep track of the free space available in it, and for the too empty blocks, you'll want to have a list of all the records and their sizes. Then, starting with the largest records in the too empty blocks, move them into the partially full blocks. If you want to minimise reads and writes, move them into any of the blocks you currently have in memory. If you want to minimise wasted space, find the block with the least amount of empty space that will still hold the record, reading the block in if necessary. If no block will hold the record, create a new block. If a block in memory reaches the "full enough" threshold, write it out. Repeat until all the records in the partially filled blocks have been placed.
I've skipped over more than a few details, but this should give you some idea. Note that the bin packing problem is NP-hard, which means that for practical applications, you'll want to decide what's most important to you, so you can pick an approach that will give you an approximately best solution in reasonable time.
If there is no ordering to these records, I'd simply fill the blocks from the front with records extracted from the last block(s). This will minimize movement of data, is fairly simple, and should do a decent job of packing data tightly.
E.g.:
// records should be sorted by size in memory (probably in a balanced BST)
records = read last N blocks on disk;
foreach (block in blocks) // read from disk into memory
{
if (block.hasBeenReadFrom())
{
// we read from this into records already
// all remaining records are already in memory
writeAllToNewBlocks(records);
// this will leave some empty blocks on the disk that can either
// be eliminated programmatically or left alone and filled during
// normal operation
foreach (record in records)
{
record.eraseFromOriginalLocation();
}
break;
}
while(!block.full())
{
moveRecords = new Array; // list of records we've moved
size = block.availableSpace();
record = records.extractBestFit(size);
if (record == null)
{
break;
}
moveRecords.add(record);
block.add(record);
if (records.gettingLow())
{
records.readMoreFromDisk();
}
}
if(moveRecords.size() > 0)
{
block.writeBackToDisk();
foreach (record in moveRecords)
{
record.eraseFromOriginalLocation();
}
}
}
Update: I neglected to maintain the no-blocks-only-in-memory rule. I've updated the pseudocode to fix this. Also fixed a glitch in my loop condition.
A modification of an on-line (to defragment in one pass) bounded space (the memory requirements) bin packing algorithm could probably work here.
See "Bin Packing Approximation Algorithms: Combinatorial Analysis" by Coffman et al.
Here's an algorithm you might be able to leverage, albeit your records within fixed size blocks might require a little bit more work.
Heap Defragmentation in Bounded Time