I'm working on a Huffman coding/decoding project in C and have a good understanding of how the algorithm should store information about the Huffman tree, re-build the tree during decoding, and decompress to the original input file using variable-length codes.
When writing to my compressed file, I will output a table of 256 4-byte integers containing unique frequencies, and I know I will also have to figure out a way to handle EOF - worrying about that later.
My question is how should I complete the necessary bit-wise operations to write a stream of variable-length codes to a series of 1-byte iterations of fwrite.
If I've created the following (fictitious) codes:
a: 001010101010011
b: 100
c: 11111
d: 0
The bitstream for "abcd" would be:
001010101010011100111110
I know I'll need to use some bit-wise operations to "chop" this stream up into writeable bytes:
00101010|10100111|00111110
A first attempt at creating 8 different cases based upon lengths of the codes did not work out well and I'm stumped. Is there an easier way to handle variable-length codes when writing to a file?
Thank you
Here's some pseudo-code to give you the general idea:
static byte BitBuffer = 0;
static byte BitsinBuffer = 0;
static void WriteBitCharToOutput(char bitChar);
// buffer one binary digit ('1' or '0')
{
if (BitsInBuffer > 7)
{
stream.write(BitBuffer);
BitsInBuffer = 0;
BitBuffer = 0; // just to be tidy
}
BitBuffer = (BitBuffer << 1) | (bitChar == '1' ? 1 : 0);
BitsInBuffer++;
}
static void FlushBitBuffer()
// call after last character has been encoded
// to flush out remaining bits
{
if (BitsInBuffer > 0)
do
{
WriteBitCharToOutput('0'); // pad with zeroes
} while (BitsInBuffer != 1);
}
As an alternative to the other answer, if you want to write several bits at once to your buffer, you can. It could look something like this: (this is meant to be pseudocode, though it looks fairly real)
uint32_t buffer = 0;
int bufbits = 0;
for (int i = 0; i < symbolCount; i++)
{
int s = symbols[i];
buffer <<= lengths[s]; // make room for the bits
bufbits += lengths[s]; // buffer got longer
buffer |= values[s]; // put in the bits corresponding to the symbol
while (bufbits >= 8) // as long as there is at least a byte in the buffer
{
bufbits -= 8; // forget it's there
writeByte((buffer >> bufbits) & 0xFF); // and save it
}
}
Not shown: obviously you have to save anything left over in the buffer when you're done writing to it.
This assumes that the maximum code length is 25 or less. The maximum number of bits that can be left in the buffer is 7, 7+25 is the longest thing that fits in a 32 bit integer. This is not a bad limitation, usually the code length is limited to 15 or 16 to allow the simplest form of table-based decoding without needing a huge table.