Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of it.
The compression ratio - considering that there is no further processing, e.g. no Huffman or arithmetic encoding - is surprisingly good.
The runtime of my trivial implementation was less than stellar, however.
How can this be optimized? Is it possible to do it in a single pass?
Yes, keep us posted.
guarantee?
BobMcGee gives good advice. However, I suspect that "Limit the block size to less than 65kB ... . This guarantees not all bytes will be used" is not always true. I can generate a (highly artificial) binary file less than 1kB long that has a byte pair that repeats 10 times, but cannot be compressed at all with BPE because it uses all 256 bytes -- there are no free bytes that BPE can use to represent the frequent byte pair.
If we limit ourselves to 7 bit ASCII text, we have over 127 free bytes available, so all files that repeat a byte pair enough times can be compressed at least a little by BPE. However, even then I can (artificially) generate a file that uses only the isgraph() ASCII characters and is less than 30kB long that eventually hits the "no free bytes" limit of BPE, even though there is still a byte pair remaining with over 4 repeats.
single pass
It seems like this algorithm can be slightly tweaked in order to do it in one pass. Assuming 7 bit ASCII plaintext: Scan over input text, remembering all pairs of bytes that we have seen in some sort of internal data structure, somehow counting the number of unique byte pairs we have seen so far, and copying each byte to the output (with high bit zero). Whenever we encounter a repeat, emit a special byte that represents a byte pair (with high bit 1, so we don't confuse literal bytes with byte pairs). Include in the internal list of byte "pairs" that special byte, so that the compressor can later emit some other special byte that represents this special byte plus a literal byte -- so the net effect of that other special byte is to represent a triplet. As phkahler pointed out, that sounds practically the same as LZW.
EDIT: Apparently the "no free bytes" limitation I mentioned above is not, after all, an inherent limitation of all byte pair compressors, since there exists at least one byte pair compressor without that limitation.
Have you seen "SCZ - Simple Compression Utilities and Library"? SCZ appears to be a kind of byte pair encoder. SCZ apparently gives better compression than other byte pair compressors I've seen, because SCZ doesn't have the "no free bytes" limitation I mentioned above.
If any byte pair BP repeats enough times in the plaintext (or, after a few rounds of iteration, the partially-compressed text), SCZ can do byte-pair compression, even when the text already includes all 256 bytes.
(SCZ uses a special escape byte E in the compressed text, which indicates that the following byte is intended to represent itself literally, rather than expanded as a byte pair. This allows some byte M in the compressed text to do double-duty: The two bytes EM in the compressed text represent M in the plain text. The byte M (without a preceeding escape byte) in the compressed text represents some byte pair BP in the plain text. If some byte pair BP occurs many more times than M in the plaintext, then the space saved by representing each BP byte pair as the single byte M in the compressed data is more than the space "lost" by representing each M as the two bytes EM.)
I don't believe this can be done in a single pass unless you find a way to predict, given a byte-pair replacement, if the new byte-pair (after-replacement) will be good for replacement too or not.
Here are my thoughts at first sight. Maybe you already do or have already thought all this.
I would try the following.
Two adjustable parameters:
I would do passes, as long as it is still worth compressing one more level (according to parameter 2). During each pass, I would keep a count of byte-pairs as I go.
I would play with the two parameters a little and see how it influences compression ratio and speed. Probably that they should change dynamically, according to the length of the chunk to compress (and maybe one or two other things).
Another thing to consider is the data structure used to store the count of each byte-pair during the pass. There very likely is a way to write a custom one which would be faster than generic data structures.
Keep us posted if you try something and get interesting results!