optimizing byte-pair encoding

2019-02-05 02:28发布

Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of it.

The compression ratio - considering that there is no further processing, e.g. no Huffman or arithmetic encoding - is surprisingly good.

The runtime of my trivial implementation was less than stellar, however.

How can this be optimized? Is it possible to do it in a single pass?

8条回答
放荡不羁爱自由
2楼-- · 2019-02-05 03:01

You can also optimize the dictionary so that:

AA1BB2CC3DD4EE5FF6GG7HH8 is a sequential run of 8 token.

Rewrite that as:

AA1<255>BBCCDDEEFFGGHH<255> where the <255> tells the program that each of the following byte pairs (up to the next <255>) are sequential and incremented by one. Works great for text files and any where there are at least 4 sequential tokens.

save 175 bytes on recent test.

查看更多
贼婆χ
3楼-- · 2019-02-05 03:03

the easiest efficient structure is a 2 dimensional array like byte_pair(255,255). Drop the counts in there and modify as the file compresses.

查看更多
唯我独甜
4楼-- · 2019-02-05 03:05

Code in JustBasic can be found here complete with input text file.

Just BASIC Files Archive – forum post

EBPE by TomC 02/2014 – Ehanced Byte Pair Encoding

EBPE features two post processes to Byte Pair Encoding


1. Is compressing the dictionary (believed to be a novelty)

A dictionary entry is composed of 3 bytes:

AA – the two char to be replaced by (byte pair)
 1 – this single token (tokens are unused symbols)

So "AA1" tells us when decoding that every time we see a "1" in the data file, replace it with "AA".

While long runs of sequential tokens are possible, let’s look at this 8 token example:

AA1BB3CC4DD5EE6FF7GG8HH9

It is 24 bytes long (8 * 3)

The token 2 is not in the file indicating that it was not an open token to use, or another way to say it: the 2 was in the original data.

We can see the last 7 tokens 3,4,5,6,7,8,9 are sequential so any time we see a sequential run of 4 tokens or more, let’s modify our dictionary to be:

AA1BB3<255>CCDDEEFFGGHH<255>

Where the <255> tells us that the tokens for the byte pairs are implied and are incremented by 1 more than the last token we saw (3). We increment by one until we see the next <255> indicating an end of run.

  • The original dictionary was 24 bytes,
  • The enhanced dictionary is 20 bytes.

I saved 175 bytes using this enhancement on a text file where tokens 128 to 254 would be in sequence as well as others in general, to include the run created by lowercase pre-processing.

2. Is compressing the data file

Re-using rarely used characters as tokens is nothing new.

After using all of the symbols for compression (except for <255>), we scan the file and find a single "j" in the file. Let this char do double duty by:

  • "<255>j" means this is a literal "j"
  • "j" is now used as a token for re-compression,

If the j occurred 1 time in the data file, we would need to add 1 <255> and a 3 byte dictionary entry, so we need to save more than 4 bytes in BPE for this to be worth it.

If the j occurred 6 times we would need 6 <255> and a 3 byte dictionary entry so we need to save more than 9 bytes in BPE for this to be worth it.

Depending on if further compression is possible and how many byte pairs remain in the file, this post process has saved in excess of 100 bytes on test runs.

Note: When decompressing make sure not to decompress every "j". One needs to look at the prior character to make sure it is not a <255> in order to decompress. Finally, after all decompression, go ahead and remove the <255>'s to recreate your original file.

3. What’s next in EBPE?

Unknown at this time

查看更多
女痞
5楼-- · 2019-02-05 03:09

This is a summary of my progress so far:

Googling found this little report that links to the original code and cites the source:

Philip Gage, titled 'A New Algorithm for Data Compression', that appeared in 'The C Users Journal' - February 1994 edition.

The links to the code on Dr Dobbs site are broken, but that webpage mirrors them.

That code uses a hash table to track the the used digraphs and their counts each pass over the buffer, so as to avoid recomputing fresh each pass.

My test data is enwik8 from the Hutter Prize.

|----------------|-----------------|
| Implementation | Time (min.secs) |
|----------------|-----------------|
| bpev2          | 1.24            | //The current version in the large text benchmark
| bpe_c          | 1.07            | //The original version by Gage, using a hashtable
| bpev3          | 0.25            | //Uses a list, custom sort, less memcpy
|----------------|-----------------|

bpev3 creates a list of all digraphs; the blocks are 10KB in size, and there are typically 200 or so digraphs above the threshold (of 4, which is the smallest we can gain a byte by compressing); this list is sorted and the first subsitution is made.

As the substitutions are made, the statistics are updated; typically each pass there is only around 10 or 20 digraphs changed; these are 'painted' and sorted, and then merged with the digraph list; this is substantially faster than just always sorting the whole digraph list each pass, since the list is nearly sorted.

The original code moved between a 'tmp' and 'buf' byte buffers; bpev3 just swaps buffer pointers, which is worth about 10 seconds runtime alone.

Given the buffer swapping fix to bpev2 would bring the exhaustive search in line with the hashtable version; I think the hashtable is arguable value, and that a list is a better structure for this problem.

Its sill multi-pass though. And so its not a generally competitive algorithm.

If you look at the Large Text Compression Benchmark, the original bpe has been added. Because of it's larger blocksizes, it performs better than my bpe on on enwik9. Also, the performance gap between the hash-tables and my lists is much closer - I put that down to the march=PentiumPro that the LTCB uses.

There are of course occasions where it is suitable and used; Symbian use it for compressing pages in ROM images. I speculate that the 16-bit nature of Thumb binaries makes this a straightforward and rewarding approach; compression is done on a PC, and decompression is done on the device.

查看更多
我命由我不由天
6楼-- · 2019-02-05 03:09

I've done work with optimizing a LZF compression implementation, and some of the same principles I used to improve performance are usable here.

To speed up performance on byte-pair encoding:

  1. Limit the block size to less than 65kB (probably 8-16 kB will be optimal). This guarantees not all bytes will be used, and allows you to hold intermediate processing info in RAM.
  2. Use a hashtable or simple lookup table by short integer (more RAM, but faster) to hold counts for a byte pairs. There are 65,656 2-byte pairs, and BlockSize instances possible (max blocksize 64k). This gives you a table of 128k possible outputs.
  3. Allocate and reuse data structures capable of holding a full compression block, replacement table, byte-pair counts, and output bytes in memory. This sounds wasteful of RAM, but when you consider that your block size is small, it's worth it. Your data should be able to sit entirely in CPU L2 or (worst case) L3 cache. This gives a BIG speed boost.
  4. Do one fast pass over the data to collect counts, THEN worry about creating your replacement table.
  5. Pack bytes into integers or short ints whenever possible (applicable mostly to C/C++). A single entry in the counting table can be represented by an integer (16-bit count, plus byte pair).
查看更多
戒情不戒烟
7楼-- · 2019-02-05 03:13

Here is a new BPE(http://encode.ru/threads/1874-Alba). Example for compile, gcc -O1 alba.c -o alba.exe It's faster than default.

查看更多
登录 后发表回答