String to string compression algorithm?

Published 2020-04-17 06:52

Question:

I'm looking for an algorithm that would compress one string into another string (i.e., a string without "\0" or other special control characters), but I can't find anything on the internet. Is there such an algorithm? It doesn't have to be particularly efficient, just something basic.

Answer 1:

Apparently you have some specific character set in mind and you want to use it for both the original string and the compressed string.

Standard compression routines (e.g. gzip) work on byte strings.

One idea is to take existing code (e.g. gzip's) and rewrite it to use your character set instead of bytes.

Another is to construct a 1-to-1 mapping between strings in your character set and arbitrary byte strings, map the original string to a byte string, compress the byte string using a standard compression utility or function, and map the result back to a string using your character set. (Strictly speaking you can use two different mappings.)

One way to construct the mapping is to pad your character set with dummies and a special pad character until you have 2^k different characters (for some k); then every 8 of your characters correspond to exactly k bytes (and shorter strings can be filled out with the pad character).
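As a sketch of the mapping idea: below, a hypothetical 64-character alphabet (k = 6, so 8 characters carry 6 bytes' worth of bits; equivalently, 4 characters encode 3 bytes) is used to round-trip the output of `zlib` through printable characters. The alphabet and pad character are illustrative, not canonical.

```python
import zlib

# Hypothetical 64-character alphabet (2^6 characters, so each carries 6 bits)
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+-"
PAD = "="  # special pad character, not in the alphabet

def bytes_to_chars(data: bytes) -> str:
    bits = "".join(f"{b:08b}" for b in data)
    bits += "0" * (-len(bits) % 6)        # pad the bit string to a multiple of 6
    out = "".join(ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
    return out + PAD * (-len(out) % 4)    # pad so the length is a multiple of 4

def chars_to_bytes(s: str) -> bytes:
    s = s.rstrip(PAD)
    bits = "".join(f"{ALPHABET.index(c):06b}" for c in s)
    # keep only whole bytes; the trailing filler bits are dropped
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits) - 7, 8))

def compress(text: str) -> str:
    return bytes_to_chars(zlib.compress(text.encode("utf-8")))

def decompress(s: str) -> str:
    return zlib.decompress(chars_to_bytes(s)).decode("utf-8")
```

Any standard byte-level compressor can stand in for `zlib` here; the mapping layer is what keeps the result free of control characters.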



Answer 2:

Easy:

$ echo "Hello world" | gzip -c | base64
H4sIALnHeU4AA/NIzcnJVyjPL8pJ4QIA1eA5twwAAAA=

$ echo "H4sIALnHeU4AA/NIzcnJVyjPL8pJ4QIA1eA5twwAAAA=" | base64 -d | gzip -dc
Hello world

Note: for this short input the "compressed" output is actually longer (gzip header plus base64 overhead), but for bigger data the compression ratio will be better :-)



Answer 3:

Your requirement for no "special characters" is very restrictive, unless you can guarantee that a subset of characters (say "~") will never be used. Then you can use those characters to mark your compression:

~a -> the
~b -> The
~c -> and
~d -> And
~e -> Sirius Robotics Corporation Ltd.
etc.

Just add commonly used words to the codebook. The codebook can be fixed, as above, or vary with the text to be compressed. Either way the decompressing side will need access to the correct codebook to do the decompression.
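A fixed-codebook substitution like the one above might be sketched as follows (the codebook entries are the illustrative ones from this answer, and `~` is assumed never to occur in the input text):

```python
# Illustrative fixed codebook; '~' is assumed never to appear in the input.
CODEBOOK = {
    "~a": "the", "~b": "The", "~c": "and", "~d": "And",
    "~e": "Sirius Robotics Corporation Ltd.",
}

def compress(text: str) -> str:
    # Replace longer phrases first so shorter entries can't clobber them
    for code, phrase in sorted(CODEBOOK.items(), key=lambda kv: -len(kv[1])):
        text = text.replace(phrase, code)
    return text

def decompress(text: str) -> str:
    for code, phrase in CODEBOOK.items():
        text = text.replace(code, phrase)
    return text
```

Both sides use the same `CODEBOOK`, which is exactly the shared-codebook requirement the answer mentions.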



Answer 4:

As far as I can tell, the most popular compression algorithm that allows standard C string-handling routines to be re-used to handle compressed text (i.e., carefully avoids putting any 0x00 bytes in the compressed string, except as the end-of-compressed-data marker) is simple byte-pair encoding, also called dual-tile encoding or DTE. DTE is often used to compress text in video game ROMs.

To print a DTE-compressed string, the decompressor reads one byte at a time and emits one or two bytes:

  • compressed byte B in the range 0x01..0xFF: the decoder uses B as an index into the "dictionary" and prints out the one or two bytes stored in the dictionary at that index.
  • compressed byte B equal to 0x00: that's the end of the string -- done.

A typical DTE implementation has a hard-wired dictionary stored in both the encoder and the decoder something like this:

  • indexes of frequently-used letters -- perhaps the entire ASCII isprint() range 0x20 to 0x7e, and the newline character 0x0A -- represent themselves. (The compressed byte 'a' is decoded as the single letter 'a')
  • indexes from 0xc0 to 0xff: the byte is decoded into 2 characters: a space character, and a letter formed from this byte XORed with 0x80. (The compressed byte (0x80 xor 'a') is decoded into 2 characters, the space character and the letter 'a').
  • Any other available indexes ( 0x7f..0xbf ) store other common bigrams ("th", "re", etc.).
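The dictionary layout above can be sketched as a toy decoder (the bigram table here is hypothetical; a real ROM would hard-wire its own):

```python
# Illustrative bigram table occupying part of the 0x7F..0xBF index range
BIGRAMS = {0x7F: "th", 0x80: "re", 0x81: "in", 0x82: "er"}

def dte_decode(data: bytes) -> str:
    out = []
    for b in data:
        if b == 0x00:                # end-of-compressed-data marker
            break
        elif 0xC0 <= b <= 0xFF:      # space + (byte XOR 0x80) pair
            out.append(" " + chr(b ^ 0x80))
        elif b in BIGRAMS:           # common bigram
            out.append(BIGRAMS[b])
        else:                        # printable ASCII / newline: literal
            out.append(chr(b))
    return "".join(out)
```

For example, the bytes `0x7F 'e' 0xE3 'a' 't' 0x00` decode to "the cat": 0x7F yields "th", 0xE3 yields a space plus `chr(0xE3 ^ 0x80)` = "c", and the rest are literals.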