Compressing UTF-8(or other 8-bit encoding) to 7 or

2019-07-22 23:06发布

问题:

I wish to take a file encoded in UTF-8 that doesn't use more than 128 different characters, then move it to a 7-bit encoding to save the 1/8 of space. For example, if I have a 16 MB text file that only uses the first 128(ascii) characters, I would like to shave off the extra bit to reduce the file to 14MB.

How would I go about doing this?

There doesn't seem to be an existing free or proprietary program to do so, so I was thinking I might try and make a simple(if inefficient) one.

The basic idea I have is to make a function from the current hex/decimal/binary values used for each character to the 128 values I would have in the seven bit encoding, then scan through the file and write each modified value to a new file.

So if the file looked like(I'll use a decimal example because I try not to have to think in hex)

127 254 025 212 015 015 132... It would become

001 002 003 004 005 005 006

If 127 mapped to 001, 254 mapped to 005, etc.

I'm not entirely sure on a couple things, though.

  1. Would this be enough to actually shorten the filesize? I have a bad feeling this would simply leave an extra 0 on the binary string--11011001 might get mapped to 01000001 rather than 1000001, and I won't actually save space. If this would happen, how do I get rid of the zero?
  2. How do I open the file to read/write in binary/decimal/hex rather than just text? I've mostly worked with Python, but I can muddle through C if I must.

Thank you.

回答1:

Just use gzip compression, and save 60-70% with 0% effort!



回答2:

Do you understand that files are divided into bytes? Thus, if you did that, you'd have 7 bits of the first letter in bytes 1, plus 1 bit of the second letter, then in byte two, you'd have 6 bits of the second letter, and 2 bits of the third, so on. It would look like this:

|AAAAAAAB|BBBBBBCC|CCCCCDDD|DDDDEEEE|EEEFFFFF|FF...
 \------/ \------/ \------/ \------/ \------/
   byte     byte     byte     byte     byte


回答3:

Your idea is on the right track, but needs some development. If you're interested in this kind of data compression, you may want to investigate Huffman coding. This is a simple data compression technique that is used in many real-world situations.

I can recommend The Data Compression Book by Mark Nelson which is a great introduction to data compression techniques.



回答4:

Your idea is unlikely to work. If you write the byte 0x05 into a file, the byte gets written, all 8 bits of it - with leading zeros. To actually accomplish what you need, you can encode each 8 bytes in 7 bytes (since you only need 8*7 bits to encode 8 values). One approach is keep the 7 values in the 7 low bits of their bytes, and spread the 8th byte over the 7 MSBits.

As for Python, opening a file in binary write mode is open(filename, 'wb'). You'll also have to learn about bit operations to pack bytes as described above.

Just a small example:

>>> a = 0x03
>>> b = 0x59
>>> c = ((a & 0x1) << 7) | b
>>> hex(c)
'0xd9'
>>> 

This places the lowest bit of a into the MSBit of c and the rest of c is the value of b.

I'm sure you can take it from here.



回答5:

"this would simply leave an extra 0 on the binary string--11011001 might get mapped to 01000001 rather than 1000001, and I won't actually save space."

Correct. Your plan will do nothing.



回答6:

What you need is UTF-7.

Edit: UTF-7 has the advantage of bloating "only" special characters, so if special characters are rare in the input, you get far less bytes than by just converting UTF-8 to 7 bit. That's what UTF-7 is for.