Compressing UTF-8(or other 8-bit encoding) to 7 or

Posted 2019-07-22 22:42

I wish to take a file encoded in UTF-8 that doesn't use more than 128 different characters, then move it to a 7-bit encoding to save 1/8 of the space. For example, if I have a 16 MB text file that only uses the first 128 (ASCII) characters, I would like to shave off the extra bit to reduce the file to 14 MB.

How would I go about doing this?

There doesn't seem to be an existing free or proprietary program to do so, so I was thinking I might try to make a simple (if inefficient) one.

The basic idea I have is to make a function from the current hex/decimal/binary values used for each character to the 128 values I would have in the 7-bit encoding, then scan through the file and write each modified value to a new file.

So if the file looked like this (I'll use a decimal example because I try not to have to think in hex):

127 254 025 212 015 015 132...

it would become

001 002 003 004 005 005 006

if 127 mapped to 001, 254 mapped to 002, etc.

I'm not entirely sure on a couple things, though.

  1. Would this be enough to actually shorten the filesize? I have a bad feeling this would simply leave an extra 0 on the binary string--11011001 might get mapped to 01000001 rather than 1000001, and I won't actually save space. If this would happen, how do I get rid of the zero?
  2. How do I open the file to read/write in binary/decimal/hex rather than just text? I've mostly worked with Python, but I can muddle through C if I must.

Thank you.

6 answers
The star\"
#2 · 2019-07-22 22:51

Just use gzip compression, and save 60-70% with 0% effort!
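A minimal sketch of what that looks like in Python, using the standard-library gzip module. The sample text is made up and highly repetitive, so it compresses far better than a typical file would; the actual saving depends entirely on the data.

```python
import gzip

# Made-up sample data: pure ASCII, very repetitive (an assumption for the demo).
text = ("The quick brown fox jumps over the lazy dog.\n" * 1000).encode("ascii")

compressed = gzip.compress(text)
restored = gzip.decompress(compressed)

# Fraction of the original size that remains after compression.
ratio = len(compressed) / len(text)
print(f"{len(text)} bytes -> {len(compressed)} bytes ({ratio:.1%} of original)")
```

For a file on disk, gzip.open(path, 'wb') can be used in place of open(path, 'wb') to write compressed data directly.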

闹够了就滚
#3 · 2019-07-22 22:53

Your idea is unlikely to work as described. If you write the byte 0x05 into a file, the whole byte gets written, all 8 bits of it, including the leading zeros. To actually accomplish what you need, you can encode every 8 values in 7 bytes (since you only need 8*7 = 56 bits to encode eight 7-bit values). One approach is to keep 7 of the values in the 7 low bits of their bytes, and spread the 7 bits of the 8th value over the most significant bits of those 7 bytes.

As for Python, opening a file in binary write mode is open(filename, 'wb'). You'll also have to learn about bit operations to pack bytes as described above.

Just a small example:

>>> a = 0x03
>>> b = 0x59
>>> c = ((a & 0x1) << 7) | b
>>> hex(c)
'0xd9'
>>> 

This places the lowest bit of a into the MSBit of c and the rest of c is the value of b.

I'm sure you can take it from here.
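For readers who want to see the whole scheme, here is a rough sketch of the 8-into-7 packing described above. The function names pack8to7 and unpack7to8 are made up for illustration, and the choice to zero-pad the last group and pass the original length separately is an assumption; a real file format would have to store that length somewhere.

```python
def pack8to7(values: bytes) -> bytes:
    """Pack 7-bit values so that each group of 8 values occupies 7 bytes."""
    if any(v > 0x7F for v in values):
        raise ValueError("every value must fit in 7 bits")
    # Zero-pad to a multiple of 8 values (assumption: length is stored elsewhere).
    padded = bytes(values) + bytes(-len(values) % 8)
    out = bytearray()
    for i in range(0, len(padded), 8):
        group = padded[i:i + 8]
        eighth = group[7]
        for k in range(7):
            # Bit k of the 8th value goes into the top bit of byte k.
            out.append(group[k] | (((eighth >> k) & 1) << 7))
    return bytes(out)


def unpack7to8(packed: bytes, count: int) -> bytes:
    """Reverse pack8to7; count is the number of original values."""
    out = bytearray()
    for i in range(0, len(packed), 7):
        group = packed[i:i + 7]
        eighth = 0
        for k in range(7):
            out.append(group[k] & 0x7F)        # low 7 bits: one original value
            eighth |= (group[k] >> 7) << k     # top bit: one bit of the 8th value
        out.append(eighth)
    return bytes(out[:count])


data = b"abcdefgh" * 1000
packed = pack8to7(data)
print(len(data), "->", len(packed))
```

To apply this to a file, read the input with open(path, 'rb').read() and write the result with open(path, 'wb').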

#4 · 2019-07-22 22:56

Your idea is on the right track, but needs some development. If you're interested in this kind of data compression, you may want to investigate Huffman coding. This is a simple data compression technique that is used in many real-world situations.

I can recommend The Data Compression Book by Mark Nelson which is a great introduction to data compression techniques.

Bombasti
#5 · 2019-07-22 23:01

"this would simply leave an extra 0 on the binary string--11011001 might get mapped to 01000001 rather than 1000001, and I won't actually save space."

Correct. Your plan will do nothing.

做个烂人
#6 · 2019-07-22 23:03

What you need is UTF-7.

Edit: UTF-7 has the advantage of bloating "only" special characters, so if special characters are rare in the input, you get far fewer bytes than by just converting UTF-8 to 7 bit. That's what UTF-7 is for.
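A quick way to try this is Python's built-in "utf-7" codec. The sample strings below are made up. Note what the sketch shows: every output byte stays below 128, but each one is still stored as a full 8-bit byte, so pure ASCII text comes out the same length and UTF-7 does not by itself remove the eighth bit.

```python
ascii_text = "plain ASCII text"
mixed_text = "na\u00efve caf\u00e9"   # contains two non-ASCII characters

utf7_ascii = ascii_text.encode("utf-7")
utf7_mixed = mixed_text.encode("utf-7")
utf8_mixed = mixed_text.encode("utf-8")

# ASCII passes through unchanged; non-ASCII characters become "+...-" escapes.
print(len(ascii_text), len(utf7_ascii))
print(len(utf8_mixed), len(utf7_mixed), utf7_mixed)
```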

做自己的国王
#7 · 2019-07-22 23:11

Do you understand that files are divided into bytes? Thus, if you did that, you'd have 7 bits of the first letter in byte 1, plus 1 bit of the second letter; then in byte 2, you'd have 6 bits of the second letter and 2 bits of the third, and so on. It would look like this:

|AAAAAAAB|BBBBBBCC|CCCCCDDD|DDDDEEEE|EEEFFFFF|FF...
 \------/ \------/ \------/ \------/ \------/
   byte     byte     byte     byte     byte
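The layout in the diagram can be produced with a bit accumulator. This is only a sketch: pack_bits is a made-up name, and padding the final partial byte with zero bits is an assumption. Because of that padding, a decoder would also need to know how many values were stored.

```python
def pack_bits(values: bytes) -> bytes:
    """Concatenate 7-bit values into one continuous bit stream, MSB first."""
    acc = 0      # bits waiting to be written
    nbits = 0    # how many bits in acc are valid
    out = bytearray()
    for v in values:
        if v > 0x7F:
            raise ValueError("every value must fit in 7 bits")
        acc = (acc << 7) | v
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)   # emit the oldest 8 bits
        acc &= (1 << nbits) - 1                 # keep only the leftover bits
    if nbits:
        # Pad the last partial byte with zero bits on the right.
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)


# 'A' = 1000001, 'B' = 1000010  ->  10000011 00001000
print(pack_bits(b"AB").hex())  # 8308
```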