How to compress a String in Java?

Posted 2019-01-07 08:41

I use GZIPOutputStream or ZipOutputStream to compress a String (its length() is less than 20), but the compressed result is longer than the original string.

On one site, some people said this happens because my original string is too short, and that GZIPOutputStream only pays off on longer strings.

So, can somebody help me compress a short String?

My function is like:

String compress(String original) throws Exception {

}

Update:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// ZipUtil
public class ZipUtil {
    public static String compress(String str) throws IOException {
        if (str == null || str.length() == 0) {
            return str;
        }

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // try-with-resources closes the stream, which also writes the gzip trailer
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            // use an explicit charset instead of the platform default
            gzip.write(str.getBytes("UTF-8"));
        }
        // ISO-8859-1 maps every byte to exactly one char, so no bytes are lost
        return out.toString("ISO-8859-1");
    }

    public static void main(String[] args) throws IOException {
        String string = "admin";
        System.out.println("after compress:");
        System.out.println(ZipUtil.compress(string));
    }
}

The result is:

[screenshot of the output; image not preserved]
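The size increase is easier to see by counting bytes than by printing the compressed data. The following is a minimal sketch, not part of the original question: the class name GzipOverheadDemo and the helper gzip are made up for illustration. It relies on the gzip format having a fixed header and trailer, which are added no matter how short the input is.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class: compares input size with gzip output size.
class GzipOverheadDemo {
    // Returns the gzip-compressed form of the UTF-8 bytes of the input.
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String shortText = "admin";
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("admin ");
        }
        String longText = sb.toString();

        // The short input grows, the long repetitive input shrinks.
        System.out.println(shortText.length() + " bytes -> " + gzip(shortText).length);
        System.out.println(longText.length() + " bytes -> " + gzip(longText).length);
    }
}
```

The fixed header and trailer alone are larger than a 5-character input, so a short string cannot come out smaller with gzip, whatever its content.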

10 answers
一夜七次
#2 · 2019-01-07 09:07

Huffman coding might help, but only if a few characters occur very frequently in your small String.

虎瘦雄心在
#3 · 2019-01-07 09:07

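If you know that your strings are mostly ASCII, you could encode them as UTF-8.

byte[] bytes = string.getBytes("UTF-8");

This may reduce the memory size by about 50%, because a Java String has traditionally stored each character as a 2-byte UTF-16 code unit, while an ASCII character takes a single byte in UTF-8. However, you will get a byte array out and not a String. If you are writing it to a file, though, that should not be a problem.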

To convert back to a String:

private final Charset UTF8_CHARSET = Charset.forName("UTF-8");
...
String s = new String(bytes, UTF8_CHARSET);
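
To check the saving on your own data, the round trip can be sketched as follows. The class and method names (Utf8SizeDemo, toUtf8, fromUtf8) are illustrative only, and the 2-bytes-per-char figure assumes the string is held as UTF-16.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: encodes to UTF-8 and back, for size comparison.
class Utf8SizeDemo {
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "admin";
        byte[] utf8 = toUtf8(s);
        // s.length() * 2 is the size of the same text as UTF-16 code units.
        System.out.println("UTF-16: " + (s.length() * 2) + " bytes, UTF-8: " + utf8.length + " bytes");
        System.out.println("round trip ok: " + s.equals(fromUtf8(utf8)));
    }
}
```

For text outside ASCII the saving shrinks or disappears, since those characters take two or more bytes in UTF-8.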
唯我独甜
#4 · 2019-01-07 09:13

The ZIP (DEFLATE) algorithm is a combination of LZ77 and Huffman coding. You can use one of these algorithms separately.

The compression is based on 2 factors:

  • the repetition of substrings in your original string (LZ77): if there are many repetitions, the compression will be efficient. This algorithm performs well on long plain text, since words are often repeated
  • the frequency of each character in the string (Huffman): the more unbalanced the distribution of characters, the more efficient the compression

In your case, you should try a dictionary-based algorithm such as LZW on its own. In its basic form, it compresses the string without adding any metadata, because the decoder rebuilds the dictionary as it reads the codes, so it is probably better suited to short strings.

For the Huffman algorithm, the coding tree has to be sent with the compressed text. So, for a small text, the result can be larger than the original text, because of the tree.
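
As an illustration of the dictionary-only approach, here is a minimal LZW sketch. It is a textbook-style version written for this answer, not code from any library: the class name LzwSketch is made up, the output is a list of integer codes rather than packed bits, and it assumes every input character is below 256.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of basic LZW; codes 0-255 stand for single characters.
class LzwSketch {
    static List<Integer> compress(String text) {
        Map<String, Integer> dict = new HashMap<>();
        for (int i = 0; i < 256; i++) {
            dict.put(String.valueOf((char) i), i);
        }
        List<Integer> out = new ArrayList<>();
        String w = "";
        for (char c : text.toCharArray()) {
            if (c > 255) {
                throw new IllegalArgumentException("char out of range: " + c);
            }
            String wc = w + c;
            if (dict.containsKey(wc)) {
                w = wc;                      // keep extending the current match
            } else {
                out.add(dict.get(w));        // emit the longest known prefix
                dict.put(wc, dict.size());   // learn the new substring
                w = String.valueOf(c);
            }
        }
        if (!w.isEmpty()) {
            out.add(dict.get(w));
        }
        return out;
    }

    static String decompress(List<Integer> codes) {
        if (codes.isEmpty()) {
            return "";
        }
        List<String> dict = new ArrayList<>();
        for (int i = 0; i < 256; i++) {
            dict.add(String.valueOf((char) i));
        }
        String w = dict.get(codes.get(0));
        StringBuilder sb = new StringBuilder(w);
        for (int k : codes.subList(1, codes.size())) {
            String entry;
            if (k < dict.size()) {
                entry = dict.get(k);
            } else if (k == dict.size()) {
                entry = w + w.charAt(0);     // code refers to the entry being built
            } else {
                throw new IllegalArgumentException("bad code: " + k);
            }
            sb.append(entry);
            dict.add(w + entry.charAt(0));   // rebuild the encoder's dictionary
            w = entry;
        }
        return sb.toString();
    }
}
```

Note that a string without repeated substrings, such as "admin", produces one code per character, so nothing is gained once the codes are written out with more than 8 bits each.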

乱世女痞
#5 · 2019-01-07 09:15

Take a look at the Huffman algorithm.

https://codereview.stackexchange.com/questions/44473/huffman-code-implementation

The idea is that each character is replaced with a sequence of bits whose length depends on the character's frequency in the text (the more frequent the character, the shorter the sequence).

You can read your entire text and build a table of codes, for example:

Symbol  Code
a       0
s       10
e       110
m       111

The algorithm builds a symbol tree based on the text input. The more evenly the text is spread over many different characters, the worse the compression will be.

But depending on your text, it could be effective.
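
A minimal sketch of building such a code table is shown below. It is an illustration written for this answer, not the code from the linked review: the names HuffmanSketch, buildCodes and encode are made up, and the result is a string of '0' and '1' characters rather than packed bits. Which branch gets 0 or 1 is arbitrary, so the exact codes may differ from the table above while the code lengths stay the same.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: derives Huffman codes from character frequencies.
class HuffmanSketch {
    static final class Node implements Comparable<Node> {
        final char symbol;
        final int weight;
        final Node left;
        final Node right;

        Node(char symbol, int weight) {
            this.symbol = symbol;
            this.weight = weight;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.symbol = '\0';
            this.weight = left.weight + right.weight;
            this.left = left;
            this.right = right;
        }

        @Override
        public int compareTo(Node other) {
            return Integer.compare(weight, other.weight);
        }
    }

    static Map<Character, String> buildCodes(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        PriorityQueue<Node> queue = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue()));
        }
        Map<Character, String> codes = new HashMap<>();
        if (queue.isEmpty()) {
            return codes;
        }
        if (queue.size() == 1) {
            codes.put(queue.peek().symbol, "0"); // single symbol still needs one bit
            return codes;
        }
        // Repeatedly merge the two lightest subtrees until one tree remains.
        while (queue.size() > 1) {
            queue.add(new Node(queue.poll(), queue.poll()));
        }
        assign(queue.poll(), "", codes);
        return codes;
    }

    private static void assign(Node node, String prefix, Map<Character, String> codes) {
        if (node.left == null) {
            codes.put(node.symbol, prefix);
            return;
        }
        assign(node.left, prefix + "0", codes);
        assign(node.right, prefix + "1", codes);
    }

    static String encode(String text, Map<Character, String> codes) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            bits.append(codes.get(c));
        }
        return bits.toString();
    }
}
```

With a text containing 8 a's, 4 s's, 2 e's and 1 m, the code lengths come out as 1, 2, 3 and 3 bits, matching the shape of the example table. The code table itself still has to be stored or transmitted somewhere, which is what hurts on short inputs.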

再贱就再见
#6 · 2019-01-07 09:23

Huffman encoding is a sensible option here. Gzip and friends do this, but the way they work is to build a Huffman tree for the input, send that, then send the data encoded with the tree. If the tree is large relative to the data, there may be no saving in size.

However, it is possible to avoid sending a tree: instead, you arrange for the sender and receiver to already have one. It can't be built specifically for every string, but you can have a single global tree used to encode all strings. If you build it from the same language as the input strings (English or whatever), you should still get good compression, although not as good as with a custom tree for every input.
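
The JDK does not expose a way to preload a Huffman tree, but java.util.zip offers a closely related mechanism: a preset dictionary that both sides agree on in advance, set with Deflater.setDictionary and Inflater.setDictionary. The sketch below uses that instead of a shared tree, so it illustrates the same idea with a different technique. The class name and the dictionary contents are made up for the example; in practice the dictionary would be built from typical input strings.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: sender and receiver share DICT ahead of time.
class SharedDictionaryDemo {
    // Assumed dictionary content, chosen only for illustration.
    static final byte[] DICT =
            "admin administrator user guest password login".getBytes(StandardCharsets.UTF_8);

    static byte[] compress(String text, boolean useDictionary) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        if (useDictionary) {
            deflater.setDictionary(DICT); // must be set before any data is deflated
        }
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0) {
                    if (inflater.needsDictionary()) {
                        inflater.setDictionary(DICT); // stream was made with the shared dictionary
                    } else if (inflater.needsInput()) {
                        throw new IllegalArgumentException("truncated input");
                    }
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Calling compress("administrator password login", true) and compress("administrator password login", false) and comparing the lengths shows how much the shared dictionary helps when the input resembles it. For input unrelated to the dictionary there is no gain.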

仙女界的扛把子
#7 · 2019-01-07 09:27

Compression algorithms almost always have some form of space overhead, which means that they are only effective when compressing data which is sufficiently large that the overhead is smaller than the amount of saved space.

Compressing a string which is only 20 characters long is not easy, and it is not always possible. If you have repetition, Huffman coding or simple run-length encoding might be able to compress it, but probably not by very much.
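
For reference, simple run-length encoding can be sketched as below. This is an illustrative version with made-up names (RunLengthSketch, encode, decode). It writes each run as a decimal count followed by the character, and therefore assumes the input contains no ASCII digits.

```java
// Hypothetical sketch of run-length encoding: "aaaabbbcc" becomes "4a3b2c".
class RunLengthSketch {
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c >= '0' && c <= '9') {
                throw new IllegalArgumentException("digits are not supported by this sketch");
            }
            int j = i;
            while (j < s.length() && s.charAt(j) == c) {
                j++; // extend the run of identical characters
            }
            sb.append(j - i).append(c);
            i = j;
        }
        return sb.toString();
    }

    static String decode(String s) {
        StringBuilder sb = new StringBuilder();
        int count = 0;
        for (char c : s.toCharArray()) {
            if (c >= '0' && c <= '9') {
                count = count * 10 + (c - '0'); // accumulate a multi-digit count
            } else {
                for (int k = 0; k < count; k++) {
                    sb.append(c);
                }
                count = 0;
            }
        }
        return sb.toString();
    }
}
```

Here encode("aaaabbbcc") returns "4a3b2c", while encode("admin") returns "1a1d1m1i1n", which is twice as long as the input. That is the overhead problem described above in its simplest form.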
