Why is the gzip-compressed buffer size greater than the uncompressed one?

Published 2019-08-31 13:33

Question:

I'm trying to write a compression utility class.
But during testing, I find that the result is larger than the original buffer.
Is my code correct?

Please see the code:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.commons.lang3.ArrayUtils;
import org.apache.commons.lang3.Validate;

/**
 * This class provides compression ability.
 * <p>
 * Supported:
 * <li>GZIP
 * <li>Deflate
 */
public class CompressUtils {
    final public static int DEFAULT_BUFFER_SIZE = 4096; // Compress/Decompress buffer is 4K

    /**
     * GZIP Compress
     * 
     * @param data The data will be compressed
     * @return The compressed data
     * @throws IOException
     */
    public static byte[] gzipCompress(byte[] data) throws IOException {
        Validate.isTrue(ArrayUtils.isNotEmpty(data));

        ByteArrayInputStream bis = new ByteArrayInputStream(data);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();

        try {
            gzipCompress(bis, bos);
            bos.flush();
            return bos.toByteArray();
        } finally {
            bis.close();
            bos.close();
        }
    }

    /**
     * GZIP Decompress
     * 
     * @param data The data to be decompressed
     * @return The decompressed data
     * @throws IOException
     */
    public static byte[] gzipDecompress(byte[] data) throws IOException {
        Validate.isTrue(ArrayUtils.isNotEmpty(data));

        ByteArrayInputStream bis = new ByteArrayInputStream(data);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();

        try {
            gzipDecompress(bis, bos);
            bos.flush();
            return bos.toByteArray();
        } finally {
            bis.close();
            bos.close();
        }
    }

    /**
     * GZIP Compress
     * 
     * @param is The input stream to be compressed
     * @param os The compressed result
     * @throws IOException
     */
    public static void gzipCompress(InputStream is, OutputStream os) throws IOException {
        GZIPOutputStream gos = null;

        byte[] buffer = new byte[DEFAULT_BUFFER_SIZE];
        int count = 0;

        try {
            gos = new GZIPOutputStream(os);
            while ((count = is.read(buffer)) != -1) {
                gos.write(buffer, 0, count);
            }
            gos.finish();
            gos.flush();
        } finally {
            if (gos != null) {
                gos.close();
            }
        }
    }

    /**
     * GZIP Decompress
     * 
     * @param is The input stream to be decompressed
     * @param os The decompressed result
     * @throws IOException
     */
    public static void gzipDecompress(InputStream is, OutputStream os) throws IOException {
        GZIPInputStream gis = null;

        int count = 0;
        byte[] buffer = new byte[DEFAULT_BUFFER_SIZE];

        try {
            gis = new GZIPInputStream(is);
            while ((count = gis.read(buffer)) != -1) { // read from gis, not is: reading from is would copy the compressed bytes through unchanged
                os.write(buffer, 0, count);
            }
        } finally {
            if (gis != null) {
                gis.close();
            }
        }
    }
}

And here's the test code:

public class CompressUtilsTest {
    private Random random = new Random();

    @Test
    public void gzipTest() throws IOException {
        byte[] buffer = new byte[1023];
        random.nextBytes(buffer);
        System.out.println("Original: " + Hex.encodeHexString(buffer));

        byte[] result = CompressUtils.gzipCompress(buffer);
        System.out.println("Compressed: " + Hex.encodeHexString(result));

        byte[] decompressed = CompressUtils.gzipDecompress(result);
        System.out.println("DeCompressed: " + Hex.encodeHexString(decompressed));

        Assert.assertArrayEquals(buffer, decompressed);
    }
}

And the result is: the original is 1023 bytes long, while the compressed output is 1036 bytes long.

How does this happen?

Answer 1:

In your test you fill the buffer with random bytes.

GZIP (which uses the DEFLATE format internally) consists of two stages:

  1. LZ77 compression (finding repeated sequences)
  2. Encoding using a Huffman code

The former relies heavily on repeated sequences in the input. Basically it says something like: "the next 10 bytes are the same as the 10 bytes starting at index X". In your case there are (almost certainly) no such repeated sequences, so the first stage achieves no compression.

The Huffman encoding, on the other hand, does run, but in total the GZIP overhead (the fixed header and trailer, plus the description of the Huffman code that was used) outweighs any gain from compressing such input.
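A minimal sketch of this effect (not part of the question's code; the class and helper names here are made up for illustration): compressing 1023 random bytes versus 1023 identical bytes with the JDK's own `GZIPOutputStream` shows the two extremes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

public class GzipSizeDemo {
    // Returns the gzip-compressed size of the input, in bytes.
    static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gos = new GZIPOutputStream(bos)) {
            gos.write(data);
        } // close() finishes the stream and flushes the trailer
        return bos.toByteArray().length;
    }

    public static void main(String[] args) throws IOException {
        byte[] random = new byte[1023];
        new Random(42).nextBytes(random);   // effectively incompressible: no repeated sequences

        byte[] repetitive = new byte[1023]; // highly compressible: all zeros

        System.out.println("random:     " + gzipSize(random));     // slightly larger than 1023
        System.out.println("repetitive: " + gzipSize(repetitive)); // far smaller than 1023
    }
}
```

The random input comes out larger than it went in, while the all-zeros input collapses to a few dozen bytes.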

If you test your algorithm with real files, you will get some meaningful results.

The best results are usually achieved when compressing structured files such as XML.



Answer 2:

It's because compression generally works well on medium-to-large inputs (1023 bytes is quite small), and it works best on data that contains repeated patterns, not on random data.
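To see how the format's fixed cost dominates small inputs, here is a quick sketch (class and helper names are made up for illustration): the gzip header and CRC trailer alone account for 18 bytes, so even a one-byte payload produces a much larger stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class GzipOverheadDemo {
    // Returns the gzip-compressed size of the input, in bytes.
    static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gos = new GZIPOutputStream(bos)) {
            gos.write(data);
        }
        return bos.toByteArray().length;
    }

    public static void main(String[] args) throws IOException {
        // A 1-byte input still carries the 10-byte gzip header and
        // 8-byte trailer (CRC-32 + input length), plus DEFLATE framing.
        System.out.println(gzipSize(new byte[1])); // noticeably larger than 1
    }
}
```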



Tags: java gzip