Huffman Tree in Java

2019-04-17 12:22发布

问题:

I have a problem with my Huffman tree code. In the main method I input a String of Symbols and I also input an Integer array containing the frequency of the Symbols. It should print out each Symbol and its Huffman code, but I think its wrong...

Here is the code:

 package huffman;

import java.util.*;

abstract class HuffmanTree implements Comparable<HuffmanTree> {
    public final int frequency; // the frequency of this tree
    public HuffmanTree(int freq) { frequency = freq; }

    // compares on the frequency
    public int compareTo(HuffmanTree tree) {
        return frequency - tree.frequency;
    }
}

class HuffmanLeaf extends HuffmanTree {
    public final char value; // the character this leaf represents

    public HuffmanLeaf(int freq, char val) {
        super(freq);
        value = val;
    }
}

class HuffmanNode extends HuffmanTree {
    public final HuffmanTree left, right; // subtrees

    public HuffmanNode(HuffmanTree l, HuffmanTree r) {
        super(l.frequency + r.frequency);
        left = l;
        right = r;
    }
}

public class Huffman {
    // input is an array of frequencies, indexed by character code
    public static HuffmanTree buildTree(int[] charFreqs, char[] test2) {
        PriorityQueue<HuffmanTree> trees = new PriorityQueue<HuffmanTree>();
        // initially, we have a forest of leaves
        // one for each non-empty character
        for (int i = 0; i < charFreqs.length; i++)
            if (charFreqs[i] > 0)
                trees.offer(new HuffmanLeaf(charFreqs[i], test2[i]));

        assert trees.size() > 0;
        // loop until there is only one tree left
        while (trees.size() > 1) {
            // two trees with least frequency
            HuffmanTree a = trees.poll();
            HuffmanTree b = trees.poll();

            // put into new node and re-insert into queue
            trees.offer(new HuffmanNode(a, b));
        }
        return trees.poll();
    }

    public static void printCodes(HuffmanTree tree, StringBuffer prefix) {
        assert tree != null;
        if (tree instanceof HuffmanLeaf) {
            HuffmanLeaf leaf = (HuffmanLeaf)tree;

            // print out character, frequency, and code for this leaf (which is just the prefix)
            System.out.println(leaf.value + "\t" + leaf.frequency + "\t" + prefix);

        } else if (tree instanceof HuffmanNode) {
            HuffmanNode node = (HuffmanNode)tree;

            // traverse left
            prefix.append('0');
            printCodes(node.left, prefix);
            prefix.deleteCharAt(prefix.length()-1);

            // traverse right
            prefix.append('1');
            printCodes(node.right, prefix);
            prefix.deleteCharAt(prefix.length()-1);
        }
    }

    public static void main(String[] args) {
        //Symbols:
        String str = "12345678"; 
        char[] test2 = str.toCharArray();
        //Frequency (of the symbols above):
        int[] charFreqs = {36,18,12,9,7,6,5,4};


        // build tree
        HuffmanTree tree = buildTree(charFreqs,test2);

        // print out results
        System.out.println("SYMBOL\tFREQ\tHUFFMAN CODE");
        printCodes(tree, new StringBuffer());
    }
}

The output I get is:

SYMBOL  FREQ    HUFFMAN CODE
1           36          0
3           12          100
6           6           1010
5           7           1011
2           18          110
4           9           1110
8           4           11110
7           5           11111

Thats weird, for example Symbol 7 should be: 11110 and Symbol 8 should be: 11111

Can you help me please?

回答1:

The assignment of the bit patterns does not matter to the optimality of the code. The assignment you have will work just fine. There is nothing weird about it. You could have also expressed concern over 2:110, 3:100, or 4:1110, 5:1011, but those are fine too.

The only reason to impose an order on the codes is to reduce the number of bits needed to convey the code from the compressor to the decompressor. Instead of sending the codes, you can send the code lengths for each symbol, so long as the code is constructed identically on both sides from just the lengths.

In that case, the approach is usually to assign the code in numerical order to a sorted list of symbols. Then you would indeed have the symbol 7 with a lower code "value" than the symbol 8, if that's the order in which they are assigned.

For your example, such a canonical code would be:

1: 1 - 0
2: 3 - 100
3: 3 - 101
4: 4 - 1100
5: 4 - 1101
6: 4 - 1110
7: 5 - 11110
8: 5 - 11111

You simply takes the lengths and within the same length, sort the symbols. Then assign codes starting with 0 and incrementing, adding bits to the end as you step up lengths.

Note that this is an unusual example, where the symbol order is also the frequency order. Normally that's not the case.



回答2:

Just You add one more 0 to understand finish bit. (over 3 bit reading)

1 36 0 3 12 100 6 6 1010 5 7 1011'0 2 18 110 4 9 1110 8 4 11110 7 5 11111'0



回答3:

To answer the question from the comments of the answer:

Hey Mark, thanks for your help, but I dont really understand how you got those codes? Do I have to change a lot in the code?<

Mark simply refers to the goal of the huffman coding to find the most efficient depth (number of bits) for each symbol so that the overall encoding of all symbols (frequency[symbol] * codeLenght[symbol] for all symbols) is minimized.

So actually all you need to do is to determine the depth (level) of the leaf for each symbol. Now sort this by the depth of each symbol and start to count them up.

Now you only need to count the pattern up. Simple as that:

Example:
2x2: 00, 01  (next is 10)
4x3: 10 + (00, 01, 10) = 1000, 1001, 1010 (next is 1011)
5x3: 1011 + (0, 1, 0 + 10) = 10110, 10111, 10110 + 10 = 11000 (next would be 11001)...

The last part shows what happens if the number of elements is bigger than the available bit difference between both groups. It just gets added to the prefix.

This way a Huffman code is created that uses the least amount of space. Since this is only one tree you can also start with 11111 and remove 1 and get another code system that is equally efficient in terms of number of bits.

One thing that can be added is that there are modification that increase the number of 1 (or 0) in comparism to 0 (or 1) so you have another chance of compressing those bit pattern used to compress a message even further.


Summary: Sort the symbols by its depth within the frequence tree. Construct the code by adding (substracting) by one. The next free code is the start prefix for the next group. For each new member just add (substract) up. And you start with filling 0 (1) within the code.

In order to store such a tree remember by using the same algorithm for the same symbole groups you need to store only the following informations:

Number of groups n, n * { +bitDepth, number of symbols i, s1,s2...si}. Since the storing of n, +bitDepth, number of symbols itself is a subject of compression you can either use a variable bit format or even issue a huffman tree since you will find stastically unevenly distributions which are the main reason you can see compression taking place with huffman trees.