CSE 326 - Winter 2003

Assignment 4

Data Compression using Huffman Trees and Priority Queues

Assigned: Monday, Feb. 24
Electronic submission due Monday, March 3, by 11:00am. Hardcopy due in class March 3.



This is a group assignment. You can work in groups of 1 to 3. Read all the specifications carefully.

Most of us use utilities like Zip or gzip to compress data files. In this assignment, you will be learning one of the classic algorithms for data compression called "Huffman encoding." The assignment will make you familiar with priority queues and will let you apply data structures to solve practical problems.

Variable Length Codes

Suppose we want to compress a 100,000-byte data file that we know contains only the uppercase letters A through F. Since we have only six distinct characters to encode, we can represent each of them with three bits rather than the eight bits normally used to store characters. For example:
Letter     A    B    C    D    E    F  
Codeword   000  001  010  011  100  101  
The file size is 3*100,000 bits or 37,500 bytes. Can we do better? What if we knew the relative frequencies at which each letter occurred? It would be logical to assign shorter codes to the most frequent letters and save longer codes for the infrequent letters. For example, consider this code:
Letter         A    B    C    D    E     F  
Frequency(K)   45   13   12   16   9     5  
Codeword       0    101  100  111  1101  1100  
Using this code, our file can be represented with (45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4) × 1000 = 224,000 bits or 28,000 bytes. We have achieved 72% compression over the original file.

Prefix Codes

We consider here only codes in which no code is a prefix of some other codeword. Such codes are called prefix codes. Notice that in our variable-length code, no codeword is a prefix of any other codeword. For example, we have a codeword 0, so no other codeword starts with 0. And both of our four-bit codewords start with 110, which is not a codeword. Prefix codes are desirable because they simplify decoding. We simply can accumulate bits from a stream until we have completed a codeword. The decoding process needs a convenient representation for the prefix code so that the initial codeword can be easily picked off. A binary tree whose leaves are the given characters provides one such representation. We interpret the binary codeword for a character as the path from the root to that child, where 0 means "go to the left child" and 1 means "go to the right child". The trees for the two codes of our examples are shown below. Note that these are not binary search trees, since the leaves need not appear in sorted order, and that internal nodes do not contain character keys.
                 100                           100
             /         \                     /     \
            /           \                   /       \
           /             \                 /         \
          /               \               /           \
         86               14             A            55
       /    \            /              [45]      /       \
    58        28       14                       25        30
   /  \      /  \     /  \                     /  \      /  \
  A    B    C    D   E    F                   C    B   14    D
[45] [13] [12] [16] [9]  [5]                [12] [13] /  \  [16]
                                                     F    E
                                                    [5]  [9]
                                                                         
 Figure 1. Fixed Length Code            Figure 2. Variable length code

The leaves are labeled with their frequencies and the branches with the total frequencies of the leaves in their subtrees.

Huffman Algorithm

Huffman invented a simple algorithm for constructing such trees given the set of characters and their frequencies. The corresponding code is called the Huffman code, and is shown to be the optimal prefix code. Let us assume that we have a set C of n characters and that each character c in C has the frequency f[c].

The algorithm constructs the tree in a bottom-up way. It starts by creating a single-node tree for each character in C. At each step, it takes two trees with the smallest frequencies and merges them by making them children of a new root. This root is then labeled with the sum of the frequencies of its two children. This process is repeated |C|-1 times, after which we are left with only a single tree.

A priority queue Q, keyed on the frequencies f, is used to identify the two least-frequent trees at each step.

Huffman(C)
    n = |C|
    Q = C
    for i = 1 to n - 1
        z = new Node()
        z.left = Q.extractMin()
        z.right = Q.extractMin()
        z.f = z.left.f + z.right.f
        Q.insert(z)
    done

Your Task

Your task is to create a program Encode to compress a file using Huffman encoding. You should implement the following data structures: The program should do the following:
  1. Go through the input file and calculate the count of all the characters. You can assume that the file only contains characters from `a' to `z' (all in lower case), and you can use an array to store the counts.
  2. Initialize the priority queue by creating a TreeNode corresponding to each character.
  3. Create the Huffman tree using the algorithm described above.
  4. For each leaf in the resulting tree (corresponding to a character), calculate its codeword as described above, as a String of 0s and 1s
  5. Using the lengths of codewords and the counts of characters, calculate the percentage compression that Huffman encoding achieves over the original file.

The program should take a single argument, which is the name of the input file. It should be called as:

java Encode input.txt
The program should output:
  1. the Huffman code of each character as a sequence of 0s and 1s.
  2. percentage compression achieved by this code.
You can use the file input.txt to test your program.

As usual, you may not use existing data structure classes, except for built-in arrays.

Turn-in Instructions

This is a group assignment. You may work in a group of one to three people.

Turn in electronically all the relevant source files only. Follow the usual instructions. (The recommended printing command is shown on that page.)

Bring to class the printouts of the source files as well as the outputs of your program. Also, write down separately anything interesting about your implementation (especially if you did something for extra credit). Be sure to specify the members of your group (including your login names) and which of you did the electronic submission.