CSE 326 - Winter 2003
Assignment 4
Data Compression using Huffman Trees and Priority Queues
Assigned: Monday, Feb. 24
Electronic submission due Monday, March 3, by 11:00am. Hardcopy due in class March 3.
This is a group assignment. You can work in groups of 1 to 3.
Read all the specifications carefully.
Most of us use utilities like Zip or gzip to compress data files. In
this assignment, you will be learning one of the classic algorithms for
data compression called "Huffman encoding." The assignment will make you
familiar with priority queues and will let you apply data structures to solve
practical problems.
Variable Length Codes
Suppose we want to compress a 100,000-byte data file that we know
contains only the uppercase letters A through F. Since we have only
six distinct characters to encode, we can represent each of them with
three bits rather than the eight bits normally used to store
characters. For example:
Letter A B C D E F
Codeword 000 001 010 011 100 101
The file size is 3*100,000 bits or 37,500 bytes. Can we do better?
What if we knew the relative frequencies at which each letter
occurred? It would be logical to assign shorter codes to the most
frequent letters and save longer codes for the infrequent letters. For
example, consider this code:
Letter A B C D E F
Frequency(K) 45 13 12 16 9 5
Codeword 0 101 100 111 1101 1100
Using this code, our file can be represented with (45×1 + 13×3 + 12×3
+ 16×3 + 9×4 + 5×4) × 1000 = 224,000 bits or 28,000 bytes. We have
achieved 72% compression over the original file.
Prefix Codes
We consider here only codes in which no code is a prefix of some other
codeword. Such codes are called prefix codes. Notice that in
our variable-length code, no codeword is a prefix of any other
codeword. For example, we have a codeword 0, so no other codeword
starts with 0. And both of our four-bit codewords start with 110,
which is not a codeword. Prefix codes are desirable because they
simplify decoding. We simply can accumulate bits from a stream
until we have completed a codeword.
The decoding process needs a convenient representation for the prefix
code so that the initial codeword can be easily picked off. A binary
tree whose leaves are the given characters provides one such
representation. We interpret the binary codeword for a character as
the path from the root to that child, where 0 means "go to the left
child" and 1 means "go to the right child". The trees for the two
codes of our examples are shown below. Note that these are not binary
search trees, since the leaves need not appear in sorted order, and that
internal nodes do not contain character keys.
100 100
/ \ / \
/ \ / \
/ \ / \
/ \ / \
86 14 A 55
/ \ / [45] / \
58 28 14 25 30
/ \ / \ / \ / \ / \
A B C D E F C B 14 D
[45] [13] [12] [16] [9] [5] [12] [13] / \ [16]
F E
[5] [9]
Figure 1. Fixed Length Code Figure 2. Variable length code
The leaves are labeled with their frequencies and the branches with the
total frequencies of the leaves in their subtrees.
Huffman Algorithm
Huffman invented a simple algorithm for constructing such trees given
the set of characters and their frequencies. The corresponding code is
called the Huffman code, and is shown to be the optimal prefix
code.
Let us assume that we have a set C of n characters and that each
character c in C has the frequency f[c].
The algorithm constructs the tree in a bottom-up way. It starts by
creating a single-node tree for each character in C. At each step, it
takes two trees with the smallest frequencies and merges them by making them
children of a new root. This root is then labeled with the sum of the
frequencies of its two children. This process is repeated |C|-1 times,
after which we are left with only a single tree.
A priority queue Q, keyed on the frequencies f, is used to identify the two
least-frequent trees at each step.
Huffman(C)
n = |C|
Q = C
for i = 1 to n - 1
z = new Node()
z.left = Q.extractMin()
z.right = Q.extractMin()
z.f = z.left.f + z.right.f
Q.insert(z)
done
Your Task
Your task is to create a program Encode to compress a file using
Huffman encoding.
You should implement the following data structures:
- A TreeNode class to represent a node in the Huffman tree.
- A priority queue to store elements of TreeNode class with the
frequency of TreeNode as the key. You may use a binary heap as your priority queue.
The program should do the following:
- Go through the input file and calculate the count of all the
characters. You can assume that the file only contains characters
from `a' to `z' (all in lower case), and you can use an array to store
the counts.
- Initialize the priority queue by creating a TreeNode corresponding to
each character.
- Create the Huffman tree using the algorithm described above.
- For each leaf in the resulting tree (corresponding to a character),
calculate its codeword as described above, as a String of 0s and 1s
- Using the lengths of codewords and the counts of characters,
calculate the percentage compression that Huffman encoding achieves
over the original file.
The program should take a single argument, which is the name of the
input file. It should be called as:
java Encode input.txt
The program should output:
- the Huffman code of each character as a sequence of 0s and 1s.
- percentage compression achieved by this code.
You can use the file input.txt to test your
program.
As usual, you may not use existing data structure classes, except for
built-in arrays.
Turn-in Instructions
This is a group assignment. You may work in a group of one to three people.
Turn in electronically all the relevant source files only. Follow
the usual instructions. (The recommended printing
command is shown on that page.)
Bring to class the printouts of the source files as well as the outputs of your program. Also, write down separately
anything interesting about your implementation (especially if you did
something for extra credit). Be sure to specify the members of your
group (including your login names) and which of you did the electronic
submission.