CSE143 Notes for Friday, 11/21/08

I used the lecture time to describe the next programming assignment. We are going to write a program that compresses a text file by creating something known as a "Huffman tree."

We are exploring a technique known as "compression" that involves storing a file in a special format that allows it to take up less space on disk. Programs like WinZip use sophisticated compression algorithms to do this. We are going to examine a basic form of compression that can be implemented with binary trees.

Normally characters are stored as a sequence of bits of a fixed length. One such scheme is known as ASCII:

        A merican
        S tandard
        C ode for
        I nformation
        I nterchange
The original ASCII character set had a total of 128 characters that could be stored in 7 bits. The eighth bit was often used to indicate "parity" (odd or even), although this so-called parity bit often turned out to be more trouble than it was worth. Later we found ourselves wanting more than the 128 standard characters, and that led to something known as extended ASCII, which has 256 characters.

The nice thing about extended ASCII is that it fits nicely in 8 bits (what is known as a byte). The different integers we can form with one byte range from 00000000 to 11111111 in binary (which is 0 to 255 in base 10). So with one byte we can store 256 different sequences.
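As a quick illustration of that 8-bit encoding (this fragment is just for demonstration and is not part of the assignment), Java will happily show a character's numeric code and its binary form:

        // prints the ASCII code for 'A' and its binary representation;
        // Integer.toBinaryString drops leading zeros, so the 7 significant
        // bits "1000001" are what a full byte 01000001 stores
        public class AsciiDemo {
            public static void main(String[] args) {
                char ch = 'A';
                System.out.println((int) ch + " = " + Integer.toBinaryString(ch));
                // output: 65 = 1000001
            }
        }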

Most simple text files are stored this way, as a sequence of bytes each representing one character. To compress such a file, we need to come up with a different encoding scheme. The key idea is to abandon the requirement that the number of bits be a fixed number like 8. Instead we allow ourselves to have variable length codes. That way, we can use short codes for characters that occur often and we can have long codes for characters that appear less frequently.

The Huffman algorithm is a particular approach to finding such an encoding. We construct a binary tree that indicates how each different character is to be encoded. The particular tree we build will depend on the frequency of each character in the file we are trying to compress. So in this assignment, the HuffmanTree constructor is passed a map of character frequencies.

First we construct a leaf node for each character with a non-zero frequency (we don't need codes for the other characters since they don't appear in the file). This gives us a list of leaf nodes with different frequencies. We then pick the two with the lowest frequencies and combine them into a new subtree whose frequency is the sum of the frequencies of the two we are combining. Once we make that subtree, we put it back into the list.

This process is repeated until we get down to one tree. Each time we remove two, combine them, and put the new subtree back into the list, so each pass brings us one step closer to having a single tree.
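Here is a rough sketch of that loop in Java, assuming a hypothetical HuffNode class (the names, fields, and the use of a PriorityQueue here are my own choices for illustration; the assignment's actual classes may differ). The priority queue plays the role of the "list," always handing back the lowest-frequency tree first:

        import java.util.*;

        // hypothetical node class: a leaf stores a character code, an internal
        // node stores -1; nodes are ordered by frequency
        class HuffNode implements Comparable<HuffNode> {
            public int character;
            public int frequency;
            public HuffNode left;
            public HuffNode right;

            public HuffNode(int character, int frequency,
                            HuffNode left, HuffNode right) {
                this.character = character;
                this.frequency = frequency;
                this.left = left;
                this.right = right;
            }

            public int compareTo(HuffNode other) {
                return this.frequency - other.frequency;
            }
        }

        class HuffBuilder {
            // builds a Huffman tree from a map of character frequencies and
            // returns its root (assumes at least one character occurs)
            public static HuffNode build(Map<Integer, Integer> counts) {
                Queue<HuffNode> queue = new PriorityQueue<HuffNode>();
                // one leaf for each character with a non-zero frequency
                for (int ch : counts.keySet()) {
                    queue.add(new HuffNode(ch, counts.get(ch), null, null));
                }
                // repeatedly combine the two lowest-frequency trees until
                // only one tree remains
                while (queue.size() > 1) {
                    HuffNode first = queue.remove();
                    HuffNode second = queue.remove();
                    queue.add(new HuffNode(-1,
                            first.frequency + second.frequency, first, second));
                }
                return queue.remove();
            }
        }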

Once the process is complete, we have the root of our HuffmanTree. We assign character codes by thinking of each left branch as a 0 and each right branch as a 1. The leaves of the tree each contain the information for a single character. The path from the root to the leaf tells us what code to use for that character.
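Extracting the codes from the finished tree is then a short recursive traversal. Continuing the sketch above (same hypothetical HuffNode class and java.util imports), left branches append a '0' and right branches append a '1' until a leaf is reached:

        // builds a map from each character to its binary code string by
        // walking the tree: left adds '0', right adds '1'
        public static Map<Integer, String> buildCodes(HuffNode root) {
            Map<Integer, String> codes = new HashMap<Integer, String>();
            buildCodes(root, "", codes);
            return codes;
        }

        private static void buildCodes(HuffNode node, String path,
                                       Map<Integer, String> codes) {
            if (node.left == null && node.right == null) {
                codes.put(node.character, path);   // leaf: record its code
            } else {
                buildCodes(node.left, path + "0", codes);
                buildCodes(node.right, path + "1", codes);
            }
        }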

I went through a detailed example that I won't reproduce here because there is a similar one in the assignment writeup.

In this assignment, you are responsible for building up a Huffman tree given the character frequencies and for creating a map of binary codes for each character in the tree. You also have to use this map to compress and decompress files. For compressing and decompressing, the frequencies don't matter; only the binary codes do. The frequencies are used only in constructing the tree.

One problem we run into is trying to create compact files of 0's and 1's. If we write them as characters, they will be written in the standard 8-bit format. Since the Huffman algorithm gets about a 50% reduction at best, that's not going to work very well: we'd have a multiplier of 0.5 from the compression and a multiplier of 8 from storing each bit as an 8-bit character. That means we'd turn a file of n characters into a file of 0.5 * 8 * n, or 4n, characters. In other words, our "compression" would quadruple the size of the file, which isn't very impressive.

To solve this problem, we have written two classes called BitOutputStream and BitInputStream that write and read a series of bits in a compact manner. The Encode program uses BitOutputStream to produce the encoded binary file. The Decode program opens this file as a BitInputStream and passes it to your HuffmanTree to have it do the actual decoding. BitOutputStream has a constructor, methods called writeBit and writeBits and a method called close. BitInputStream has a constructor, a method called readBit and a method called close.
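For example, the compression side might look roughly like the sketch below, assuming a code map like the one built earlier and assuming writeBit accepts a single 0 or 1 as an int (the exact signatures of the provided classes are described in the assignment writeup):

        // writes the Huffman code for each character of text, one bit at a
        // time; assumes codes maps each character (as an int) to its code string
        public static void encode(String text, Map<Integer, String> codes,
                                  BitOutputStream output) {
            for (int i = 0; i < text.length(); i++) {
                String code = codes.get((int) text.charAt(i));
                for (int j = 0; j < code.length(); j++) {
                    output.writeBit(code.charAt(j) - '0');   // '0' -> 0, '1' -> 1
                }
            }
            output.close();
        }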

The client program constructs BitInputStream and BitOutputStream objects and passes them to your HuffmanTree to compress and decompress files.

The operation you perform repeatedly is to go to the top of your tree and read bits from the input file, going left or right in the tree depending on whether you see a 0 or a 1 in the input stream. When you hit a leaf, you know that you've found the next character from the original file, so you write it to the output stream you've been passed. Then you go back to the top of the tree and start all over with the next code.
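In code, that repeated descent might look roughly like the sketch below (same hypothetical HuffNode class, plus a java.io.PrintStream for the output). It assumes readBit returns 0 or 1 and takes a character count so it knows when to stop; the real assignment may signal the end of the data differently:

        // repeatedly walks from the root, going left on a 0 bit and right on
        // a 1 bit; each leaf reached is the next character of the original file
        public static void decode(BitInputStream input, HuffNode root,
                                  PrintStream output, int charCount) {
            for (int i = 0; i < charCount; i++) {
                HuffNode current = root;
                while (current.left != null && current.right != null) {
                    if (input.readBit() == 0) {
                        current = current.left;
                    } else {
                        current = current.right;
                    }
                }
                output.write(current.character);   // found a leaf
            }
            output.close();
        }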


Stuart Reges
Last modified: Sat Nov 22 10:42:10 PST 2008