CSE143 Notes for Wednesday, 5/25/11

I used the lecture time to describe the next programming assignment. We are going to write a program that compresses a text file by creating something known as a "Huffman tree."

We are exploring a technique known as "compression" that involves storing a file in a special format that allows it to take up less space on disk. Programs like WinZip use sophisticated compression algorithms to do this. We are going to examine a basic form of this algorithm that can be implemented with binary trees.

Normally characters are stored as a sequence of bits of a fixed length. One such scheme is known as ASCII:

        A merican
        S tandard
        C ode for
        I nformation
        I nterchange

The original ASCII character set had a total of 128 characters that could be stored in 7 bits. The eighth bit was often used to indicate "parity" (odd or even), although this so-called parity bit often turned out to be more trouble than it was worth. Later we found ourselves wanting more than the 128 standard characters and that led to something known as extended ASCII which has 256 characters.

The nice thing about extended ASCII is that it fits nicely in 8 bits (what is known as a byte). The different integers we can form with one byte range from 00000000 to 11111111 in binary (which is 0 to 255 in base 10). So with one byte we can store 256 different sequences.

Most simple text files are stored this way, as a sequence of bytes each representing one character. To compress such a file, we need to come up with a different encoding scheme. The key idea is to abandon the requirement that the number of bits be a fixed number like 8. Instead we allow ourselves to have variable length codes. That way, we can use short codes for characters that occur often and we can have long codes for characters that appear less frequently.
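To make the idea of variable-length codes concrete, here is a small sketch that counts the bits needed for a short string under a fixed 8-bit scheme and under a made-up variable-length scheme. The class name, the sample text, and the codes themselves are assumptions chosen for illustration; they are not taken from the assignment.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: compares a fixed 8-bit encoding with a hand-picked
// variable-length code. The codes below are made up for this example.
class VariableLengthSketch {
    // Every character costs 8 bits in the fixed scheme.
    static int fixedBits(String text) {
        return text.length() * 8;
    }

    // Each character costs as many bits as its code is long.
    static int variableBits(String text, Map<Character, String> codes) {
        int total = 0;
        for (char ch : text.toCharArray()) {
            total += codes.get(ch).length();
        }
        return total;
    }

    static Map<Character, String> sampleCodes() {
        Map<Character, String> codes = new HashMap<>();
        codes.put('a', "0");   // most frequent character gets the shortest code
        codes.put('b', "10");
        codes.put('c', "11");
        return codes;
    }

    public static void main(String[] args) {
        String text = "aaaabbc";
        System.out.println(fixedBits(text));                    // 56
        System.out.println(variableBits(text, sampleCodes()));  // 10
    }
}
```

The frequent 'a' costs one bit each, so the seven characters fit in 10 bits instead of 56.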

The Huffman algorithm is a particular approach to finding such an encoding. We construct a binary tree that indicates how each different character is to be encoded. The particular tree we build will depend on the frequency of each character in the file we are trying to compress. So in the first part of this two-part assignment, the HuffmanTree constructor is passed an array of character frequencies.

First you construct a leaf node for each character with a non-zero frequency (we don't need codes for the other characters since they don't appear in the file). This gives us a list of leaf nodes with different frequencies. We now pick the two with lowest frequency and combine them into a new subtree whose frequency is the sum of the frequencies of the two we are combining. Once you make that subtree, you put it back into the list.

This process is repeated until you get down to one tree. Each time we remove two, combine them, and put the new subtree back into the list. That means that each time we get one closer to having a single tree.
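The combining process described above can be sketched as follows. This is a minimal illustration, not the assignment solution: the node class and its field names are hypothetical, and I am assuming java.util.PriorityQueue as a convenient stand-in for the "list" so that the two lowest-frequency nodes are always the next two removed.

```java
import java.util.PriorityQueue;

// Hypothetical node type for illustration; the assignment's node class may differ.
class SketchNode implements Comparable<SketchNode> {
    final int character;   // character code, or -1 for a branch node
    final int frequency;
    final SketchNode left;
    final SketchNode right;

    SketchNode(int character, int frequency) {
        this(character, frequency, null, null);
    }

    SketchNode(int character, int frequency, SketchNode left, SketchNode right) {
        this.character = character;
        this.frequency = frequency;
        this.left = left;
        this.right = right;
    }

    // Order nodes by frequency so the queue hands back the lowest first.
    public int compareTo(SketchNode other) {
        return Integer.compare(frequency, other.frequency);
    }
}

class BuildSketch {
    // counts[i] is the number of occurrences of character i.
    // Assumes at least one count is nonzero.
    static SketchNode build(int[] counts) {
        PriorityQueue<SketchNode> queue = new PriorityQueue<>();
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                queue.add(new SketchNode(i, counts[i]));
            }
        }
        // Each pass removes two nodes and adds one, so the size drops by one.
        while (queue.size() > 1) {
            SketchNode first = queue.remove();
            SketchNode second = queue.remove();
            int sum = first.frequency + second.frequency;
            queue.add(new SketchNode(-1, sum, first, second));
        }
        return queue.remove();
    }

    public static void main(String[] args) {
        int[] counts = new int[128];
        counts['a'] = 3;
        counts['b'] = 3;
        counts['c'] = 1;
        System.out.println(build(counts).frequency);  // 7, the total of all counts
    }
}
```

When two nodes have the same frequency, the order in which a PriorityQueue returns them is not specified, so different runs of this idea can produce different (equally good) trees.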

Once the process is complete, we have the root of our HuffmanTree. We assign character codes by thinking of each left branch as a 0 and each right branch as a 1. The leaves of the tree each contain the information for a single character. The path from the root to the leaf tells us what code to use for that character.
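Reading codes off the tree amounts to a traversal that remembers the path taken so far. The sketch below assumes a hypothetical node class and a small hand-built tree; it is one way to do the traversal, not necessarily the one expected in the assignment.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical node type for illustration only.
class CodeNode {
    final int character;   // character code; -1 for branch nodes
    final CodeNode left;
    final CodeNode right;

    CodeNode(int character) {
        this(character, null, null);
    }

    CodeNode(int character, CodeNode left, CodeNode right) {
        this.character = character;
        this.left = left;
        this.right = right;
    }

    boolean isLeaf() {
        return left == null && right == null;
    }
}

class CodeSketch {
    // Maps each character code in the tree to its string of 0's and 1's.
    static Map<Integer, String> codes(CodeNode root) {
        Map<Integer, String> result = new TreeMap<>();
        collect(root, "", result);
        return result;
    }

    // Appends "0" when going left and "1" when going right.
    private static void collect(CodeNode node, String path, Map<Integer, String> result) {
        if (node.isLeaf()) {
            result.put(node.character, path);
        } else {
            collect(node.left, path + "0", result);
            collect(node.right, path + "1", result);
        }
    }

    // Hand-built tree: 'a' on the left, a branch holding 'b' and 'c' on the right.
    static CodeNode sampleTree() {
        CodeNode rightBranch = new CodeNode(-1, new CodeNode('b'), new CodeNode('c'));
        return new CodeNode(-1, new CodeNode('a'), rightBranch);
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, String> entry : codes(sampleTree()).entrySet()) {
            int ch = entry.getKey();
            System.out.println((char) ch + " " + entry.getValue());
        }
        // prints:
        // a 0
        // b 10
        // c 11
    }
}
```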

I went through a detailed example that I won't reproduce here because there is a similar one in the assignment writeup.

In the first part of the assignment, you are responsible for building up a Huffman tree given an array of frequencies and printing out the codes for each character in the tree. In the second part of the assignment, you have to reconstruct the tree from the code file. For this second part of the assignment, the frequencies don't matter. The frequencies are only used in constructing the tree. That's why the instructions say for the second part that you can use frequencies like 0 or -1 when you reconstruct the nodes.

One problem we run into is trying to create compact files of 0's and 1's. If we write them as characters, each one will be stored in the standard 8-bit format. Since the Huffman algorithm gets about a 50% reduction at best, that's not going to work very well: we'd have a multiplier of 0.5 from the compression and a multiplier of 8 because we are storing each bit as an 8-bit character. That means we'd turn a file of n characters into a file of 0.5 * 8 * n, or 4n characters. In other words, our compression would quadruple the size of the file, which isn't very impressive.
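The arithmetic above can be checked with a few lines of code. The method names and the sample size of 1000 characters are assumptions for illustration; the 0.5 ratio is the rough best case mentioned in the notes.

```java
// Illustrative arithmetic only: compares writing each encoded bit as a
// character against packing eight encoded bits into each byte.
class SizeSketch {
    // Number of bits in the encoded file, given the compression ratio.
    static long encodedBits(long originalChars, double ratio) {
        return Math.round(originalChars * 8 * ratio);
    }

    // If every bit is written as a '0' or '1' character, each bit costs one byte.
    static long bytesIfBitsWrittenAsChars(long originalChars, double ratio) {
        return encodedBits(originalChars, ratio);
    }

    // If bits are packed, eight of them share a byte (rounding up).
    static long bytesIfBitsPacked(long originalChars, double ratio) {
        return (encodedBits(originalChars, ratio) + 7) / 8;
    }

    public static void main(String[] args) {
        System.out.println(bytesIfBitsWrittenAsChars(1000, 0.5));  // 4000, which is 4n
        System.out.println(bytesIfBitsPacked(1000, 0.5));          // 500, which is n/2
    }
}
```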

To solve this problem, I have written two classes called BitOutputStream and BitInputStream that write and read a series of bits in a compact manner. The Encode program uses BitOutputStream to produce the encoded binary file. The Decode program opens this file as a BitInputStream and passes it to your HuffmanTree to have it do the actual decoding. These classes are truly minimal classes that have only three public methods each. BitOutputStream has a constructor, a method called writeBit and a method called close. BitInputStream has a constructor, a method called readBit and a method called close.

The only method you'll have to worry about is the readBit method of the BitInputStream class. The Decode program constructs the BitInputStream and also closes it. In between, it passes the stream to your HuffmanTree when it calls the method that decodes the file.

The operation you perform repeatedly is to go to the top of your tree and to read bits from the input file, going left or right in the tree depending upon whether you see a 0 or 1 in the input stream. When you hit a leaf, you know that you've found the next character from the original file and you write it to the PrintStream object you've been passed. Then you go back to the top of the tree and descend again until you hit a leaf and you print that character. Then go back to the top of the tree and start all over.

I then talked about a subtle point for the assignment. When we go to use these codes to compress a file, we have to write a series of bits to an output stream in a compact format. I have written a class called BitOutputStream that does so. It has a significant limitation. The number of bits it writes will always be a multiple of 8. For example, suppose that you write a total of 8005 bits to one of these output streams. The actual number of bits written will be 8008. Your output will be "padded" with three 0's at the end. That's because the underlying input/output mechanisms are all based on bytes. You can't write part of a byte to a file.
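The padding in the 8005-bit example is just rounding up to the next multiple of 8. Here is a short sketch of that calculation; the class and method names are made up for illustration and are not part of BitOutputStream.

```java
// Illustrative arithmetic only; BitOutputStream itself is not shown here.
class PaddingSketch {
    // Rounds the bit count up to the next multiple of 8.
    static int paddedBits(int bitsWritten) {
        return (bitsWritten + 7) / 8 * 8;
    }

    // Number of extra 0's appended to fill out the last byte.
    static int paddingZeros(int bitsWritten) {
        return paddedBits(bitsWritten) - bitsWritten;
    }

    public static void main(String[] args) {
        System.out.println(paddedBits(8005));    // 8008
        System.out.println(paddingZeros(8005));  // 3
        System.out.println(paddingZeros(8000));  // 0, already a multiple of 8
    }
}
```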

This limitation of BitOutputStream causes a potential problem for our compression algorithm. Consider the case where we had written 8005 bits to the output stream. When we read it back in, we'll get those 8005 bits plus we'll get 3 extra 0's at the end. What if the code "0" represents a letter like "e"? Then those 3 extra 0's will look like 3 e's.

To get around this problem, we introduce a "fake" character that we refer to as the pseudo-eof character. We make up a character that doesn't actually exist and we write it to the output stream after the actual characters. That way, when we read the file back in, we'll know when to stop reading. That means that the "multiple of 8" limitation of BitOutputStream won't be a problem for us because we have a special signal to let us know when to stop reading the file.

For our purposes, we'll use an integer value one higher than the highest character code we've been asked to work with. In our case, we're dealing with character codes 0 through 255, so we'll use 256 as the code for the pseudo-eof character. You shouldn't include the actual value 256 in your code. Your code should be flexible enough that we could use a different maximum value. You can use the array length to determine this maximum value.

For the first part of the assignment, the only place this enters into things is that you have to manually add this character to the initial set of leaves for the Huffman algorithm. You're given the frequencies of each of the real characters from the input file and you will make a leaf node for each character with a nonzero count. You should also make a leaf node for the pseudo-eof character and give it a frequency of 1 since it will appear exactly once at the end of the file.
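Here is a minimal sketch of setting up the initial leaves, including the pseudo-eof leaf. To keep it short, each leaf is represented as a two-element array of {character code, frequency}, which is an assumption of this example; a real solution would build node objects instead. Note that the pseudo-eof value comes from the array length rather than a hard-coded 256.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: leaves are {character code, frequency} pairs here.
class LeafSketch {
    static List<int[]> initialLeaves(int[] counts) {
        List<int[]> leaves = new ArrayList<>();
        // One leaf per character that actually appears in the file.
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                leaves.add(new int[] {i, counts[i]});
            }
        }
        // The pseudo-eof code is one higher than the highest real character
        // code, which is the same as the length of the counts array.
        int pseudoEof = counts.length;
        leaves.add(new int[] {pseudoEof, 1});  // appears exactly once
        return leaves;
    }

    public static void main(String[] args) {
        int[] counts = new int[256];
        counts['a'] = 5;
        counts['z'] = 2;
        for (int[] leaf : initialLeaves(counts)) {
            System.out.println(leaf[0] + " " + leaf[1]);
        }
        // prints:
        // 97 5
        // 122 2
        // 256 1
    }
}
```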

In decoding the file, you'll have to know when to stop processing. That's where the pseudo-eof character comes in. The Encode program writes the characters of the original file to the bit stream and then it writes the code for the pseudo-eof character. So as you are processing characters, eventually you will come across this eof character. When you do, you should stop decoding. You should not write this character to the PrintStream because it is not an actual character from the original file. It's a fictitious character that we made up to signal the end of the input.
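The decoding loop, with the pseudo-eof check, might look something like the sketch below. Everything here is an assumption for illustration: the node class is hypothetical, StringBitSource is a stand-in I made up that serves bits from a string (it is not the course's BitInputStream, although it mimics a readBit method), and the sample tree hard-codes 256 only to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Hypothetical node type for illustration only.
class DecodeNode {
    final int character;   // character code; -1 for branch nodes
    final DecodeNode left;
    final DecodeNode right;

    DecodeNode(int character) {
        this(character, null, null);
    }

    DecodeNode(int character, DecodeNode left, DecodeNode right) {
        this.character = character;
        this.left = left;
        this.right = right;
    }

    boolean isLeaf() {
        return left == null && right == null;
    }
}

// Made-up stand-in for a bit stream: serves bits from a string of '0' and '1'.
class StringBitSource {
    private final String bits;
    private int index = 0;

    StringBitSource(String bits) {
        this.bits = bits;
    }

    int readBit() {
        return bits.charAt(index++) - '0';
    }
}

class DecodeSketch {
    // Walks the tree one bit at a time; stops at the pseudo-eof leaf
    // without writing it to the output.
    static void decode(DecodeNode root, StringBitSource input, PrintStream output, int eof) {
        DecodeNode current = root;
        while (true) {
            current = (input.readBit() == 0) ? current.left : current.right;
            if (current.isLeaf()) {
                if (current.character == eof) {
                    return;
                }
                output.write(current.character);
                current = root;  // back to the top for the next character
            }
        }
    }

    // Sample tree: 'e' is 0, 'a' is 10, pseudo-eof is 11.
    static String decodeToString(String bits) {
        int eof = 256;  // hard-coded only for this small example
        DecodeNode rightBranch = new DecodeNode(-1, new DecodeNode('a'), new DecodeNode(eof));
        DecodeNode root = new DecodeNode(-1, new DecodeNode('e'), rightBranch);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream output = new PrintStream(buffer);
        decode(root, new StringBitSource(bits), output, eof);
        output.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        // "eae" is 0 10 0, then 11 for pseudo-eof, then two padding 0's.
        System.out.println(decodeToString("01001100"));  // prints eae
    }
}
```

In this example the two padding 0's at the end are never read. Without the pseudo-eof check they would have decoded as two extra e's, which is exactly the problem described earlier.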

Then I switched to the computer. I typed in the data for the example we worked out in lecture on the overhead and showed that it constructed the same code.

Then I spent a few minutes demonstrating the execution of the various programs involved in the next programming assignment. I mentioned that I have a short data file called short.txt and a long data file called hamlet.txt (the full text of the play). I opened a terminal window on my Macintosh so that we could give commands to see information about files being created by the programs. On my Mac, I gave commands like this:

        ls -l hamlet.*

You can do the same on a Windows machine by opening a command window and using the dir command:

        dir hamlet.*

We ran each program using hamlet.txt:

First we ran the MakeCode program on hamlet.txt to make a file called hamlet.code that contained an encoding scheme for the file.
Then we ran the Encode program that took hamlet.txt and hamlet.code to produce a compressed file called hamlet.short. This is a binary file like a zip file, meaning that it is stored with a different encoding scheme than the normal ASCII text files that appear on the system. When we examined the contents of the file, it looked like gibberish.
Then we ran the Decode program that took hamlet.short and hamlet.code to produce a new file that we called hamlet.new. This is supposed to be the "inflated" version of the binary file, which means it should be exactly the same as hamlet.txt. Using the "ls -l" command on my Mac, we could see that it had the same byte length as the original. We also examined it to see that it looked just like the original.

We found that it reduced the file size from around 200 thousand characters to around 110 thousand characters (roughly cutting the file size in half). This isn't as good as the standard zip compression, but it's pretty good given the fact that the algorithm is relatively simple.

Stuart Reges

Last modified: Wed May 25 15:37:34 PDT 2011