Project Option: Compression with LZW and Huffman Encoding

As hard drives get bigger and cheaper, file sizes grow, and computer users get lazier about cleaning up their files. Yes, no matter what happens, we always want to keep more files on our hard drives.

Also, with the advent of the internet, we're often downloading files. For those of us with slower modems, large files are a real pain, because waiting is a hassle.

Probably most of us have used a file compression program like Zip, gzip (or GIF and JPEG for images) in order to fit more files in the available space.

There are two types of compression: lossy and lossless. Lossy compression shrinks a file by throwing some information away (this is what JPEG does with fine image detail); lossless compression shrinks the file without losing anything.

For this project, you will be trying out a common technique for lossless compression. Lossless compression can work on any file type, because it never loses any information. Plus, it has some pretty cool data structures.

Compression takes advantage of the fact that most files have tons of patterns in them: program source files have keywords, variable names, and syntactic constructs repeated over and over again, and in English text the letter 'e' is far more common than the letter 'x'. Or another pattern: the letter 'q' is rarely followed by anything other than the letter 'u'.

Compression programs can exploit these patterns by encoding them efficiently. This approach won't work if a file is truly random, but how many people have files filled with truly random data? Those would be pretty useless.

LZW combined with Huffman Encoding is a lossless compression technique that exploits both repeated sequences and the fact that some things occur more frequently than others (the meaning of this sentence will probably become clear as you read on).

Yes, for this project, you'll be implementing this technique to compress files for the good of humanity.

A bit about LZW Compression

LZW compression is based on repeated sequences of letters (not necessarily words). It keeps a dictionary of recent sequences it's seen and assigns each a small numeric code. Typically, LZW will store up to 4096 sequences, numbered 0-4095 (which takes 12 bits). If one of these sequences occurs again, it writes the 12-bit code instead of the whole sequence. In practice, this makes the file smaller.

Did you notice that it maintains "a dictionary of recent sequences"? You might be able to think of some data structures that would be good to use for this...
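
To make the loop concrete, here's a minimal sketch of an LZW encoder in Java. It assumes 12-bit codes and uses a HashMap keyed on Strings purely for readability; that is not a claim about which dictionary structure you should pick, and the class and method names are illustrative:

    import java.util.*;

    public class LzwSketch {
        static final int MAX_CODES = 4096;          // 12-bit codes: 0 through 4095

        // Returns the list of 12-bit codes; packing them into actual bits
        // is left out to keep the sketch short.
        static List<Integer> encode(byte[] input) {
            Map<String, Integer> dict = new HashMap<>();
            for (int i = 0; i < 256; i++)           // codes 0-255: single bytes
                dict.put(String.valueOf((char) i), i);
            int nextCode = 256;

            List<Integer> output = new ArrayList<>();
            String current = "";
            for (byte b : input) {
                String extended = current + (char) (b & 0xFF);
                if (dict.containsKey(extended)) {
                    current = extended;             // keep extending the match
                } else {
                    output.add(dict.get(current));  // emit the longest known match
                    if (nextCode < MAX_CODES)       // grow until the dictionary is full
                        dict.put(extended, nextCode++);
                    current = String.valueOf((char) (b & 0xFF));
                }
            }
            if (!current.isEmpty())
                output.add(dict.get(current));      // flush the final match
            return output;
        }
    }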

A bit about Huffman Encoding

Suppose you had a file like the following:
aaaaaaaaaaaaaaaaaTaaaaaaaaaaaaaaaaQaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaYaaaa

This file has lots of a's. If you wrote the file in ASCII, it'd take 8 bits per character. But wouldn't it be nice if there were some way of writing a's with fewer bits, even if it meant that other characters took more bits? There are so many a's that maybe this would result in a smaller overall file. Do you think?
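
To put some numbers on it: suppose a file like the one above has, say, 68 a's plus one T, one Q, and one Y, for 71 characters total. Plain ASCII takes 71 × 8 = 568 bits. A code that writes 'a' in 1 bit and the three rare letters in 2 or 3 bits each takes only about 68 + 2 + 3 + 3 = 76 bits.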

David Huffman thought so. He designed a data structure called the "Huffman Tree", which allows you to write frequently occurring letters with fewer bits, and will often save space. Huffman Encoding uses Huffman Trees to compress files.

With a bit of abstraction, you can use Huffman Encoding to improve the results of LZW compression (in fact, most people do this). Remember how LZW compression writes out 12-bit numbers? If some sequences repeat more than others, then some of these 12-bit numbers will be more frequent than others. Huffman Trees can exploit this fact to make the files even smaller.
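
As a rough sketch of how such a tree is built (again in Java, assuming the frequencies of the 12-bit codes have already been counted; the names here are illustrative, not required design):

    import java.util.PriorityQueue;

    public class HuffmanSketch {
        static class Node implements Comparable<Node> {
            int symbol;      // an LZW code (0-4095), or -1 for internal nodes
            long weight;     // how often the symbol occurred
            Node left, right;

            Node(int symbol, long weight) { this.symbol = symbol; this.weight = weight; }
            Node(Node l, Node r) { this(-1, l.weight + r.weight); left = l; right = r; }

            public int compareTo(Node other) { return Long.compare(weight, other.weight); }
        }

        // freq[code] = number of times each 12-bit LZW code appeared.
        static Node buildTree(long[] freq) {
            PriorityQueue<Node> pq = new PriorityQueue<>();
            for (int code = 0; code < freq.length; code++)
                if (freq[code] > 0)
                    pq.add(new Node(code, freq[code]));
            // Repeatedly merge the two lightest trees; the survivor is the root.
            while (pq.size() > 1)
                pq.add(new Node(pq.poll(), pq.poll()));
            return pq.poll();    // null only if the input was empty
        }
    }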

Detail on Huffman Encoding

Once you start looking around, you'll see that there's "static" Huffman Encoding and "dynamic" (or "adaptive") Huffman Encoding. For this project, you only need to do static encoding (which is probably easier), though you certainly can do adaptive encoding if you like.

Resources on LZW compression and Huffman Encoding

Our textbook doesn't have anything on LZW compression, but it does have a discussion of how to build a Huffman Tree (pp. 395-401). Fortunately, there are also a number of well-written introductions to these compression techniques on the web:

Of course, the 326 staff is also a resource to help you understand these algorithms and data structures.

Assignment

For this project option, you will need to learn LZW compression and Huffman Encoding. Then, write a program that will process any file by first applying LZW, and then using Huffman Encoding on the LZW output. The program should work on any file, but as noted above, it won't be able to compress every file.

You'll also need to write a companion program that decodes the compressed file to recover the original.
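
Decoding LZW is close to a mirror image of encoding, with one famous wrinkle: the encoder can emit a code one step before the decoder has defined it. Here's a hedged sketch, using the same illustrative String-based dictionary as the encoder sketch above:

    import java.io.ByteArrayOutputStream;
    import java.util.List;

    public class LzwDecodeSketch {
        static byte[] decode(List<Integer> codes) {  // assumes a non-empty code list
            String[] dict = new String[4096];
            for (int i = 0; i < 256; i++)            // codes 0-255: single bytes
                dict[i] = String.valueOf((char) i);
            int nextCode = 256;

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            String previous = dict[codes.get(0)];    // first code is always a single byte
            out.write(previous.charAt(0));

            for (int i = 1; i < codes.size(); i++) {
                int code = codes.get(i);
                // A code the decoder hasn't defined yet is always the previous
                // sequence extended by its own first character.
                String entry = (dict[code] != null) ? dict[code]
                                                    : previous + previous.charAt(0);
                for (int j = 0; j < entry.length(); j++)
                    out.write(entry.charAt(j));
                if (nextCode < 4096)                 // mirror the encoder's growth
                    dict[nextCode++] = previous + entry.charAt(0);
                previous = entry;
            }
            return out.toByteArray();
        }
    }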

You may use any priority queue or dictionary code you like (either your own previous code, new code, or code from a library), but the code for the compression algorithms themselves should be your own.

Try your compression program on a variety of files, and see how well it works. Try at least some GIF files. Why doesn't it work so well on those?

There are some decisions you'll need to make. For example, in static Huffman Encoding, you need to store your Huffman Tree in the compressed file so the decoder can rebuild it. There are many different ways of storing the tree; how efficient is your method?
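
One possibility is a preorder walk of the tree, sketched below. The sketch assumes the Node class from the Huffman sketch above, and BitOutput is a hypothetical stand-in for whatever bit-level writer you build (it is not a library class):

    // 0 bit = internal node, followed by its two subtrees in preorder;
    // 1 bit = leaf, followed by the 12-bit LZW code stored at that leaf.
    static void writeTree(Node node, BitOutput bits) {
        if (node.left == null) {                 // leaf node
            bits.writeBit(1);
            bits.writeBits(node.symbol, 12);     // hypothetical 12-bit write
        } else {                                 // internal node
            bits.writeBit(0);
            writeTree(node.left, bits);
            writeTree(node.right, bits);
        }
    }

Read back with the mirror-image recursion, this scheme costs one bit per internal node plus thirteen bits per leaf, which is hard to beat without considerably more cleverness.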

Also, you clearly want good performance from your LZW dictionary, but which data structure is best for it? You'll need to come up with a good choice and implement it (note that you're not allowed to say "Ummm, reducing coding time is a priority for me, so I'll use arrays/linked lists").

You may also think of tweaks to the algorithms that will make the encoder/decoder run faster, or make the compression better. If so, implement these ideas and compare. But this is not required.

Finally, one of the web resources is a link about bogus compression programs. Be prepared to show that your program is not an example of one. You may need to run the decoder under the instructor's or a TA's computer account, to show that your program is really doing its job.