Project Option: Compression (lossy and non-lossy)

Overview

As hard drives get bigger and cheaper, file sizes grow, and computer users get lazier about cleaning up their files. Yes, no matter what happens, we always want to keep more files on our hard drives. Also, with the advent of the Internet, we're often downloading files. For those of us with slow modems, it's a real pain if files are large.

Most of us have probably used a file compression program like Zip or gzip (or GIF and JPEG for images, and MPEG variants (e.g. DivX) or QuickTime for video) in order to fit more files in the available space. In this project, you will implement at least one of the types of compression listed above.

Background

There are two types of compression: loss-less, where the decompressed file is bit-for-bit identical to the original (Zip, gzip, GIF), and lossy, where some information is deliberately thrown away in exchange for better compression (JPEG, MPEG).

Assignment

For this project option, you will need to learn and implement Huffman encoding as well as at least one other compression technique, and you will need to implement both an encoder and a companion decoder. Make sure that you specify which algorithms you will try when you submit your proposal. You are not restricted to JPEG and LZW; just make sure you write up what you are thinking in your proposal.

The main required data structure is the Huffman tree. This is a binary tree which is built bottom up (kind of like an uptree for union-find). Information on the Huffman tree (as well as the LZW algorithm, JPEG, and other compression topics) can be found in the Resources section.
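
To make that bottom-up construction concrete, here is a minimal Java sketch (the class and method names are just illustrative, not anything the assignment dictates). It uses java.util.PriorityQueue to repeatedly pull out and merge the two lowest-frequency nodes until only the root remains:

    import java.util.PriorityQueue;

    // Illustrative Huffman tree node: a leaf holds a symbol, an internal node
    // holds the combined frequency of everything underneath it.
    class HuffmanNode implements Comparable<HuffmanNode> {
        int frequency;
        char symbol;                  // meaningful only for leaves
        HuffmanNode left, right;

        HuffmanNode(char symbol, int frequency) {            // leaf
            this.symbol = symbol;
            this.frequency = frequency;
        }

        HuffmanNode(HuffmanNode left, HuffmanNode right) {   // internal node
            this.left = left;
            this.right = right;
            this.frequency = left.frequency + right.frequency;
        }

        public int compareTo(HuffmanNode other) {
            return Integer.compare(frequency, other.frequency);
        }
    }

    class HuffmanBuilder {
        // Build the tree bottom up: repeatedly merge the two least frequent nodes.
        // counts[c] is the number of times byte value c appears in the input.
        static HuffmanNode build(int[] counts) {
            PriorityQueue<HuffmanNode> queue = new PriorityQueue<>();
            for (int c = 0; c < counts.length; c++) {
                if (counts[c] > 0) {
                    queue.add(new HuffmanNode((char) c, counts[c]));
                }
            }
            while (queue.size() > 1) {
                HuffmanNode a = queue.poll();   // smallest frequency
                HuffmanNode b = queue.poll();   // next smallest
                queue.add(new HuffmanNode(a, b));
            }
            return queue.poll();                // the root (null for an empty input)
        }
    }

With a binary heap backing the priority queue, building the tree takes O(n log n) time for n distinct symbols; reading the codes back out is just a traversal that records the path (left = 0, right = 1) to each leaf.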

Before delving into writing your own hash tables, heaps, sort algorithms, or other such stuff, make sure you check your language's libraries (C++: STL, Java: JFC) for implementations that you can use. You may use any data structure you like (either your own previous code, new code, or code from a library), but the code for the compression algorithms themselves should be your own. Basically, use your common sense and you'll be fine.

There may be some decisions you'll need to make. For example, in static Huffman Encoding, you need to store your Huffman Tree in the resulting compressed file to be able to decode it later. There are many different ways of storing the trees; how efficient is your method? As there are many optimizations for these algorithms, you should try to make yours as fast as possible. Remember, algorithmic improvements will usually yield the largest speed-ups, but implementation smartness is also an issue (don't pass copies of huge data structures around all the time). Try to balance the two.
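
For example, one simple (though not necessarily the most compact) way to store a static Huffman tree is a pre-order traversal that writes one bit to distinguish internal nodes from leaves, plus 8 bits for each leaf's symbol. Here is a rough Java sketch along those lines, reusing the illustrative HuffmanNode above and using a StringBuilder of '0'/'1' characters as a stand-in for a real bit stream:

    // Write the tree in pre-order: '0' means internal node (its two children
    // follow), '1' means leaf (the 8 bits of its symbol follow).
    class TreeSerializer {
        static void write(HuffmanNode node, StringBuilder bits) {
            if (node.left == null && node.right == null) {      // leaf
                bits.append('1');
                for (int i = 7; i >= 0; i--) {
                    bits.append(((node.symbol >> i) & 1) == 1 ? '1' : '0');
                }
            } else {                                             // internal node
                bits.append('0');
                write(node.left, bits);
                write(node.right, bits);
            }
        }
    }

This costs 9 bits per leaf plus 1 bit per internal node; comparing that against, say, storing a table of frequency counts and rebuilding the tree on the decoding side would make a nice point in your write-up.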

When you've gotten your program working, try your compression program on a variety of data and see how well it works. Try at least some GIF files (in the case of JPEG, try compressing GIFs that have been converted to bitmaps). How well does this work, and why?

If you implement extensions or speed-ups, do a comparison of your program with and without your speed-up/extension. Be sure to document this stuff in your write-up.

Compression briefs

Loss-less compression algorithm generalities

Loss-less compression takes advantage of the fact that most files have tons of patterns in them: program source files have keywords, variable names, and syntactic constructs repeated over and over again, and in English text the letter 'e' is far more common than the letter 'x'. Or another pattern: the letter 'q' is rarely followed by anything other than the letter 'u'.

Compression programs can exploit certain patterns by encoding those patterns efficiently. This approach won't work if the files are random, but how many people have files filled with truly random data? That'd be pretty useless.

A bit about Huffman Encoding

Suppose you had a file like the following:
aaaaaaaaaaaaaaaaaTaaaaaaaaaaaaaaaaQaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaYaaaa

This file has lots of a's. If you wrote the file in ASCII, it'd take 8 bits per character. But wouldn't it be nice if there were some way of writing a's with fewer bits, even if it meant that other characters would take more bits? There are so many a's that maybe this would result in a smaller overall file. What do you think?

David Huffman thought so. He designed a data structure called the "Huffman Tree", which allows you to write frequently occurring letters with fewer bits and will often save space. Huffman Encoding uses Huffman Trees to compress files.
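
To put some made-up numbers on it: suppose the file above held 68 a's plus one T, one Q, and one Y. Plain 8-bit ASCII would take 71 * 8 = 568 bits. A Huffman code could assign a = 0, Y = 10, Q = 110, T = 111, so the character data would take only 68 + 2 + 3 + 3 = 76 bits (plus whatever space it takes to store the code itself so the decoder can undo it).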

Once you start looking at things, you'll see that there's "static" Huffman Encoding and "dynamic" or "adaptive" Huffman Encoding. For this project, you only need to do static encoding (which is probably easier). However, you certainly can do adaptive encoding.

A bit about LZW Compression

LZW combined with Huffman Encoding is a loss-less compression technique that exploits both repeated sequences and the fact that some things occur more frequently than others (the meaning of this sentence will probably take a while to sink in).

LZW compression is based on repeated sequences of letters (not necessarily words). It keeps a dictionary of recent sequences it's seen and assigns each of them a small numeric code. Typically, LZW will store up to 4096 sequences, numbered 0-4095 (which takes 12 bits). If one of these sequences occurs again, it writes the 12-bit number instead of the whole sequence. In practice, this makes the file smaller.

Did you notice that it maintains "a dictionary of recent sequences"? You might be able to think of some data structures that would be good to use for this...
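
As a concrete illustration, here is a minimal Java sketch of the LZW encoding loop. The class name and details are just illustrative, and it cheats in two ways: it uses Strings as dictionary keys (a trie would be the classier choice), and it returns the codes as a list of ints instead of packing them into a real 12-bit output stream.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class LzwEncoder {
        static final int MAX_CODES = 4096;           // 12-bit codes: 0-4095

        static List<Integer> encode(byte[] input) {
            // Start the dictionary with every single byte as codes 0-255.
            Map<String, Integer> dictionary = new HashMap<>();
            for (int i = 0; i < 256; i++) {
                dictionary.put("" + (char) i, i);
            }
            int nextCode = 256;

            List<Integer> output = new ArrayList<>();
            String current = "";
            for (byte b : input) {
                String extended = current + (char) (b & 0xFF);
                if (dictionary.containsKey(extended)) {
                    current = extended;                       // keep growing the match
                } else {
                    output.add(dictionary.get(current));      // emit the longest known match
                    if (nextCode < MAX_CODES) {               // remember the new sequence
                        dictionary.put(extended, nextCode++);
                    }
                    current = "" + (char) (b & 0xFF);
                }
            }
            if (!current.isEmpty()) {
                output.add(dictionary.get(current));          // flush the final match
            }
            return output;
        }
    }

One nice property: the decoder can rebuild exactly the same dictionary as it reads the codes, so the dictionary never has to be stored in the compressed file.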

With a bit of abstraction, you can use Huffman Encoding to improve the results of LZW compression (in fact, most people do this). Remember how LZW compression writes out 12-bit numbers? If some words repeat more than others, then some of these 12-bit numbers will be more frequent than others. Huffman Trees can exploit this fact, to make the files even smaller.

JPEG encoding

So it sounds like Huffman and LZW already do a fairly good job of the pattern-matching thing, and it would be hard to find any major improvements on these two schemes while remaining loss-less. However, if we drop the requirement of loss-less compression, we can get some big improvements: loss-less compression usually yields around 2:1 compression or less for binaries, while JPEG can get up to 20:1 without much noticeable degradation in quality.

The basic idea behind JPEG is to make it easier for something like Huffman encoding to compress well, by changing the input data so that it has more patterns. It does this by dropping information (oh no!), hence its lossy nature.

JPEG does its simplification by taking advantage of the way our eyes perceive images. First off, our eyes are much, much more sensitive to changes in brightness than to changes in color, so the first thing JPEG does is drop some of the color data. Next, it exploits the fact that neighboring pixels are related to one another: it takes the pixels of the image and re-encodes them into a form where the variance in the data can be reduced (this is done by a process called the Discrete Cosine Transform... think of it as a really fancy version of averaging). After this point, our data has been nicely polished and is ready to be further compressed by Huffman or its likes. Pretty SPIFFy eh? MPEG works similarly, except it also optimizes for movement between frames.
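
If you go the JPEG route, the Discrete Cosine Transform itself is not much code. Here is a rough Java sketch of the standard forward DCT on one 8x8 block, written as the direct (slow) formula rather than the fast factored version real encoders use; the block is assumed to hold pixel values already shifted to the range -128..127.

    class Dct8x8 {
        // Direct 2-D DCT of an 8x8 block; coeffs[0][0] is proportional to the
        // block's average brightness, and most of the remaining energy lands
        // in the low-frequency (upper-left) coefficients.
        static double[][] forward(double[][] block) {
            double[][] coeffs = new double[8][8];
            for (int u = 0; u < 8; u++) {
                for (int v = 0; v < 8; v++) {
                    double sum = 0.0;
                    for (int x = 0; x < 8; x++) {
                        for (int y = 0; y < 8; y++) {
                            sum += block[x][y]
                                 * Math.cos((2 * x + 1) * u * Math.PI / 16)
                                 * Math.cos((2 * y + 1) * v * Math.PI / 16);
                        }
                    }
                    double cu = (u == 0) ? 1 / Math.sqrt(2) : 1.0;
                    double cv = (v == 0) ? 1 / Math.sqrt(2) : 1.0;
                    coeffs[u][v] = 0.25 * cu * cv * sum;
                }
            }
            return coeffs;
        }
    }

The lossy step comes right after this: JPEG divides each coefficient by an entry from a quantization table and rounds to an integer. That rounding throws information away and leaves long runs of zeros, which is exactly the kind of pattern a Huffman-style entropy coder squeezes very well.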

Resources

This list will grow as I find more resources, so check back. Our textbook doesn't have anything on LZW compression or JPEG, but it does have a discussion of how to build a Huffman Tree (pp. 357-362 in the Java edition, pp. 395-401 in the C++ edition). Fortunately, there are also a number of well-written introductions to these compression techniques on the web:

Of course, the 326 staff is also a resource to help you understand these algorithms and data structures.

Extensions:

For extensions, you pretty much have carte blanche. Here are a few ideas; they are by no means the only things you can do as extensions.