Probably most of us have used a file compression program like Zip or gzip (or GIF and JPEG for images, and MPEG variants such as DivX or QuickTime for video) in order to fit more files in the available space. In this project, you will implement at least one of the types of compression listed above.
For some kinds of files (programs, spreadsheets, most documents), compression has to be loss-less: you need to get back exactly the bits you put in. But this is not true for other kinds of files. For example, the human eye cannot detect certain kinds of changes in images, so you can lose some detail in an image (compressing it much more) and no one will care. In some cases you might notice it, but maybe the degradation in the image is not such a big deal.
The main data structure that the requirements call for is the Huffman tree. This is a binary tree which is built bottom up (kind of like an uptree for union-find). Information on the Huffman tree (as well as the LZW algorithm, JPEG, and other compression topics) can be found in the Resources section.
Before delving into writing your own hash tables, heaps, sort algorithms, or other such stuff, make sure you check your language's libraries (C++: STL, Java: JFC) for implementations that you can use. You may use any data structure you like (either your own previous code, new code, or code from a library), but the code for the compression algorithms themselves should be your own. Basically, use your common sense and you'll be fine.
There may be some decisions you'll need to make. For example, in static Huffman Encoding, you need to store your Huffman Tree in the resulting compressed file to be able to decode it later. There are many different ways of storing the trees; how efficient is your method? As there are many optimizations for these algorithms, you should try to make yours as fast as possible. Remember, algorithmic improvements will usually yield the largest speed-ups, but implementation smartness is also an issue (don't pass copies of huge data structures around all the time). Try to balance the two.
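On the question of storing the tree, one common (though by no means the only) approach is a pre-order traversal that marks whether each node is a leaf or an internal node. Here is a rough sketch in Java; the Node class and method names are made up for illustration, and a real program would write packed bits rather than characters:

    import java.util.*;

    class Node {
        char symbol;        // meaningful only for leaves
        Node left, right;   // both null for leaves
        Node(char symbol) { this.symbol = symbol; }
        Node(Node left, Node right) { this.left = left; this.right = right; }
        boolean isLeaf() { return left == null && right == null; }
    }

    class TreeIO {
        // Pre-order: '1' followed by the symbol for a leaf, '0' followed by
        // both subtrees for an internal node.
        static void write(Node n, StringBuilder out) {
            if (n.isLeaf()) {
                out.append('1').append(n.symbol);
            } else {
                out.append('0');
                write(n.left, out);
                write(n.right, out);
            }
        }

        // Reads the same format back; the iterator walks the stored tree data.
        static Node read(Iterator<Character> in) {
            if (in.next() == '1') return new Node(in.next());
            Node left = read(in);
            Node right = read(in);
            return new Node(left, right);
        }
    }

Stored this way, the tree costs roughly one bit per node plus one byte per distinct character. Another option is to store the character counts and rebuild the tree when decoding, as long as you break ties the same way on both ends.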
When you've gotten your program working, try it on a variety of data and see how well it works. Try at least some GIF files (if you're doing JPEG, try compressing GIFs that have been converted to bitmaps). How well does this work, and why?
If you implement extensions or speed-ups, do a comparison of your program with and without your speed-up/extension. Be sure to document this stuff in your write-up.
Loss-less compression takes advantage of the fact that most files have tons of patterns in them: program source files have keywords, variable names, and syntactic constructs repeated over and over again, and in English text the letter 'e' is far more common than the letter 'x'. Or another pattern: the letter 'q' is rarely followed by anything other than the letter 'u'.
Compression programs exploit these patterns by encoding them efficiently. This approach won't work if the files are random, but how many people have files filled with truly random data? That'd be pretty useless. For example, consider a file that looks like this:
aaaaaaaaaaaaaaaaaTaaaaaaaaaaaaaaaaQaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaYaaaa
This file has lots of a's. If you wrote the file in ASCII, it'd take 8 bits per character. But wouldn't it be nice if there were some way of writing a's with fewer bits, even if it meant that other characters would take more bits? There are so many a's that maybe this would result in a smaller overall file. What do you think?
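For instance, suppose a 72-character file contains 69 a's and 3 other letters. Plain ASCII costs 72 × 8 = 576 bits. A made-up code that writes 'a' as the single bit 0, and any other character as a 1 followed by its usual 8 bits, costs only 69 × 1 + 3 × 9 = 96 bits, about a sixth of the size (ignoring, for the moment, the cost of telling the decoder which code you used).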
David Huffman thought so. He designed a data structure called the "Huffman Tree", which allows you to write frequently occurring letters with fewer bits and will often save space. Huffman Encoding uses Huffman Trees to compress files.
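As a rough sketch of the idea (not the required design, and the names here are invented for illustration): start with one leaf per character, weighted by how often that character occurs, then repeatedly merge the two lightest trees until only one is left. In Java, using the library's priority queue, that might look something like this:

    import java.util.*;

    class HuffNode {
        char symbol;             // meaningful only for leaves
        int weight;              // total frequency of the symbols below this node
        HuffNode left, right;    // both null for leaves
        HuffNode(char symbol, int weight) { this.symbol = symbol; this.weight = weight; }
        HuffNode(HuffNode left, HuffNode right) {
            this.left = left; this.right = right;
            this.weight = left.weight + right.weight;
        }
        boolean isLeaf() { return left == null && right == null; }
    }

    class Huffman {
        // Bottom-up build: repeatedly merge the two lightest trees.
        static HuffNode build(Map<Character, Integer> freq) {
            PriorityQueue<HuffNode> heap =
                new PriorityQueue<HuffNode>((a, b) -> Integer.compare(a.weight, b.weight));
            for (Map.Entry<Character, Integer> e : freq.entrySet())
                heap.add(new HuffNode(e.getKey(), e.getValue()));
            while (heap.size() > 1)
                heap.add(new HuffNode(heap.poll(), heap.poll()));
            return heap.poll();          // root of the finished Huffman tree
        }

        // Read off each symbol's bit string by walking the tree: left = 0, right = 1.
        static void codes(HuffNode n, String prefix, Map<Character, String> out) {
            if (n.isLeaf()) { out.put(n.symbol, prefix); return; }
            codes(n.left, prefix + "0", out);
            codes(n.right, prefix + "1", out);
        }
    }

Frequent characters end up near the root (short codes) and rare ones end up deep in the tree (long codes). Note that this sketch doesn't handle the corner case of a file with only one distinct character, which a real encoder would have to.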
Once you start looking at things, you'll see that there's "static" Huffman Encoding and "dynamic" or "adaptive" Huffman Encoding. For this project, you only need to do static encoding (which is probably easier). However, you certainly can do adaptive encoding.
LZW compression is based on repeated sequences of letters (not necessarily words). It keeps a dictionary of recent sequences it's seen and assigns each one a short numeric code. Typically, LZW will store up to 4096 sequences, numbered 0-4095 (which takes 12 bits). If one of these sequences occurs again, it writes the 12-bit number instead of the whole sequence. In practice, this makes the file smaller.
Did you notice that it maintains "a dictionary of recent sequences"? You might be able to think of some data structures that would be good to use for this...
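To make the loop concrete, here is a bare-bones sketch of the compression side in Java. It assumes the classic byte-oriented variant (single characters pre-loaded as codes 0-255, dictionary capped at 4096 entries) and uses a plain HashMap keyed by strings for the dictionary; whether that's the best structure is exactly the question above. Names are invented for illustration:

    import java.util.*;

    class Lzw {
        static List<Integer> compress(String input) {
            // Seed the dictionary with every single character, codes 0-255.
            Map<String, Integer> dict = new HashMap<>();
            for (int c = 0; c < 256; c++) dict.put("" + (char) c, c);
            int nextCode = 256;

            List<Integer> output = new ArrayList<>();
            String current = "";
            for (char c : input.toCharArray()) {
                String extended = current + c;
                if (dict.containsKey(extended)) {
                    current = extended;                    // keep growing the match
                } else {
                    output.add(dict.get(current));         // emit the code for the match so far
                    if (nextCode < 4096)                   // room left in the 12-bit code space?
                        dict.put(extended, nextCode++);
                    current = "" + c;                      // start a new match at this character
                }
            }
            if (!current.isEmpty()) output.add(dict.get(current));
            return output;   // a real program would pack these as 12-bit codes, not 32-bit ints
        }
    }

One nice property of LZW: decompression can rebuild the same dictionary as it reads the codes, so the dictionary itself never has to be stored in the compressed file.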
With a bit of abstraction, you can use Huffman Encoding to improve the results of LZW compression (in fact, most people do this). Remember how LZW compression writes out 12-bit numbers? If some sequences repeat more than others, then some of these 12-bit numbers will be more frequent than others. Huffman Trees can exploit this fact to make the files even smaller.
The basic idea behind JPEG is to make it easier for something like Huffman encoding to do a good job, by changing the input data so that it has more patterns. It does this partly by dropping information (oh no!), hence its lossy nature.
JPEG does its simplification by taking advantage of the way our eyes perceive images. First off, our eyes are much, much more sensitive to changes in brightness than to changes in color, so the first thing JPEG does is drop some of the color data. Next, it uses the fact that neighboring pixels are related to one another: it takes the pixels of the image and re-encodes them in a format where the variation in the data can be reduced (this is done by a process called the Discrete Cosine Transform... think of it as a really fancy version of averaging). After this point, the data is nice and polished and ready to be further compressed by Huffman or its likes. Pretty SPIFFy, eh? MPEG works similarly, except it also optimizes for movement between frames.
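Just to take some of the mystery out of that "fancy averaging", here is a one-dimensional Discrete Cosine Transform on an 8-sample block, sketched in Java. JPEG actually applies a two-dimensional version of this to 8x8 pixel blocks and then quantizes the results (the quantization step is where information actually gets dropped); this is only meant to show the flavor of the transform:

    class Dct {
        // DCT-II on one block of 8 samples (e.g. 8 neighboring pixel brightnesses).
        static double[] dct8(double[] x) {
            int n = x.length;                      // 8 for JPEG-style blocks
            double[] out = new double[n];
            for (int k = 0; k < n; k++) {
                double sum = 0;
                for (int i = 0; i < n; i++)
                    sum += x[i] * Math.cos(Math.PI / n * (i + 0.5) * k);
                double scale = (k == 0) ? Math.sqrt(1.0 / n) : Math.sqrt(2.0 / n);
                out[k] = scale * sum;
            }
            return out;
        }
    }

out[0] is essentially the average brightness of the block; the higher entries measure faster and faster variation, and for smooth image data most of them come out near zero, which is exactly the kind of lopsided distribution that Huffman-style coding loves.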
As an aside, there's also a section on the ridiculous Patents on compression algorithms people have gotten. Some of these patents are essentially for taking an already known compression algorithm, and applying a different dictionary data structure (e.g. hash table instead of linear search).
Another aside is the section on Compressing totally random data. The FAQ shows why this is impossible, and has an amusing story about people who claimed to be able to do it.
Related to the above is a section on Bogus compression programs. These are compression programs that don't even attempt to compress your files, but use certain tricks to make it look like they did. (For this reason, your presentation will involve some testing to make sure you are not trying to pull one over on your loveable staff.)
Of course, the 326 staff is also a resource to help you understand these algorithms and data structures.