Also, with the advent of the internet, we're often downloading files. For those of us with slower modems, large files are a real pain, because waiting is a hassle.
Probably most of us have used a file compression program like Zip, gzip (or GIF and JPEG for images) in order to fit more files in the available space.
There are two types of compression: lossless and lossy. For some kinds of files - a program, for example - losing even a single bit would be a disaster, so lossless compression is the only option. But this is not true for other kinds of files. For example, the human eye cannot detect certain kinds of changes in images. So, you can lose some detail in an image - compressing it much more - and no one will care. In some cases you might notice it, but maybe the degradation in the image is not such a big deal.
For this project, you will be trying out a common technique for lossless compression. Lossless compression can work on any file type, because it never loses any information. Plus, it has some pretty cool data structures.
Compression takes advantage of the fact that most files have tons of patterns in them: program source files have keywords, variable names, and syntactic constructs repeated over and over again, and in English text the letter 'e' is far more common than the letter 'x'. Or another pattern: the letter 'q' is rarely followed by anything other than the letter 'u'.
Compression programs can exploit certain patterns by encoding those patterns efficiently. This approach won't work if the files are random, but how many people have files filled with truly random data? That'd be pretty useless.
LZW combined with Huffman Encoding is a lossless compression technique that exploits both repeated sequences and the fact that some things occur more frequently than others (the meaning of this sentence will probably take a while to sink in).
Yes, for this project, you'll be implementing this technique to compress files for the good of humanity.
Did you notice that LZW maintains "a dictionary of recent sequences"? You might be able to think of some data structures that would be good to use for this...
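To make that concrete, here's a minimal sketch of the LZW encoding loop in Java (the names are made up for illustration, the codes are left as plain ints instead of being packed into 12-bit fields, and the dictionary is just a HashMap keyed by strings - deciding what structure you should really use is part of the project):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LzwEncodeSketch {
        // Encode a byte array into a list of LZW codes (0..4095 when using 12-bit codes).
        public static List<Integer> encode(byte[] input) {
            Map<String, Integer> dict = new HashMap<>();
            for (int i = 0; i < 256; i++) {
                dict.put("" + (char) i, i);        // start with one code per possible byte
            }
            int nextCode = 256;
            List<Integer> output = new ArrayList<>();
            String current = "";
            for (byte b : input) {
                char c = (char) (b & 0xFF);
                String extended = current + c;
                if (dict.containsKey(extended)) {
                    current = extended;            // keep growing the match
                } else {
                    output.add(dict.get(current)); // emit the code for the longest known sequence
                    if (nextCode < 4096) {         // 12-bit codes: stop adding new entries at 4096
                        dict.put(extended, nextCode++);
                    }
                    current = "" + c;              // start a new sequence with this byte
                }
            }
            if (!current.isEmpty()) {
                output.add(dict.get(current));     // flush the last sequence
            }
            return output;
        }
    }

Now for the Huffman half of the story. Imagine a file that looks like this: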
aaaaaaaaaaaaaaaaaTaaaaaaaaaaaaaaaaQaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaYaaaa
This file has lots of a's. If you wrote the file in ASCII, it'd take 8 bits per character. But wouldn't it be nice if there were some way of writing a's with fewer bits - even if it meant that other characters took more bits? There are so many a's that maybe this would result in a smaller overall file. Do you think?
David Huffman thought so. He designed a data structure called the "Huffman Tree", which lets you write frequently occurring letters with fewer bits and will often save space. Huffman Encoding uses Huffman Trees to compress files.
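To give a rough idea of how such a tree gets built (a sketch of one standard way, not the required design): count how often each symbol occurs, then repeatedly merge the two least-frequent subtrees with a priority queue until only one tree remains. Frequent symbols end up near the root, so they get short codes - for the file above, 'a' would end up with a 1-bit code and T, Q, and Y with longer ones.

    import java.util.PriorityQueue;

    public class HuffmanSketch {
        static class Node {
            final int symbol;        // the symbol at a leaf, or -1 for internal nodes
            final long frequency;
            final Node left, right;
            Node(int symbol, long frequency, Node left, Node right) {
                this.symbol = symbol;
                this.frequency = frequency;
                this.left = left;
                this.right = right;
            }
        }

        // Build a Huffman Tree from a table of symbol frequencies.
        public static Node build(long[] freq) {
            PriorityQueue<Node> pq =
                new PriorityQueue<>((a, b) -> Long.compare(a.frequency, b.frequency));
            for (int s = 0; s < freq.length; s++) {
                if (freq[s] > 0) {
                    pq.add(new Node(s, freq[s], null, null));
                }
            }
            while (pq.size() > 1) {
                Node first = pq.poll();            // the two least-frequent subtrees...
                Node second = pq.poll();
                pq.add(new Node(-1, first.frequency + second.frequency, first, second));
            }
            return pq.poll();                      // the root; each leaf's code is its path from the root
        }
    }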
With a bit of abstraction, you can use Huffman Encoding to improve the results of LZW compression (in fact, most people do this). Remember how LZW compression writes out 12-bit numbers? If some words repeat more than others, then some of these 12-bit numbers will be more frequent than others. Huffman Trees can exploit this fact, to make the files even smaller.
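For instance (still just a sketch, reusing the build method above), you could treat each 12-bit LZW code as a symbol in a 4096-symbol alphabet, count how often each code appears in the LZW output, and build a Huffman Tree over those counts:

    // Sketch: treat each 12-bit LZW code as a symbol in a 4096-symbol alphabet
    // and Huffman-code the LZW output itself (reuses HuffmanSketch.build from above).
    public static HuffmanSketch.Node huffmanOverLzw(java.util.List<Integer> lzwCodes) {
        long[] freq = new long[4096];              // one counter per possible 12-bit code
        for (int code : lzwCodes) {
            freq[code]++;
        }
        return HuffmanSketch.build(freq);          // frequent codes end up with short bit strings
    }

Here are some resources you may find useful: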
The Squeeze Page (where'd that name come from?)
LZW Data Compression by Mark Nelson from Dr. Dobb's Journal.
Here's the root of the FAQ.
Here's the discussion on LZW and Huffman Encoding (there's also info on other techniques, which aren't necessary).
As an aside, there's also a section on the ridiculous Patents on compression algorithms people have gotten.
Some of these patents are essentially for taking an already known compression algorithm and applying a different dictionary data structure (e.g. a hash table instead of linear search). This suggests that many of your answers to question #5 (Darth Wolfman & his printers) on the midterm are probably patentable.
Another aside is the section on Compressing totally random data. The FAQ shows why this is impossible, and has an amusing story about people who claimed to be able to do it.
Related to the above is a section on Bogus compression programs. These are compression programs that don't even attempt to compress your files, but do certain tricks to make it look like they did.
(For this reason your presentation will involve some testing to make
sure you are not trying to pull one over on Darth Wolfman. Do not underestimate
the power of the dark side.)
Of course, the 326 staff is also a resource to help you understand these algorithms and data structures.
You'll also need to write a companion program that decodes a compressed file to recover the original.
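For the LZW half, the decoder can rebuild the very same dictionary as it reads codes, so nothing about the dictionary needs to be stored in the file. Here's a sketch in the same style as the encoder sketch above (again ignoring the Huffman layer and the bit packing):

    import java.util.ArrayList;
    import java.util.List;

    public class LzwDecodeSketch {
        // Rebuild the dictionary while reading codes; the output must match the original file.
        public static byte[] decode(List<Integer> codes) {
            List<String> dict = new ArrayList<>();
            for (int i = 0; i < 256; i++) {
                dict.add("" + (char) i);
            }
            StringBuilder out = new StringBuilder();
            String previous = null;
            for (int code : codes) {
                String entry;
                if (code < dict.size()) {
                    entry = dict.get(code);
                } else {
                    // Tricky case: the encoder used a code it had only just created.
                    entry = previous + previous.charAt(0);
                }
                out.append(entry);
                if (previous != null && dict.size() < 4096) {
                    dict.add(previous + entry.charAt(0));   // mirror the encoder's dictionary growth
                }
                previous = entry;
            }
            byte[] bytes = new byte[out.length()];
            for (int i = 0; i < bytes.length; i++) {
                bytes[i] = (byte) out.charAt(i);
            }
            return bytes;
        }
    }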
You may use any priority queue or dictionary code you like (either your own previous code, new code, or code from a library), but the code for the compression algorithms themselves should be your own.
Try your compression program on a variety of files, and see how well it works. Try at least some GIF files. Why doesn't it work so well?
There may be some decisions you'll need to make. For example, in static Huffman Encoding, you need to store your Huffman Tree in a file. There are many different ways of storing the trees; how efficient is your method?
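One simple option (a sketch, and not necessarily the most compact one) is a preorder traversal: write a marker for internal vs. leaf nodes, with each leaf followed by its symbol; the decoder rebuilds the tree by reading the markers back in the same order. The "bits" below are just '0'/'1' characters for clarity - a real implementation would pack actual bits. This reuses the Node class from the earlier Huffman sketch.

    // Sketch: serialize a Huffman Tree in preorder. '0' marks an internal node,
    // '1' marks a leaf and is followed by its symbol as 12 binary digits
    // (enough for LZW codes 0..4095).
    static void writeTree(HuffmanSketch.Node node, StringBuilder bits) {
        if (node.left == null && node.right == null) {
            bits.append('1');
            String symbol = Integer.toBinaryString(node.symbol);
            bits.append("000000000000".substring(symbol.length()));   // pad to 12 bits
            bits.append(symbol);
        } else {
            bits.append('0');
            writeTree(node.left, bits);
            writeTree(node.right, bits);
        }
    }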
Also, you clearly want good performance for your LZW dictionary, but which is the best data structure for it? You'll need to come up with a good choice and implement it (note that you're not allowed to say "Ummm, reducing coding time is a priority for me, so I'll use arrays/linked lists").
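To make the trade-off concrete (one option among several, not a recommendation): the encoder sketch earlier keys its dictionary on whole strings, which means building a new string for every lookup. An alternative is to key each entry on the pair (code of its prefix, next byte), so every key is a fixed-size int:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of an alternative LZW dictionary: instead of storing whole strings,
    // key each entry by the code of its prefix plus the next byte, packed into one int.
    public class PackedLzwDictionary {
        private final Map<Integer, Integer> table = new HashMap<>();

        private static int key(int prefixCode, int nextByte) {
            return (prefixCode << 8) | (nextByte & 0xFF);   // prefix code in the high bits, byte in the low bits
        }

        // Returns the code for (prefix + byte), or null if that sequence isn't in the dictionary yet.
        public Integer lookup(int prefixCode, int nextByte) {
            return table.get(key(prefixCode, nextByte));
        }

        public void add(int prefixCode, int nextByte, int newCode) {
            table.put(key(prefixCode, nextByte), newCode);
        }
    }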
You may also think of tweaks to the algorithms that make the encoder/decoder run faster, or make the compression better. If so, implement these ideas and compare. But this is not required.
Finally, one of the web resources is a link on bogus compression programs. Be prepared to show that your program is not an example of this. You may need to run the decoder program under the instructor's or a TA's computer account, to show that your program is really doing its job.