Project Option: Compression (lossy and non-lossy)

Overview

As hard drives get bigger and cheaper, file sizes grow, and computer users get lazier about cleaning up their files. Yes, no matter what happens, we always want to keep more files on our hard drives. Also, with the advent of the Internet, we're often downloading files. For those of us with slow modems, it's a real pain if files are large.

Most of us have probably used a file compression program like Zip or gzip (or GIF and JPEG for images, and MPEG variants (e.g. DivX) or QuickTime for video) in order to fit more files in the available space. In this project, you will implement at least one of the types of compression listed above.

Background

There are two types of compression: loss-less, where the decompressed file is bit-for-bit identical to the original (Zip, gzip, GIF), and lossy, where some information is deliberately thrown away in exchange for better compression (JPEG, MPEG).

Assignment

For this project option, you will need to learn and implement Huffman encoding as well as at least one other compression technique, and you will need to implement both an encoder and a companion decoder. Make sure that you specify which algorithms you will try when you submit your proposal. You are not restricted to JPEG and LZW; just make sure you write up what you are thinking in your proposal.

The main required data structure is the Huffman tree. This is a binary tree which is built bottom up (kind of like an uptree for union-find). Information on the Huffman tree (as well as the LZW algorithm, JPEG, and other compression topics) can be found in the Resources section.
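
To make that bottom-up construction concrete, here is a minimal Java sketch (the class and method names are just illustrative, not anything the assignment dictates). It uses java.util.PriorityQueue to repeatedly pull out and merge the two lowest-frequency nodes until only the root remains:

    import java.util.PriorityQueue;

    // Illustrative Huffman tree node: a leaf holds a symbol, an internal node
    // holds the combined frequency of everything underneath it.
    class HuffmanNode implements Comparable<HuffmanNode> {
        int frequency;
        char symbol;                  // meaningful only for leaves
        HuffmanNode left, right;

        HuffmanNode(char symbol, int frequency) {            // leaf
            this.symbol = symbol;
            this.frequency = frequency;
        }

        HuffmanNode(HuffmanNode left, HuffmanNode right) {   // internal node
            this.left = left;
            this.right = right;
            this.frequency = left.frequency + right.frequency;
        }

        public int compareTo(HuffmanNode other) {
            return Integer.compare(frequency, other.frequency);
        }
    }

    class HuffmanBuilder {
        // Build the tree bottom up: repeatedly merge the two least frequent nodes.
        // counts[c] is the number of times byte value c appears in the input.
        static HuffmanNode build(int[] counts) {
            PriorityQueue<HuffmanNode> queue = new PriorityQueue<>();
            for (int c = 0; c < counts.length; c++) {
                if (counts[c] > 0) {
                    queue.add(new HuffmanNode((char) c, counts[c]));
                }
            }
            while (queue.size() > 1) {
                HuffmanNode a = queue.poll();   // smallest frequency
                HuffmanNode b = queue.poll();   // next smallest
                queue.add(new HuffmanNode(a, b));
            }
            return queue.poll();                // the root (null for an empty input)
        }
    }

With a binary heap backing the priority queue, building the tree takes O(n log n) time for n distinct symbols; reading the codes back out is just a traversal that records the path (left = 0, right = 1) to each leaf.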

Before delving into writing your own hash tables, heaps, sort algorithms, or other such stuff, make sure you check your language's libraries (C++: STL, Java: JFC) for implementations that you can use. You may use any data structure you like (either your own previous code, new code, or code from a library), but the code for the compression algorithms themselves should be your own. Basically, use your common sense and you'll be fine.

There may be some decisions you'll need to make. For example, in static Huffman Encoding, you need to store your Huffman Tree in the resulting compressed file to be able to decode it later. There are many different ways of storing the trees; how efficient is your method? As there are many optimizations for these algorithms, you should try to make yours as fast as possible. Remember, algorithmic improvements will usually yield the largest speed-ups, but implementation smartness is also an issue (don't pass copies of huge data structures around all the time). Try to balance the two.
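
For example, one simple (though not necessarily the most compact) way to store a static Huffman tree is a pre-order traversal that writes one bit to distinguish internal nodes from leaves, plus 8 bits for each leaf's symbol. Here is a rough Java sketch along those lines, reusing the illustrative HuffmanNode above and using a StringBuilder of '0'/'1' characters as a stand-in for a real bit stream:

    // Write the tree in pre-order: '0' means internal node (its two children
    // follow), '1' means leaf (the 8 bits of its symbol follow).
    class TreeSerializer {
        static void write(HuffmanNode node, StringBuilder bits) {
            if (node.left == null && node.right == null) {      // leaf
                bits.append('1');
                for (int i = 7; i >= 0; i--) {
                    bits.append(((node.symbol >> i) & 1) == 1 ? '1' : '0');
                }
            } else {                                             // internal node
                bits.append('0');
                write(node.left, bits);
                write(node.right, bits);
            }
        }
    }

This costs 9 bits per leaf plus 1 bit per internal node; comparing that against, say, storing a table of frequency counts and rebuilding the tree on the decoding side would make a nice point in your write-up.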

When you've gotten your program working, try your compression program on a variety of data and see how well it works. Try at least some GIF files (in the case of JPEG, try compressing GIFs that have been converted to bitmaps). How well does this work, and why?

If you implement extensions or speed-ups, do a comparison of your program with and without your speed-up/extension. Be sure to document this stuff in your write-up.

Compression briefs

Loss-less compression algorithm generalities

Loss-less compression takes advantage of the fact that most files have tons of patterns in them: program source files have keywords, variable names, and syntactic constructs repeated over and over again, and in English text the letter 'e' is far more common than the letter 'x'. Or another pattern: the letter 'q' is rarely followed by anything other than the letter 'u'.

Compression programs can exploit certain patterns by encoding those patterns efficiently. This approach won't work if the files are random, but how many people have files filled with truly random data? That'd be pretty useless.

A bit about Huffman Encoding

Suppose you had a file like the following:
aaaaaaaaaaaaaaaaaTaaaaaaaaaaaaaaaaQaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaYaaaa

This file has lots of a's. If you wrote the file in ASCII, it'd take 8 bits per character. But wouldn't it be nice if there were some way of writing a's with fewer bits, even if it meant that other characters would take more bits? There are so many a's that maybe this would result in a smaller overall file. What do you think?

David Huffman thought so. He designed a data structure called the "Huffman Tree", which allows you to write frequently occurring letters with fewer bits and will often save space. Huffman Encoding uses Huffman Trees to compress files.
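
To put some made-up numbers on it: suppose the file above held 68 a's plus one T, one Q, and one Y. Plain 8-bit ASCII would take 71 * 8 = 568 bits. A Huffman code could assign a = 0, Y = 10, Q = 110, T = 111, so the character data would take only 68 + 2 + 3 + 3 = 76 bits (plus whatever space it takes to store the code itself so the decoder can undo it).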

Once you start looking at things, you'll see that there's "static" Huffman Encoding and "dynamic" or "adaptive" Huffman Encoding. For this project, you only need to do static encoding (which is probably easier). However, you certainly can do adaptive encoding.

A bit about LZW Compression

LZW combined with Huffman Encoding is a loss-less compression technique that exploits both repeated sequences and the fact that some things occur more frequently than others (the meaning of this sentence will probably take a while to sink in).

LZW compression is based on repeated sequences of letters (not necessarily words). It keeps a dictionary of recent sequences it's seen and assigns each of them a small numeric code. Typically, LZW will store up to 4096 sequences, numbered 0-4095 (which takes 12 bits). If one of these sequences occurs again, it writes the 12-bit number instead of the whole sequence. In practice, this makes the file smaller.

Did you notice that it maintains "a dictionary of recent sequences"? You might be able to think of some data structures that would be good to use for this...
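
As a concrete illustration, here is a minimal Java sketch of the LZW encoding loop. The class name and details are just illustrative, and it cheats in two ways: it uses Strings as dictionary keys (a trie would be the classier choice), and it returns the codes as a list of ints instead of packing them into a real 12-bit output stream.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class LzwEncoder {
        static final int MAX_CODES = 4096;           // 12-bit codes: 0-4095

        static List<Integer> encode(byte[] input) {
            // Start the dictionary with every single byte as codes 0-255.
            Map<String, Integer> dictionary = new HashMap<>();
            for (int i = 0; i < 256; i++) {
                dictionary.put("" + (char) i, i);
            }
            int nextCode = 256;

            List<Integer> output = new ArrayList<>();
            String current = "";
            for (byte b : input) {
                String extended = current + (char) (b & 0xFF);
                if (dictionary.containsKey(extended)) {
                    current = extended;                       // keep growing the match
                } else {
                    output.add(dictionary.get(current));      // emit the longest known match
                    if (nextCode < MAX_CODES) {               // remember the new sequence
                        dictionary.put(extended, nextCode++);
                    }
                    current = "" + (char) (b & 0xFF);
                }
            }
            if (!current.isEmpty()) {
                output.add(dictionary.get(current));          // flush the final match
            }
            return output;
        }
    }

One nice property: the decoder can rebuild exactly the same dictionary as it reads the codes, so the dictionary never has to be stored in the compressed file.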

With a bit of abstraction, you can use Huffman Encoding to improve the results of LZW compression (in fact, most people do this). Remember how LZW compression writes out 12-bit numbers? If some words repeat more than others, then some of these 12-bit numbers will be more frequent than others. Huffman Trees can exploit this fact, to make the files even smaller.

JPEG encoding

So it sounds like Huffman and LZW already do a fairly good job of the pattern-matching thing, and it would be hard to find any major improvements on these two schemes while remaining loss-less. However, if we drop the requirement of loss-less compression, we can get some big improvements: loss-less compression usually yields around 2:1 compression or less for binaries, while JPEG can get up to 20:1 without much noticeable degradation in quality.

The basic idea behind JPEG is to make it easier for something like Huffman encoding to compress well, by changing the input data so that it has more patterns. It does this by dropping information (oh no!), hence its lossy nature.

JPEG does its simplification by taking advantage of the way our eyes perceive images. First off, our eyes are much, much more sensitive to changes in brightness than to changes in color, so the first thing JPEG does is drop some of the color data. Next, it exploits the fact that neighboring pixels are related to one another: it takes the pixels of the image and re-encodes them into a form where the variance in the data can be reduced (this is done by a process called the Discrete Cosine Transform... think of it as a really fancy version of averaging). After this point, our data has been nicely polished and is ready to be further compressed by Huffman or its likes. Pretty SPIFFy eh? MPEG works similarly, except it also optimizes for movement between frames.
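
If you go the JPEG route, the Discrete Cosine Transform itself is not much code. Here is a rough Java sketch of the standard forward DCT on one 8x8 block, written as the direct (slow) formula rather than the fast factored version real encoders use; the block is assumed to hold pixel values already shifted to the range -128..127.

    class Dct8x8 {
        // Direct 2-D DCT of an 8x8 block; coeffs[0][0] is proportional to the
        // block's average brightness, and most of the remaining energy lands
        // in the low-frequency (upper-left) coefficients.
        static double[][] forward(double[][] block) {
            double[][] coeffs = new double[8][8];
            for (int u = 0; u < 8; u++) {
                for (int v = 0; v < 8; v++) {
                    double sum = 0.0;
                    for (int x = 0; x < 8; x++) {
                        for (int y = 0; y < 8; y++) {
                            sum += block[x][y]
                                 * Math.cos((2 * x + 1) * u * Math.PI / 16)
                                 * Math.cos((2 * y + 1) * v * Math.PI / 16);
                        }
                    }
                    double cu = (u == 0) ? 1 / Math.sqrt(2) : 1.0;
                    double cv = (v == 0) ? 1 / Math.sqrt(2) : 1.0;
                    coeffs[u][v] = 0.25 * cu * cv * sum;
                }
            }
            return coeffs;
        }
    }

The lossy step comes right after this: JPEG divides each coefficient by an entry from a quantization table and rounds to an integer. That rounding throws information away and leaves long runs of zeros, which is exactly the kind of pattern a Huffman-style entropy coder squeezes very well.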

Resources

This list will grow as I find more resources, so check back. Our textbook doesn't have anything on LZW compression or JPEG, but it does have a discussion of how to build a Huffman Tree (pp. 357-362 in the Java edition, pp. 395-401 in the C++ edition). Fortunately, there are also a number of well-written introductions to these compression techniques on the web:

Of course, the 326 staff is also a resource to help you understand these algorithms and data structures.

Extensions:

For extensions, you pretty much have carte blanche. Here are a few ideas; they are by no means the only things you can do as extensions.