Assignment A4: Hashing High-Volume Data -- Part II (B) Applying Your HashTable

Part IIB Overview

In Part IIA, you implemented a custom hash table that used separate chaining as its collision resolution method. Now, in Part IIB, you will incorporate it into an existing image analysis application. In addition to the files you have already worked with in Part IIA, there are two new Java files to download. They are available in the A4PartIIB-starter.zip archive.

The effect of this activity is to "close the loop" on Assignment 4. You will be actually implementing the techniques you learned in Part I, and using this application scenario to do an informal study of the empirical properties of hash functions and their relation to collision behavior. In addition, you will be able to visually evaluate the information losses that can occur during data compression. Finally, you have an opportunity to design, implement and evaluate your own hash function for a new type of data: Block objects.

Perform the implementation described below. Turn in your Java source files and README.txt file using Catalyst CollectIt prior to the Part IIB deadline of Wednesday, February 17 at 11:59 PM. The standard late policy applies. (Note, however, that the deadline for this part was moved back a week, because it was released behind schedule.)

New Application Functionality

The Image Analyzer's application menu includes a variety of items, where nothing happens when the user selects them. Your job is to make the corresponding functionality actually work. Here are the items.

1. Create palette of size 2
2. Create palette of size 4
3. Create palette of size 16
4. Create palette of size 256

5. Encode (Slow and Simple)
6. Encode (Fast)

7. Decode [ PROVIDED FOR YOU ]

8. Set Block Size to 4x4x4
9. Set Block Size to 8x8x8
10. Set Block Size to 16x16x16

11. Select Hash Function H1
12. Select Hash Function H2
13. Select Hash Function H3

14. Use Custom Hashtable with separate chaining
15. Use Custom Hashtable with linear probing

Note that 15 is needed only if you do a certain extra credit option.

Application Starter Code

This code is similar in some ways to the A2 (Image Enhancer) starter code, but it is tailored to this assignment. There is no user-interfacing left to do. For example, each of the menu items you might need will be already created by the starter code. Stubs or partial implementations are provided for the methods you will need to write.

The starter code is a stripped-down application with the name "Image Analyzer": (ImageAnalyzer.java). You should not need to modify this file. One additional Java file is provided: IndexedColor.java, which is where you will add your own code to the application (other than your custom hash table files).

Several image files are provided. They are included in the zip archive.

Instrument the Code and Report Performance Statistics

Reporting of stats: Report the following at the end of the color counting step and again at the end of the encoding step. This can be done using System.out.println();

  which hash function is currently in use.
    (and if you are using anything other than
    separate chaining, without rehashing, report
    what hash table policies are in force.

  number of pixels in the image.
  number of distinct color bins (i.e., number of POSSIBLE distinct keys
      in the hashtable).
  tableSize of the hashtable.
  number of key-value pairs in the hashtable.
  load factor.

You will be testing how various hash functions perform in terms of collisions and execution time.

When executing operations using your custom hashtable, add the following to the printed statistics after palette construction.

  average number of collisions per insertion.
  maximum number of collisions for any one insertion.
  average number of collisions per get operation.

Implement 3 hash functions

H1: map block B = (Br, Bg, Bb) by
 exclusive or'ing Br, Bg, and Bb together.

H2: second hash function, h2() defined in Part I.

H3: map block B by converting B to a string
  and using Java's string hashcode method.
 (You may do this in your own way.  Different solutions
    should end up behaving differently.)

Implement the Popularity Algorithm

Using the current parameters for block size s and U set by the user via the menu, process the current image and build the color table. Then encode the pixel array getting an array (2-dimensional) of int values. At the end of this step, enable the Decode menu item.

Test Data

Four test images are provided here. They are included in the new zip file linked above. One is the UW-Campus-1961.jpg The other is the (very small) image q3Image.png that corresponds to the example you worked by hand in Part I. Another is an image of Mount Rainier at a somewhat low resolution, that should be loaded automatically when your program starts.

Extra Credit

(Option A4E2) (5 points). Implement a second inner class within CustomHashtables.java with the name LinearProbingHashTable. This kind of hash table should use "open addressing" rather than separate chaining, and it should resolve collisions using linear probing. If you don't implement rehashing for this hash table, then choose an initial size that is large enough to accommodate the data of the following test. Compare the performance of the two collision-resolution methods by running both collision methods on palette construction for the UW-Campus image that comes with the starter code. In the comparison, use a block size of 16x16x16 and a palette size of 16. Describe clearly in your README.txt file what you did (including table sizes and final load factors) and what the results were.

(Option A4E3) (5 points). Create an Excel or similar spreadsheet that provides the following data. For each image in {q3Image.png, Mt-Rainier.jpg, UW-Campus-1961.jpg}, and for each hash function in {h1, h2, h3}, using block size 16 and palette size 256, and the fast encoding method the following results: number of pixels, encoding time, and number of hash collisions. There should be nine entries of three values each (pixels, encoding time, collisions) in this spreadsheet table.

Then add another table to your spreadsheet that shows, for the Mt-Rainier image alone, and for a size-256 palette and block size 16, for each of the two encoding methods (slow-and-simple, fast), and for each of hash functions (h1, h2, and h3), the resulting encoding time and number of collisions. There should be six entries (of two values each) in this spreadsheet table. If you also implemented the LinearProbingHashTable, include six more results for it in this second table (or create a third table for it). Turn in only a PDF of this spreadsheet. Your tables must clearly label each row and column, and not just give the data.

README.txt File

Create a .txt file named README.txt. In it, include the following.

 Your name and UWNetID
 
 1. Which extra credit options you implemented, if any, in both Part IIA and Part IIB.

 2. Brief explanation of your hash function h3 for blocks and why you think
    it should do a good job in terms of (a) key scattering and (b) efficiency.

 3. An answer to the question: "When building a size-256 palette for the UW image
    and block size 16, which hash function was most efficient?"

 3. Answers to the questions: (3a) "Under what hash function, palette size, 
    block size, and image does the fast encoding method run at least a factor of 10
    faster than the slow-and-simple method?" (3b) "Why is it faster in such a case?"

 4. An answer to the question: "What kind of compression ratio can one
    expect from indexed color encoding, assuming one is willing to
    accept a slightly noticeable distortion in the colors of an image?"

 5. Answers to the questiona: (5a) "How does changing the block size affect the color
    quality of the decoded image using a size-16 palette and the Mt-Rainier.jpg
    image?" (5b) why?

Turn in your README.txt file with your source files, via Catalyst CollectIt, before the Part II deadline.

Turn-in

Turn in your source files through our Catalyst CollectIt dropbox. Part II of this assignment (both A and B) is due Wednesday, February 10 at 11:59 PM. Late penalties apply (see the policies page for details).

Late addition: For anyone wondering what a "compression ratio" is, here it is. When data, such as an image, is "compressed", the original version (typically in a file) is processed so that it can be represented with fewer bits of information. This reduction in size is why the transformation of the representation is called compression. Suppose the original version required 20 megabits of information to represent and the new version required only 10 megabits. Then the compression ratio is the fraction NEW / OLD where OLD is the number of bits used by the original (here 20,000,000) and NEW is the number of bits used by the new version (here 10,000,000). For this example, the compression ratio is 1/2.

In this assignment, you may compute the size of the original image and the size of the indexed color representation using the following guidelines. An array of RGB values can be assumed to require 3 N bytes, where N is the number of pixels. A byte is 8 bits. (In actuality it may be 4 N bytes, due to the alpha value in each pixel, but ignore this since we are not doing anything with the alpha value anyway.) For the indexed color representation the number of bits is the sum of the numbers of bits for the palette and for the array of indexes. The palette bytes (corrected from bits) are 3 M, where M is the number of palette entries. The array of indices takes N b, where N is the number of pixels and b is the number of bits per pixel, which should be log (base 2) of the palette size M.

Version 1.0 Released Saturday, Feb. 6 at 4:05 PM. "Compresion ratio" explanation added Feb. 15. Also, clarification of "POSSIBLE" distinct keys added Feb. 15. "Bits" corrected to "bytes" in discussion of palette memory requirements Feb. 17.