CSE 373, Autumn, 2016 -- University of Washington |
In this pair of assignments you'll implement a kind of hash table and use it in an application: to compare images with each other for similarity of color usage. You'll do some experimenting with two different hashing collision-resolution policies: linear probing and quadratic probing. So the high-level idea here is "Comparing Composite Objects."
One approach to comparing any kinds of documents is counting the the occurrences of different elements they use and then comparing the counts from one document to another. For example, when comparing written essays, you can count how many times the word "therefore" is used in each essay and perhaps conclude that essay B is more logical in style than essay A because essay B uses the word "therefore" 10 times, whereas essay A uses it once or not at all.
For comparing paintings, we can count how many pixels have a particular color in a digitized version of each painting. The painting that has relatively more bright red pixels might be considered to be more lively or otherwise distinguished from the other. We will use a technique that considers all the colors, rather than any particular color like red by itself.
We will use hash tables to count the number of pixels of each color. Each dictionary entry is a pair, with a "ColorKey" as the key, and a nonnegative integer (a count) as the value. Since the space of possible colors can be large but the number of colors actually used tends to be considerably smaller, hash tables are one natural choice for this problem.
When counting colors, there is an interesting subtlety. Two similar colors, when represented with many bits of precision, may appear identical to the human eye but be off by just a little, so that the computer considers them as distinct. This can make comparing colors problematic. One way to address this is to map each color into a smaller-dimensional color space. How small should that color space be? This can be considered a research question, or if this program is to be an "app" then it is a research and development question. You'll try out multiple "ColorKey" spaces for the hashing, and this will have an effect on what the computer comes up with for its measure of similarity between images.
MonaLisa.jpg | StarryNight.jpg | ChristinasWorld.jpg |
Just as in Assignment 1, you will be using arrays to represent a more advanced abstract data type. Instead of stacks, however, you will be implementing hash tables.
The hash tables will have the usual functionality for inserting and retrieving key-value associations. However, the actual API will be a little unusual, with some extra methods that can help in debugging, automatic testing, and automatic grading.
By doing this assignment, you will learn how to implement a hash table, including collision resolution, and you will see how it can be used in an application such as comparing objects such as images or documents.
You are encouraged, but not required, to work in a partnership of two students for these two assignments. We expect that you will be in the same partnership in both assignments. However, if you feel it is necessary to change partnerships from Assignment 3 to Assignment 4, that will also be fine.
Start with the "CSE-373 A3 Starter" zip file. Create a Java project in Eclipse with the name CSE-373-A3-Hashing. Put the source files into the default package of the src folder. Put the image files into the project folder but not in the src folder.
We are giving you certain files. Add code to those which are not complete. Then test your program as specified. Here are the names of the files and what they are for.
You may implement any additional methods you like to serve as helper methods or debugging aids. Provide comments for each of these additional methods.
The file ComparePaintings.java will be your main application file. It already has some code in it.
There is a separate file FeatureVector.java with a class FeatureVector which has a constructor. You should add the implementation of the two incomplete methods there -- one is for getting the counts of the colors from a hash table. The other is computing the "cosine similarity" between a pair of feature vectors. This is a standard technique in information retrieval and document analysis, and it is often used for comparing web pages.
The ComparePaintings.java file implements a class whose instances handle analysis of one or more paintings. A basic constructor method is already provided. You should implement, for class ComparePaintings the following methods:
Bits Per Pixel C(Mona,linear) C(Mona,quadratic) C(Starry,linear) C(Starry,quadratic) C(Christina,linear) C(Christina,quadratic) 24 21 18 15 12 9 6 3In order to determine the number of collisions, your method countColors should keep a running total of the number of collisions as the pixel colors are counted. The number of collisions for each get and put operation are to be communicated from the ColorHash class back to the caller via the ResponseItem objects. For purposes of this assignment, we will define a collision to occur when either a put operation or a get operation arrives via the hash function or via collision resolution at a location that is occupied and that has the wrong key. (A put operation arriving at a location with the same key is an update operation and this is not a collision.) You may implement the logic for totaling up the numbers of collisions in either FeatureVector.java, ComparePaintings.java, or both. (late addition to the spec: For the collisionTests method, use an initial tableSize of 3 and a rehashLoadFactor value of 0.5.)
Bits Per Pixel S(Mona,Starry) S(Mona,Christina) S(Starry,Christina) 24 21 18 15 12 9 6 3As bits per pixel go down, the S values for S(x,y) can be expected to rise towards 1.0.
There are several different interpretations of "vector". For example, a vector is a sequence of numbers. An n-dimensional vector has n numbers. A vector can also be thought of as an object that has direction and magnitude. In physics, a vector may be used to represent velocity of motion; the direction represents which way the object in motion is going, and the magnitude tells how fast it is going.
A feature vector is a sequence of numbers that describes the features of some object. For us, the feature vector describes how much of each color is used in a particular painting. We want to measure the similarity of two paintings in terms of these feature vectors. The magnitude of a vector is the square root of the sum of the squares of the vector's elements. But the direction gives an indication of the relative values of the different elements -- relative to each other. Since two paintings could look very similar, but due to different sizes or digital resolution, could have vastly different counts of the pixels of each color, we want to avoid using magnitude of the vectors in our similarity comparison. Rather we will pay attention only to the relative directions. If the two vectors have the same direction, then the angle between them is zero, and we will call their similarity "high" and equal to 1.0. If the directions are as different as possible, the angle between them will be Pi over 2. The directions of our feature vectors cannot be negative due to the fact that all the counts of pixels are 0 or higher. So we cannot have directional differences of pi radians. We will consider 0.0 to be the lowest similarity value, and it will occur when the directions of the vectors are perpendicular to each other (in whatever high-dimensional space we are using).
Let A and B be two n-dimensional vectors. Then the cosine similary value between them
is computed using the following formula.
The numerator here is called the dot product of A and B. The denominator here is a scalar product of two magnitude values: the magnitude of A times the magnitude of B. When you implement this, you may wish to implement separate "helper" methods for magnitude and for dot product. If you do this, you are free to name those and design them however you like.
The writeup is part of Assignment 4 and should be turned in with the Assignment 4 materials. Create a text file named A4-Report.txt. Put the following information into it.
Name: UWNetID: Section: 1. Describe your approach to implementing the hash table. (separate arrays for keys and values vs one array with objects that represent pairs.) 2. If you implemented any methods other than those in the specification, describe them briefly here. 3. When you use 6 bits per pixel, how many black pixels are there in the Mona Lisa image? (These are the pixels whose ColorKey bits value equals 0.) 4. Provide a copy of the table of counts produced by your collisionTests method. 5. Provide a copy of the table of similarity values produced by your fullSimilarityTests method. 6. Examine the hashCode method of class ColorKey. What types of images might tend to cause lots of collisions relative to other images? 7. (Answer this only if you are doing the optional makeup-credit item)
By the Assignment 3 turn-in deadline, turn in your version of the ColorHash.java file. The team member whose last name appears first in an alphabetical ordering should turn in the file. If the team is a partnership, the file must have both partners' names on it in a comment at the top of the file. Turn in the following three files through our Catalyst CollectIt dropbox. This assignment is due Friday, October 28 at 11:59 PM.
Please be aware of the lateness policy.
If you lost any points on Assignment 1 (due to lateness, wrong files submitted or just points off), you can make up as much as 50 of those points by completing this "makeup credit option." Alternatively you may do this option and make up as many as 10 points that you miss on this assignment (A3) but not for points off due to lateness on A3.
Choose 10 different images of paintings from the web. You may include the three given images among these if you wish. In your report (under item 7), provide illustrations of the 10 images, clearly numbered from 1 to 10. Identify the source of each image (i.e., URL).
Create a table with 10 rows and 10 columns that gives the results of running the cosine similarity analysis on each pair. Present the table neatly in your rep;ort.
Also in your report, indicate which pair was most similar (not including any comparisons of an image to itself).
For creating the entries in this table, use only the 6 bits-per-pixel option for the ColorKey items.
Note: Credit in this option will not be awarded if this assignment (A3) is submitted late.
Design your ColorHash implementation so that it is modular and easy to debug. Think about appropriate helper methods to implement. Many of the operations require locating the hash table cell where a particular key resides or should reside. This would be a natural chunk of functionality to encapsulate in its own method, for example. When debugging, it can be helpful to have a method that prints out a nicely formatted "picture" of the entire hash table, showing the array indices, keys stored there, and values stored there. It's also helpful to know what locations are being probed during collision resolution.
If you are not familiar with JUnit testing, this assignment provides a nice opportunity to try it out and become a little more familiar with it. Eclipse makes it easy to incorporate JUnit testing into your project, and it can simply come down to putting one extra Java file into the project that specifies what testing you want performed. There are a number of adequate if not perfect online videos that explain the basics of JUnit testing.
A few tests are now available here (new version with more tests, as of Oct. 27, with modification to accept the throwing of a RuntimeException in getValueAt). Your getValueAt may also just return -1L when location specified is not being used by the hash table. Old version (for the record or if you want to start testing with fewer tests).
October 27 updates:
Version 1.0. November 1, 2016. last updated 12:11 PM (with suggestions for initial tableSize, and rehashLoadFactor in the Assignment 4 part of the project ). Previous update Oct. 27: 5:14 PM (updated the test file to the latest version: UnitTestColorHash-V0.51.java.
© S. Tanimoto, Oct., 2016.