CSE576 Spring 2005

Project 1

T. Scott Saponas

Introduction

For Project 1 I implemented a feature detector along with two feature descriptors: a simple window descriptor and a descriptor of my own design. Examples of each descriptor in action, and results on the provided benchmark picture set, are given below.

Simple Window Descriptor

The main goal of my simple window descriptor is to work well under translation. I tried several different window descriptors. Initially I tried a 3x3 grayscale window, which did not work very well, so I tried 5x5 and then 9x9 windows. Those did not work very well either. I thought I might just need a few more data points, so I tried separating the color channels, but that did not help much. I think the main reason these descriptors failed is that when a feature in one image is just a few pixels off in another image, much of the window changes. One way to address this is to make the window much bigger. Another student, Yongjoon Lee, suggested that a way to use a bigger window without being so sensitive to individual pixels is to essentially downsample it.

The way I do this is to take a 55x55 pixel window around the pixel of interest. Starting in one corner, I use a 10x10 pyramid mask to extract one pixel from the 10x10 region in that corner. Then I move the pyramid over by 5 pixels (so there are 5 pixels of overlap) and extract another pixel. After processing a row I move down by 5 pixels and repeat. So from a 55x55 window I extract a 9x9 pixel descriptor (an 81-dimension feature vector) built from pyramid samples across the window. I actually do this per color channel, so it is really an 81 x 3 dimension feature vector. This seemed to work pretty well for translation; below I show how many of the features are matched in the translated images.
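
Below is a rough sketch of this pyramid-sampling scheme in Python/NumPy (the function names, the exact block layout, and the handling of image borders are illustrative assumptions rather than how my code is actually organized):

    import numpy as np

    def pyramid_kernel(size=10):
        # 2-D "pyramid" weighting: peaked in the middle, tapering toward the edges.
        ramp = 1.0 - np.abs(np.linspace(-1.0, 1.0, size))
        k = np.outer(ramp, ramp)
        return k / k.sum()

    def pyramid_descriptor(image, row, col, window=55, block=10, step=5, grid=9):
        # image: H x W x 3 array; (row, col) is the feature location.
        # Take a window x window patch and reduce it to a grid x grid set of samples,
        # one pyramid-weighted average per overlapping block x block region, per channel.
        half = window // 2
        patch = image[row - half:row + half + 1, col - half:col + half + 1].astype(float)
        k = pyramid_kernel(block)[:, :, None]
        samples = []
        for r in range(grid):
            for c in range(grid):
                region = patch[r * step:r * step + block, c * step:c * step + block]
                samples.append((region * k).sum(axis=(0, 1)))
        return np.concatenate(samples)   # grid * grid * 3 = 243-dimensional vector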

Translated Images

Here are some examples of features that my simple window descriptor could match on a set of translated pictures that were provided.









My Feature Descriptor

I chose to implement a simple version of a SIFT feature. I was hoping to get a reasonably rotation- and translation-invariant descriptor by capturing the magnitudes of the gradients for pixels in the feature window. To make this rotation invariant, I bin these magnitudes according to the direction of the gradient relative to the principal direction of the feature. To be precise, I get the principal direction of a feature from the eigenvector corresponding to the larger eigenvalue of the Harris matrix. I then look at a 9x9 window around the pixel of interest and compute the direction and magnitude of the gradient at those pixels (I actually precompute these values). I have eight bins corresponding to 0-45 degrees, 45-90 degrees, and so on up to 360. For each of these pixels I take the difference between the principal direction and the gradient direction and add the gradient magnitude to the corresponding bin, weighted by a 9x9 Gaussian. Thus, for each feature I extract an 8-dimension feature vector.
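
A minimal sketch of this binning step (Python/NumPy; it assumes the gradient magnitude/direction images and the principal direction have already been computed, and the sigma of the Gaussian weight is an illustrative choice):

    import numpy as np

    def gaussian_kernel(size=9, sigma=2.0):
        # Gaussian weight over the 9x9 feature window.
        ax = np.arange(size) - size // 2
        g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
        k = np.outer(g, g)
        return k / k.sum()

    def orientation_histogram(grad_mag, grad_dir, row, col, principal_dir, size=9, bins=8):
        # grad_mag / grad_dir: precomputed gradient magnitude and direction (radians)
        # over the whole image; principal_dir is the feature's principal direction,
        # taken from the dominant eigenvector of the Harris matrix.
        half = size // 2
        mag = grad_mag[row - half:row + half + 1, col - half:col + half + 1]
        ang = grad_dir[row - half:row + half + 1, col - half:col + half + 1]
        weight = gaussian_kernel(size)
        # Gradient direction relative to the principal direction, wrapped into [0, 2*pi).
        rel = np.mod(ang - principal_dir, 2.0 * np.pi)
        bin_idx = np.minimum((rel / (2.0 * np.pi / bins)).astype(int), bins - 1)
        hist = np.zeros(bins)
        np.add.at(hist, bin_idx.ravel(), (mag * weight).ravel())
        return hist   # 8-dimensional descriptor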

This seems to work okay, but not that well, as you can see from the benchmark (below) and from the performance on some images of a Coke with Lime can (also below). I also tried it on some other rotated images, and it performed a little better. I think part of the reason it does not do better on the Coke can is that when the features are off by a few pixels, the areas of interest are very different. More specifically, since I calculate the principal direction from just a small window around the pixel of interest, that direction can change quite a bit if the corresponding features are a few pixels off. When the principal direction is computed incorrectly, binning the gradient magnitudes by relative gradient direction is no longer rotation invariant. It also seems that implementing only some of the SIFT machinery is nowhere near as effective as implementing full SIFT features.

Benchmark Performance

Simple Window Descriptor: average error: -1.#IND00 pixels
My Feature Descriptor: average error: 313.838991 pixels
Provided SIFT features: average error: 7.40 pixels
testSIFTMatch img1.key img2.key H1to2p 1 = 1.231172
testSIFTMatch img1.key img3.key H1to3p 1 = 2.459028
testSIFTMatch img1.key img4.key H1to4p 1 = 3.334123
testSIFTMatch img1.key img5.key H1to5p 1 = 22.559309

In doing this matching I thresholded on the ratio of the best match distance to the second-best match distance. I calibrated the threshold to what worked well for the simple window descriptor on the translated images and what worked somewhat for my descriptor on the rotated Coke can below.
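
For reference, a minimal version of this ratio test (Python/NumPy; the 0.6 threshold shown here is just a placeholder, not the calibrated value I used):

    import numpy as np

    def ratio_test_matches(desc1, desc2, ratio=0.6):
        # desc1, desc2: N1 x D and N2 x D arrays of feature descriptors.
        # Keep a match only if the best candidate is clearly better than the
        # second-best candidate (distance ratio below the threshold).
        matches = []
        for i, d in enumerate(desc1):
            dists = np.linalg.norm(desc2 - d, axis=1)
            order = np.argsort(dists)
            best, second = dists[order[0]], dists[order[1]]
            if second > 0 and best / second < ratio:
                matches.append((i, order[0]))
        return matches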

Strengths and Weaknesses

The strengths are that my simple window descriptor works to some extent on translated images, and my feature descriptor works a little on rotated images (see the Coke can below). The weakness is that my feature descriptor, which I wanted to be rotation invariant, seems to be only slightly rotation invariant. In the benchmark scores above we can see that SIFT does very well. This can be explained both by SIFT features being good and by my having the threshold set at a rather selective level. My simple window descriptor does very badly: I think an average error of "-1.#IND00" pixels is reported because no features get matched at all. This is not that surprising, since at my current selectivity level a simple window is not going to be invariant to the change in perspective in the GRAF image set. Similarly, my feature descriptor gets about 300 pixels of error on average on the GRAF set. This is also not surprising, because it was only designed to be rotation/translation invariant, not scale/affine/perspective invariant.

Coke with Lime

Here are some examples of features that my feature descriptor could match on a set of pictures of a Coke can I took, where I mostly just rotated the camera between pictures.