CSE576 Spring 2005
Project 1
T. Scott Saponas
Introduction
For Project 1 I implemented a feature detector along with two descriptors: a simple window descriptor and my own feature descriptor. Examples of each descriptor in action, along with benchmarks against the provided benchmark picture set, are given below.
Simple Window Descriptor
The main goal of my simple window descriptor is to work well under translation. I tried several different window descriptors. Initially I tried a 3x3 grayscale descriptor. That did not work very well, so I tried 5x5 and then 9x9 windows. Those did not work very well either. I thought I might just need a few more data points, so I tried separating the color channels, but that did not help much. I think the main reason these failed is that when a feature in one image is off by just a few pixels in the other image, much of the window changes. One way to solve this is to make the window much bigger. Another student, Yongjoon Lee, suggested that a way to use a bigger window without being as sensitive to individual pixels is to essentially downsample. So I take a 55x55 pixel window around the pixel of interest. Starting in one corner, I use a 10x10 pyramid mask to extract one value from the 10x10 region in that corner. Then I move the pyramid over by 5 pixels (so there are 5 pixels of overlap) and extract another value. After processing a row, I move down by 5 pixels and repeat. From the 55x55 window I thus extract a 9x9 grid of pyramid samples, an 81-dimensional feature vector. I actually do this per color channel, so it is really an 81 x 3 dimensional feature vector. This seemed to work fairly well for translation; a sketch of the sampling scheme is given below, followed by examples of how many features are matched in the translated images.
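As a rough illustration, the sampling scheme looks something like the following Python/NumPy code. This is a minimal sketch rather than my exact implementation: the triangular shape of the pyramid mask and the function names are assumptions, border handling is ignored, and note that a 9x9 grid of 10x10 masks at a 5-pixel stride strictly spans 50 pixels of the window.

    import numpy as np

    def pyramid_kernel(size=10):
        # Separable triangular ("pyramid") weighting, normalized to sum to 1.
        # The exact mask shape is an assumption.
        ramp = np.minimum(np.arange(1, size + 1),
                          np.arange(size, 0, -1)).astype(float)
        k = np.outer(ramp, ramp)
        return k / k.sum()

    def simple_window_descriptor(image, row, col, grid=9, mask=10, stride=5):
        # image: H x W x C float array. Returns a grid*grid*C vector of
        # pyramid-weighted averages from overlapping mask x mask patches
        # centered on (row, col); assumes the feature is away from the border.
        span = (grid - 1) * stride + mask
        half = span // 2
        k = pyramid_kernel(mask)[:, :, None]
        desc = np.empty((grid, grid, image.shape[2]))
        for i in range(grid):
            for j in range(grid):
                r0 = row - half + i * stride
                c0 = col - half + j * stride
                patch = image[r0:r0 + mask, c0:c0 + mask]
                desc[i, j] = (patch * k).sum(axis=(0, 1))
        return desc.ravel()

Because each sample is a weighted average over a 10x10 region, a shift of a pixel or two only slightly changes each entry, which is what makes this less sensitive than a raw pixel window.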
Translated Images
Here are some examples of features that my simple window descriptor could match on the provided set of translated pictures.
My Feature Descriptor
I chose to implement a simple version of a SIFT feature. I was hoping to get a reasonably rotation- and translation-invariant descriptor by capturing the magnitudes of the gradient for pixels in the feature window. To make this rotation invariant, I bin these magnitudes according to the direction of the gradient relative to the principal direction of the feature. To be precise, I get the principal direction for a feature from the eigenvector corresponding to the larger eigenvalue of the Harris matrix. I then look at a 9x9 window around the pixel of interest and compute the direction and magnitude of the gradient at those pixels (I actually precompute these values). I have eight bins corresponding to 0-45 degrees, 45-90 degrees, and so on up to 360. For each of these pixels I take the difference between the principal direction and the gradient direction and add the magnitude to the corresponding bin, weighted by a 9x9 Gaussian. Thus, for each feature I extract an 8-dimensional feature vector; a sketch of this computation is given below.
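A minimal sketch of this descriptor in Python/NumPy follows, assuming the principal direction comes straight from the Harris matrix eigenvector as described. The Gaussian sigma and the function names are my illustrative choices, not necessarily what the implementation used.

    import numpy as np

    def principal_angle(Ix, Iy):
        # Angle of the eigenvector for the larger eigenvalue of the 2x2
        # Harris matrix built from gradient patches Ix, Iy around the feature.
        H = np.array([[(Ix * Ix).sum(), (Ix * Iy).sum()],
                      [(Ix * Iy).sum(), (Iy * Iy).sum()]])
        vals, vecs = np.linalg.eigh(H)
        v = vecs[:, np.argmax(vals)]
        return np.arctan2(v[1], v[0])

    def gradient_histogram(gray, row, col, theta0, win=9, nbins=8):
        # Histogram of gradient magnitudes over a win x win window, binned by
        # gradient direction relative to the principal direction theta0
        # (radians) and weighted by a Gaussian centered on the feature.
        h = win // 2
        patch = gray[row - h - 1:row + h + 2,
                     col - h - 1:col + h + 2].astype(float)
        gy, gx = np.gradient(patch)
        gy, gx = gy[1:-1, 1:-1], gx[1:-1, 1:-1]  # central win x win block
        mag = np.hypot(gx, gy)
        rel = (np.arctan2(gy, gx) - theta0) % (2 * np.pi)
        yy, xx = np.mgrid[-h:h + 1, -h:h + 1]
        w = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * h ** 2))  # sigma = h assumed
        bins = (rel / (2 * np.pi / nbins)).astype(int) % nbins
        hist = np.zeros(nbins)
        np.add.at(hist, bins.ravel(), (mag * w).ravel())
        return hist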
This seems to work okay, but not that well, as you can see from the benchmark results (below) and from the performance on some images of a Coke with Lime can (also below). I also tried it on some other rotated images, and it performed a little better. I think part of the reason it does not do better on the Coke can is that when corresponding features are off by a few pixels, the areas of interest are very different. More specifically, since I calculate the principal direction from just a small window around the pixel of interest, that direction can change by quite a bit when the corresponding features are a few pixels apart. When the principal direction is computed incorrectly, binning the gradient magnitudes by relative gradient direction is no longer rotation invariant. It also seems that implementing just some pieces of SIFT is nowhere near as effective as implementing the full SIFT descriptor.
Benchmark Performance
Simple Window Descriptor: average error: -1.#IND00 pixels
My Feature Descriptor: average error: 313.838991 pixels
Provided SIFT features: average error: 7.40 pixels
Per-image-pair errors for the provided SIFT features (pixels):
testSIFTMatch img1.key img2.key H1to2p 1 = 1.231172
testSIFTMatch img1.key img3.key H1to3p 1 = 2.459028
testSIFTMatch img1.key img4.key H1to4p 1 = 3.334123
testSIFTMatch img1.key img5.key H1to5p 1 = 22.559309
In doing this matching I thresholded on the ratio of the distance to the best match over the distance to the second-best match. I calibrated the threshold to what worked well for the simple window descriptor on the translation images and what worked a little for my descriptor on the rotated Coke can below.
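A sketch of that ratio test in Python/NumPy; the 0.7 threshold here is just a placeholder for the calibrated value:

    import numpy as np

    def ratio_test_matches(desc1, desc2, ratio=0.7):
        # Match each row of desc1 (N1 x D) to its nearest neighbor in desc2
        # (N2 x D), keeping the match only when the best distance is clearly
        # smaller than the second best; a lower ratio is more selective.
        matches = []
        for i, d in enumerate(desc1):
            dists = np.linalg.norm(desc2 - d, axis=1)
            j1, j2 = np.argsort(dists)[:2]
            if dists[j1] < ratio * dists[j2]:
                matches.append((i, j1))
        return matches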
Strengths and Weaknesses
The strengths are that my simple window descriptor works to some extent on translated images, and my feature descriptor works a little on rotated images (see the Coke can below). The weakness is that my feature descriptor, which I wanted to be rotation invariant, turns out to be only slightly rotation invariant. In the benchmark scores above we can see that SIFT does really well. This can be explained both by SIFT features being good and by my having set the threshold at a rather selective level. My simple window descriptor seems to do very badly. I think an average error of "-1.#IND00" pixels is reported because no features get matched at all; this is not that surprising, since at my current selectivity level a simple window is not going to be invariant to the perspective change in the GRAF image set. Similarly, my feature descriptor gets about 300 pixels of error on average on the GRAF set. This is not that surprising either, because it was only designed to be rotation/translation invariant, not scale/affine/perspective invariant.
Coke with Lime
Here are some examples of features that my feature descriptor could match on a set of pictures of a Coke can I took, where I mostly just rotated the camera between pictures.