Peter Henry
CSE576 Computer Vision
Project 1

Feature Detector

I implemented the Harris corner detector as my only feature detector, associated with command line option 1. I first convolve the original grayscale image with the Sobel X and Y filters. I then create three images, (I_x * I_x), (I_x * I_y), and (I_y * I_y), storing the appropriate products of the Sobel responses. Each of these three images is convolved with a 5x5 Gaussian kernel (implemented using the equivalent 7x1 separable kernel). Finally, I compute the ratio of the determinant to the trace of the Harris matrix and store this as the Harris response. To find local maxima, I require that a point be the maximum in its 3x3 window and above a threshold of 0.01. Following are Harris response images for "graf/img1.ppm" and "yosemite/yosemite1.jpg":
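The detection pipeline above can be sketched as follows. This is an illustration, not the project code; the Gaussian `sigma` and the use of SciPy filters are assumptions, while the 3x3 local-maximum window and the 0.01 threshold come from the description above.

```python
import numpy as np
from scipy.ndimage import sobel, gaussian_filter, maximum_filter

def harris_corners(gray, sigma=1.0, threshold=0.01):
    """Harris response as det/trace of the smoothed structure tensor."""
    # Gradients from Sobel X and Y filters.
    Ix = sobel(gray, axis=1, mode="nearest")
    Iy = sobel(gray, axis=0, mode="nearest")
    # Smooth the three product images with a Gaussian.
    Ixx = gaussian_filter(Ix * Ix, sigma)
    Ixy = gaussian_filter(Ix * Iy, sigma)
    Iyy = gaussian_filter(Iy * Iy, sigma)
    det = Ixx * Iyy - Ixy * Ixy
    trace = Ixx + Iyy
    response = det / (trace + 1e-12)
    # Keep points that are the maximum in their 3x3 window
    # and whose response exceeds the threshold.
    local_max = response == maximum_filter(response, size=3)
    return np.argwhere(local_max & (response > threshold))
```

On a synthetic image of a bright square, this returns points near the square's corners and nothing along its edges, since the det/trace ratio is small when only one gradient direction is strong.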



Feature Descriptor

I first implemented the simple 5x5 grayscale image patch descriptor, the results of which are included in the plots of the results section. For my more general feature descriptor, I took inspiration from "Multi-Image Matching using Multi-Scale Oriented Patches" by Brown, Szeliski, and Winder. I continued to use my single-scale Harris corner detector, but for my descriptor, I do the following:

An important note is that I do not generate patches for features whose image patch would fall outside the image. These patches would have reduced matchability due to the missing information, so this descriptor will not generate features along the edges of an image.
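A MOPS-style descriptor of the kind described above can be sketched as follows. The patch size, sample spacing, and blur `sigma` here are illustrative assumptions, as is estimating the dominant orientation from the smoothed gradient at the feature point; only the overall recipe (blurred, oriented, bias/gain-normalized patch) is taken from the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel, map_coordinates

def mops_descriptor(gray, y, x, patch=8, spacing=5, sigma=4.5):
    """Sample a bias/gain-normalized patch from a blurred image,
    rotated to the dominant gradient orientation at (y, x)."""
    blur = gaussian_filter(gray, sigma)
    gx = sobel(blur, axis=1)[y, x]
    gy = sobel(blur, axis=0)[y, x]
    theta = np.arctan2(gy, gx)  # dominant orientation estimate
    c, s = np.cos(theta), np.sin(theta)
    # Sampling grid centered on the feature, rotated by theta.
    offs = (np.arange(patch) - (patch - 1) / 2.0) * spacing
    u, v = np.meshgrid(offs, offs)
    ys = y + (s * u + c * v)
    xs = x + (c * u - s * v)
    vals = map_coordinates(blur, [ys.ravel(), xs.ravel()], order=1)
    # Bias/gain normalization: zero mean, unit variance.
    vals = vals - vals.mean()
    return vals / (vals.std() + 1e-12)
```

The bias/gain normalization at the end is what provides the illumination invariance discussed later: adding a constant to the image or scaling its intensities leaves the descriptor unchanged.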

Results

ROC Plot for graf (my descriptor):
AUC for SSD matching: 0.632751
AUC for Ratio matching: 0.767705


ROC Plot for graf (simple 5x5 descriptor):
AUC for SSD matching: 0.653630
AUC for Ratio matching: 0.657511


ROC Plot for yosemite (my descriptor):
AUC for SSD matching: 0.950955
AUC for Ratio matching: 0.976500


ROC Plot for yosemite (simple 5x5 descriptor):
AUC for SSD matching: 0.781920
AUC for Ratio matching: 0.894587


Benchmarking results (all use ratio matching):

bikes:
average error: 324.398513 pixels
average AUC: 0.653120

graf:
average error: 294.377062 pixels
average AUC: 0.542049

leuven:
average error: 165.489341 pixels
average AUC: 0.768668

wall:
average error: 278.718948 pixels
average AUC: 0.664109
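The ratio matching used in these benchmarks can be sketched as follows: each descriptor in the first image is matched to its nearest neighbor (by SSD) in the second image, and the match is kept only if the best distance is sufficiently smaller than the second-best. The 0.8 cutoff is an illustrative assumption, not the value used in the project.

```python
import numpy as np

def ratio_match(desc1, desc2, max_ratio=0.8):
    """Nearest-neighbor matching with the ratio test."""
    matches = []
    for i, d in enumerate(desc1):
        ssd = np.sum((desc2 - d) ** 2, axis=1)  # SSD to all candidates
        order = np.argsort(ssd)
        best, second = order[0], order[1]
        # Accept only if the best match clearly beats the runner-up.
        if ssd[best] < max_ratio * ssd[second]:
            matches.append((i, best))
    return matches
```

The ratio test explains why ratio matching outperforms plain SSD matching in the plots above: it rejects ambiguous features that match several candidates almost equally well.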

Following is the result of detecting features in a test photo. Because I have no ground truth for my own images, it is difficult to measure the usefulness of these features.


Discussion

My feature descriptor is reasonably invariant with respect to translation, orientation (rotations about the camera view ray), and illumination. It is also readily understood and efficiently implemented.

One weakness of my method is the fixed threshold for determining Harris corner points. The threshold I selected generates 14779 feature points for "wall/img1.ppm" but only 86 feature points for "bikes/img6.ppm". The wall image has a great many locally distinctive features, while the bikes image is quite blurry and yields very few. This suggests that an adaptive threshold would substantially even out the number of features per image and would eliminate the guesswork involved in choosing my threshold. An adaptive scheme could also be used to spread the feature points spatially across the image.
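One way to realize the adaptive, spatially spreading selection suggested above is the adaptive non-maximal suppression (ANMS) of the MOPS paper: keep the points with the largest suppression radius, i.e. the distance to the nearest point of strictly larger response. This is a sketch of that idea, not part of the implemented project; the `n_keep` value is an arbitrary assumption.

```python
import numpy as np

def anms(points, responses, n_keep=500):
    """Keep the n_keep points with the largest suppression radius."""
    pts = np.asarray(points, dtype=float)
    resp = np.asarray(responses, dtype=float)
    radii = np.full(len(pts), np.inf)
    for i in range(len(pts)):
        stronger = resp > resp[i]  # points that suppress point i
        if stronger.any():
            d2 = np.sum((pts[stronger] - pts[i]) ** 2, axis=1)
            radii[i] = np.sqrt(d2.min())
    # The globally strongest point has infinite radius and ranks first.
    return np.argsort(-radii)[:n_keep]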

My feature points are not scale invariant, with the only invariance coming from the blurring before sampling of the image patch. This may explain the weak performance on the graf dataset, in which many feature points change depth between images. Using multi-scale harris features could alleviate this problem.

Though my feature points attempt to be invariant to rotations around an axis parallel to the view ray, they are not invariant to planar rotations around other axes. Again, the blurred, large image patch makes an attempt at invariance to small rotations of this type, large transformations between images will foil my descriptor.

Finally, it is worth noting that my descriptor uses only grayscale values, meaning that intensity equivalent patches of different colors will be indistinguishable.

Possible Improvements

The first thing to do would be to implement an adaptive threshold. Also, the size of the gaussians used for determining dominant orientation and for patch sampling should be revisited and tweaked with more care.