CSE576 Project 1:
Feature Detection & Matching

Noah Snavely --- 04/14/2005

 


Introduction

For this project, we were to implement a detector for finding repeatable locations in an image and to devise feature descriptors that are invariant to different kinds of transformations (e.g., rotations, changes in illumination). My main focus was to experiment with different kinds of descriptors and to gain some intuition about which descriptors work well.

For feature detection, I implemented the multi-scale Harris corner detector described in [1]. Given the locations found with this detector, I tested several descriptors:

  1. Oriented grayscale patches [1]
  2. Variants of 1:
    a. Color patches
    b. Patches in log-polar coordinates (a different parameterization of image space)
  3. Distance transforms

I expected that using color would give a performance boost because it carries more information than grayscale, that sampling in log-polar coordinates would make the descriptor more robust to affine image warps, and that using distance transforms would improve robustness to affine warps and to errors in feature localization and orientation. My tests were not extensive enough to show whether these expectations were correct, but for the most part the patch variants (color and log-polar) and the distance transform descriptors fared better than the plain grayscale patches. The distance transforms and log-polar features performed best on average. However, for panorama stitching applications, any of these descriptors would likely work well in most cases.



Exposition

Feature detection

Given an input image, I first create an image pyramid by repeatedly convolving the image with a Gaussian with sigma=1.0 and downsampling by two. To detect feature points, I follow these steps for each level of the pyramid:
  1. First, I run the Harris corner detector (with weights given by a Gaussian with sigma=1.6) over the image, and mark all pixels whose cornerness is a maximum in a 3x3 window as candidate features.
  2. Next, I throw out all candidates except the 500 with the highest cornerness.
  3. I then find a subpixel location for each feature by fitting a quadratic to the local 3x3 neighborhood and solving for its maximum analytically. A quadratic function of two variables has six coefficients, and the 3x3 neighborhood provides nine equations, so the coefficients of the best-fit quadratic can be found by solving a least-squares problem (a small sketch of this step appears after the list).
  4. Finally, I find an orientation for the subpixel location by computing the local gradient of the image smoothed by a Gaussian with sigma=1.2.
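
Concretely, the least-squares fit in step 3 looks roughly like the following Python sketch (not the project code; the 3x3 cornerness neighborhood is assumed to be available as a small numpy array):

import numpy as np

def subpixel_peak(corner3x3):
    # Fit f(dx, dy) = a*dx^2 + b*dy^2 + c*dx*dy + d*dx + e*dy + f to the
    # nine cornerness samples around the candidate pixel (least squares).
    A, rhs = [], []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            A.append([dx * dx, dy * dy, dx * dy, dx, dy, 1.0])
            rhs.append(corner3x3[dy + 1, dx + 1])
    a, b, c, d, e, _ = np.linalg.lstsq(np.array(A), np.array(rhs), rcond=None)[0]
    # The extremum is where the gradient of the quadratic vanishes:
    #   2a*dx + c*dy + d = 0
    #   c*dx + 2b*dy + e = 0
    # (in practice one would also check that the fit is actually a maximum)
    return np.linalg.solve(np.array([[2 * a, c], [c, 2 * b]]), [-d, -e])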

Feature descriptors

I implemented four kinds of descriptors:
Oriented grayscale patches. These are the features described in [1]. For each feature location, I sample 9x9 patches at 4 times the image scale (i.e., two levels up in the image pyramid). I then normalize each patch so that its mean is 0 and standard deviation is 1.
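
Roughly, the sampling and normalization can be sketched as follows (Python, not the project code; the sketch assumes the sampling grid is rotated to the feature orientation, and level_img, the feature coordinates, and the bilinear helper are placeholders):

import numpy as np

def bilinear(img, x, y):
    # Simple bilinear lookup into a 2D grayscale array.
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    fx, fy = x - x0, y - y0
    x0, y0 = np.clip(x0, 0, img.shape[1] - 1), np.clip(y0, 0, img.shape[0] - 1)
    x1, y1 = min(x0 + 1, img.shape[1] - 1), min(y0 + 1, img.shape[0] - 1)
    return ((1 - fx) * (1 - fy) * img[y0, x0] + fx * (1 - fy) * img[y0, x1] +
            (1 - fx) * fy * img[y1, x0] + fx * fy * img[y1, x1])

def grayscale_patch(level_img, x, y, angle, size=9):
    # level_img: the pyramid level two octaves above the detection level;
    # (x, y): the feature location in that level's coordinates.
    ca, sa = np.cos(angle), np.sin(angle)
    half = (size - 1) / 2.0
    patch = np.empty((size, size))
    for r in range(size):
        for c in range(size):
            u, v = c - half, r - half
            # rotate the sampling grid by the feature orientation
            patch[r, c] = bilinear(level_img, x + ca * u - sa * v,
                                   y + sa * u + ca * v)
    # bias/gain normalization: zero mean, unit standard deviation
    return ((patch - patch.mean()) / (patch.std() + 1e-8)).ravel()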

Color patches. These are the same as the grayscale patches, but instead of combining the color channels into a grayscale image, I keep them separate and concatenate them into a vector three times as long. Rather than using RGB space, I first convert the colors to LUV space (this gave a modest improvement). I didn't normalize each channel separately; instead, I assumed that intensity changes would not affect the chrominance of a pixel (at least not by too much). Hence, I normalized the L channel and divided the U and V channels by the (pre-normalized) standard deviation of the L channel.
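
A rough sketch of this normalization scheme is below (the LUV conversion uses skimage's rgb2luv purely for illustration, and patch_rgb stands for a 9x9x3 RGB patch sampled the same way as the grayscale patches):

import numpy as np
from skimage.color import rgb2luv

def color_descriptor(patch_rgb):
    luv = rgb2luv(patch_rgb)
    L, U, V = luv[..., 0], luv[..., 1], luv[..., 2]
    sigma_L = L.std() + 1e-8          # pre-normalization standard deviation of L
    L = (L - L.mean()) / sigma_L      # full bias/gain normalization of L
    U = U / sigma_L                   # U and V are only divided by sigma_L,
    V = V / sigma_L                   # assuming gain changes leave chrominance alone
    return np.concatenate([L.ravel(), U.ravel(), V.ravel()])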

Log-polar patches. Polar coordinates are an alternative parameterization of image space to Cartesian coordinates. To convert from Cartesian to polar, we can use the equations:

rho = sqrt((x - xc) * (x - xc) + (y - yc) * (y - yc))
theta = atan2(y - yc, x - xc)

where (xc, yc) is the chosen origin (for the descriptors below, the feature location) and atan2 is the two-argument arctangent, which recovers the full range of angles.

Log-polar coordinates are the same as polar coordinates, except that the rho parameter grows logarithmically, rather than linearly, with the distance to the origin:

rho = log(sqrt((x - xc) * (x - xc) + (y - yc) * (y - yc)))
theta = atan2(y - yc, x - xc)

A uniform grid in log-polar coordinates is shown in the image to the right.

I thought that sampling the image around a pixel in log-polar rather than Cartesian coordinates might have some benefits. Log-polar sampling has approximately the same effect as comparing rectangular patches with a Gaussian weighting that downweights samples according to their distance from the patch center, but it can be represented more compactly. Also, along any ray shooting out of the center of a patch, a scale or shear moves pixels by a distance that grows linearly with the distance along the ray, and log-polar coordinates seem better equipped than Cartesian coordinates to handle this kind of pixel uncertainty. I therefore expected them to be more robust to perspective image warps than ordinary grayscale patches.

For the experiments presented below, to create log-polar features around an image location, I divided the rho and theta dimensions into eight bins each, giving 64 total bins. The smallest ring had width 1, and each successive ring was 1.5 times the width of the previous ring. For each pixel contained in the largest ring, I computed its rho and theta coordinates and used bilinear interpolation to divide its intensity into the nearest four bins. In the end, this resulted in a feature with 64 coordinates.
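
The binning can be sketched roughly as follows (not the project code; the handling of the innermost and outermost bin edges is a guess, and the patch is assumed to be already rotated to the feature orientation and bias/gain normalized):

import numpy as np

def log_polar_descriptor(patch, n_rho=8, n_theta=8, growth=1.5):
    # Outer radius of each of the eight rings: widths 1, 1.5, 1.5^2, ...
    radii = np.cumsum(growth ** np.arange(n_rho))
    cy = cx = (patch.shape[0] - 1) / 2.0
    hist = np.zeros((n_rho, n_theta))
    for y in range(patch.shape[0]):
        for x in range(patch.shape[1]):
            r = np.hypot(x - cx, y - cy)
            if r > radii[-1]:
                continue                      # outside the largest ring
            # continuous bin coordinates in rho (ring index) and theta
            ring = np.searchsorted(radii, r)
            inner = 0.0 if ring == 0 else radii[ring - 1]
            rho_c = ring + (r - inner) / (radii[ring] - inner) - 0.5
            theta_c = (np.arctan2(y - cy, x - cx) / (2 * np.pi) % 1.0) * n_theta - 0.5
            # bilinearly split the pixel's intensity into the four nearest bins
            r0, t0 = int(np.floor(rho_c)), int(np.floor(theta_c))
            fr, ft = rho_c - r0, theta_c - t0
            for rbin, wr in ((r0, 1 - fr), (r0 + 1, fr)):
                if 0 <= rbin < n_rho:
                    for tbin, wt in ((t0 % n_theta, 1 - ft), ((t0 + 1) % n_theta, ft)):
                        hist[rbin, tbin] += wr * wt * patch[y, x]
    return hist.ravel()                       # 64-dimensional descriptor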


Distance transforms. A distance transform of a grayscale image is a function over a 3D grid in (x, y, intensity) space, where each grid cell contains the distance to the closest point of the image itself (i.e., to the (x, y, intensity) triple of some pixel). The closest point can be defined in any number of ways, but here I found the point with the smallest weighted Euclidean distance:
sqrt((x - x')^2 + (y - y')^2 + lambda * (i - i')^2)
where lambda=0.012 (this number was experimentally determined to give good results).

Normally, distance transforms are used to compute the chamfer distance between two point sets efficiently. However, I don't know of a way to accelerate nearest neighbor queries under a chamfer distance (such as by building a kd-tree). Instead of matching image patches by finding nearest neighbors under the chamfer distance, I compare the distance transforms of the two patches directly, by treating them as vectors and computing the Euclidean distance. I do not have a good intuition about what it means to compare two distance transforms, but my rough guess is that the distance between two distance transforms is an approximation to the chamfer distance (or a scaled version thereof). I did try the chamfer matching technique, but it seemed to perform worse than comparing the distance transforms themselves. I'm not sure why this would be the case; further investigation is required.

The reason I thought to compare distance transforms is that the chamfer distance between two images is often more robust to small geometric image warps than the Euclidean distance. Therefore, I expected that distance transform descriptors would handle small discrepancies in patch location and orientation better than grayscale patches.

To compute a distance transform feature around an (oriented) feature location, I computed a distance transform for a 17x17 patch centered around that location at twice the scale, after bias and gain normalizing the patch. I used 3-bit intensity values in the distance transform (i.e., the distance transform grid was 17x17x8). I then used the 9x9x8 sub-grid in the center of the distance transform as the feature.
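
A brute-force sketch of this construction is below (the actual implementation may use a faster distance transform algorithm, and the mapping of the normalized intensities onto the eight levels is an assumption):

import numpy as np

def distance_transform_3d(patch, n_levels=8, lam=0.012):
    # patch: a bias/gain-normalized 17x17 grayscale patch
    h, w = patch.shape
    # quantize the normalized intensities into n_levels (3-bit) values
    lo, hi = patch.min(), patch.max()
    levels = np.clip(((patch - lo) / (hi - lo + 1e-8) * n_levels).astype(int),
                     0, n_levels - 1)
    # the "points in the image": one (x, y, intensity) triple per pixel
    pts = np.array([(x, y, levels[y, x]) for y in range(h) for x in range(w)],
                   dtype=float)
    dt = np.zeros((h, w, n_levels))
    for y in range(h):
        for x in range(w):
            for i in range(n_levels):
                d2 = ((pts[:, 0] - x) ** 2 + (pts[:, 1] - y) ** 2
                      + lam * (pts[:, 2] - i) ** 2)
                dt[y, x, i] = np.sqrt(d2.min())
    return dt

# The descriptor is the central 9x9x8 block of the transform, e.g.
# feat = distance_transform_3d(patch)[4:13, 4:13, :].ravel()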


Feature matching

Finally, for the feature matching part of the project, I implemented two methods for detecting correct matches: thresholding the distance to the closest match, and thresholding the ratio of the distances to the first and second closest matches (I used the ANN kd-tree implementation [2] to speed up the nearest neighbor queries). The ratio threshold test significantly outperformed the distance threshold test.
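
The ratio test itself is simple; here is a small sketch with scipy's cKDTree standing in for ANN:

import numpy as np
from scipy.spatial import cKDTree

def match_ratio_test(desc1, desc2, ratio=0.6):
    # desc1, desc2: (n1, d) and (n2, d) arrays of descriptors for the two images
    tree = cKDTree(desc2)
    dists, idx = tree.query(desc1, k=2)       # two nearest neighbors per query
    matches = []
    for i, ((d1, d2), (j, _)) in enumerate(zip(dists, idx)):
        if d1 < ratio * d2:                   # keep only unambiguous matches
            matches.append((i, j))
    return matches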



Results

Benchmark performance

I compared the performance of the four feature descriptors and SIFT [3] on the four benchmark sets (Graf, Leuven, Bikes, and Wall). I ran two basic experiments.

Experiment 1. For each test set, I matched each feature in image 1 to the closest feature in image k, and computed the percentage of matches whose ground-truth error was less than 3 pixels. The results are shown in the four graphs below: the index k of the image being compared with image 1 runs along the x-axis, and the percentage of correct matches runs along the y-axis.
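
For reference, the correctness check can be sketched as follows (this assumes each benchmark set provides a 3x3 homography H mapping image 1 into image k, which is how these sequences specify ground truth; the names below are placeholders, not the project code):

import numpy as np

def percent_correct(kp1, kpk, matches, H, tol=3.0):
    # kp1, kpk: lists of (x, y) feature locations in image 1 and image k;
    # matches: list of (i, j) index pairs from the matching step.
    correct = 0
    for i, j in matches:
        x, y = kp1[i]
        px, py, pw = H @ np.array([x, y, 1.0])
        gx, gy = px / pw, py / pw             # ground-truth location in image k
        if np.hypot(kpk[j][0] - gx, kpk[j][1] - gy) < tol:
            correct += 1
    return 100.0 * correct / max(len(matches), 1)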

[Graphs: percent of correct matches vs. image index for the Graf, Leuven, Bikes, and Wall sequences]

Generally, the log-polar patches and distance transforms had more correct matches than the grayscale patches. This better performance is likely attributable to their increased robustness to geometric warps in the image, although further experiments are required to show this more rigorously (it is unclear, for instance, why the log-polar descriptor does much better on the Wall sequence but only slightly better on the Graf sequence). No descriptor did well under extreme shifts in perspective, however (although this may be due to problems with detection as well). The color patches also performed better than grayscale, except on the Leuven test set. That set has significant intensity variation, so the relatively poor performance of the color patches on Leuven suggests that my constant-chrominance assumption was not well founded (as Rick mentioned in class, cameras can make automatic light temperature adjustments). The performance of SIFT varied quite a bit: on the Bikes sequence it generally had the best performance (it might be that gradient directions are more robust to blurring than grayscale values), but on the Wall sequence it had the lowest.


Experiment 2. For the second test, I computed the same percentage as in Experiment 1, but only considered matches where the ratio of the distances to the first and second closest matches was less than 0.6.

[Graphs: percent of correct matches with the 0.6 ratio test, for the Graf, Leuven, Bikes, and Wall sequences]

Using the ratio test to cull bad matches is quite effective, and increased the percentage of correct matches significantly for all of the descriptors. However, performance still fell sharply with changing perspective.


Finally, here is a screenshot showing matching features (using the distance transform descriptors) between two images of Half Dome in Yosemite National Park (I tried to match all the features, then kept the matches that passed the ratio test). The matches deemed correct are shown in green.



Conclusion

The design space for feature descriptors is extremely large, and I tried out just a few possibilities in this project. Though features based on distance transforms performed fairly well, they are quite a bit larger than intensity windows because of the added dimension, and they take longer to compute (the size could perhaps be reduced with dimensionality-reduction techniques, however). The log-polar descriptors also performed well, and they are just as compact as the rectangular grayscale windows and quick to compute. However, more rigorous experiments are needed to characterize the behavior of these descriptors more exactly. For panorama stitching, any of these descriptors will likely work well under normal circumstances.



References

[1] M. Brown, R. Szeliski, and S. Winder. Multi-Image Matching Using Multi-Scale Oriented Patches. CVPR 2005 (to appear).
[2] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An Optimal Algorithm for Approximate Nearest Neighbor Searching. Journal of the ACM, 45 (1998), 891-923.
[3] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2) (2004), 91-110.
