For this project, we were to implement a detector for finding
repeatable locations in an image, and devise feature descriptors that
are invariant to different kinds of transformations (e.g., rotations,
changes in illumination). My main focus was to experiment with
different kinds of descriptors, and to gain some intuition about what
descriptors work well.
For feature detection, I implemented the multi-scale Harris corner
detector described in [1]. Given the locations found with this
detector, I tested several descriptors:
- Oriented grayscale patches [1]
- Variants of the oriented patches:
- Color patches
- Patches in log-polar coordinates (a different
parameterization of image space)
- Distance transforms
I expected that using color would give a performance boost because it
has more information than grayscale, that sampling in log-polar
coordinates would make the descriptor more robust to affine image
warps, and that using distance transforms would improve robustness to
affine warps and errors in feature localization and orientation. My
tests were not extensive enough to show whether or not my expectations
were correct, but for the most part the grayscale patch variants and
distance transform descriptors fared better than the grayscale
patches. The distance transforms and log-polar features performed
better than the others on average. However, for panorama stitching
applications, any of these descriptors would likely work well in most
cases.
Feature detection
Given an input image, I first create an image pyramid by repeatedly
convolving the image with a Gaussian with sigma=1.0 and downsampling
by two. To detect feature points, I follow these steps for each level
of the pyramid:
- First, I run the Harris corner detector (with weights given by a
Gaussian with sigma=1.6) over the image, and mark all pixels whose
cornerness is a maximum in a 3x3 window as candidate features.
- Next, I throw out all candidates except the 500 with the highest
cornerness.
- I then find a subpixel location for each feature by fitting a
quadratic to the local 3x3 neighborhood, and solving for the maximum
analytically. A quadratic function with two independent variables has
six coefficients, and the 3x3 neighborhood provides nine equations, so
the coefficients of the best fit quadratic can be found by solving a
least-squares problem.
- Finally, I find an orientation for the subpixel location by
computing the local gradient of the image smoothed by a Gaussian with
sigma=1.2.
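The detection steps above can be sketched as follows. This is a minimal sketch using NumPy/SciPy; the harmonic-mean cornerness, window sizes, and candidate count follow the text, but the function names are mine, and the pyramid loop and orientation step are omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_cornerness(img, sigma=1.6):
    # Gradients of the image
    Iy, Ix = np.gradient(img)
    # Entries of the Gaussian-weighted second moment matrix
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    # Harmonic-mean cornerness, det(H) / tr(H), as in [1]
    return (Sxx * Syy - Sxy ** 2) / (Sxx + Syy + 1e-12)

def detect(img, n_best=500):
    c = harris_cornerness(img)
    # Candidates: pixels whose cornerness is a maximum in a 3x3 window
    is_max = c == maximum_filter(c, size=3)
    ys, xs = np.nonzero(is_max)
    # Keep only the n_best strongest candidates
    order = np.argsort(c[ys, xs])[::-1][:n_best]
    return list(zip(ys[order], xs[order]))

def subpixel_offset(c, y, x):
    # Fit f(dx, dy) = a*dx^2 + b*dy^2 + g*dx*dy + d*dx + e*dy + f
    # to the 3x3 cornerness neighborhood: 9 equations, 6 unknowns,
    # solved by least squares
    dys, dxs = np.mgrid[-1:2, -1:2]
    A = np.stack([dxs**2, dys**2, dxs*dys, dxs, dys, np.ones_like(dxs)],
                 axis=-1).reshape(9, 6).astype(float)
    rhs = c[y-1:y+2, x-1:x+2].reshape(9)
    a, b, g, d, e, _ = np.linalg.lstsq(A, rhs, rcond=None)[0]
    # The maximum is the stationary point of the quadratic: grad f = 0
    dx, dy = np.linalg.solve([[2*a, g], [g, 2*b]], [-d, -e])
    return dx, dy
```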
Feature descriptors
I implemented four kinds of descriptors:
1. Oriented grayscale patches. These are the features
described in [1]. For each feature location, I sample 9x9 patches at
4 times the image scale (i.e., two levels up in the image pyramid). I
then normalize each patch so that its mean is 0 and standard deviation
is 1.
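The sampling-and-normalization step might look like this. It is a sketch under my own naming; the pyramid-level lookup and the rotation of the patch to the feature orientation are omitted.

```python
import numpy as np

def normalized_patch(img, y, x, size=9):
    # Sample a size x size patch centered at (y, x); in the full
    # pipeline this would happen two pyramid levels up and after
    # rotating to the feature orientation (both omitted here).
    r = size // 2
    p = img[y - r:y + r + 1, x - r:x + r + 1].astype(float)
    # Bias/gain normalization: zero mean, unit standard deviation,
    # which gives invariance to affine changes in intensity.
    p -= p.mean()
    std = p.std()
    return p / std if std > 0 else p
```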
2. Color patches. These are the same as 1, but instead of
combining the color channels into a grayscale image, I keep them
separate, and concatenate them into a vector three times as long.
Rather than using RGB space, I first convert the colors to LUV space
(this gave a modest improvement). I didn't normalize each channel
separately; instead, I assumed that intensity changes would not affect
the chrominance of a pixel (at least not by much). Hence, I
normalized the L channel and divided the U and V channels by the
(pre-normalized) standard deviation of the L channel.
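The normalization scheme just described can be sketched as below. The RGB-to-LUV conversion itself is omitted, and the function name is mine.

```python
import numpy as np

def normalize_luv_patch(L, U, V):
    # Normalize the L channel to zero mean / unit std, but scale U and
    # V by the pre-normalization std of L rather than normalizing each
    # channel separately (the constant-chrominance assumption).
    mu, sigma = L.mean(), L.std()
    sigma = max(sigma, 1e-9)
    Ln = (L - mu) / sigma
    return np.concatenate([Ln.ravel(), (U / sigma).ravel(), (V / sigma).ravel()])
```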
3. Log-polar patches. Polar coordinates are an alternative
parameterization of image space to Cartesian coordinates. To convert
from Cartesian to polar, we can use the equations:
rho = sqrt((x - xc)^2 + (y - yc)^2)
theta = atan2(y - yc, x - xc)
where (xc, yc) is the origin of the image (atan2 resolves the quadrant
ambiguity of atan).
Log-polar coordinates are the same as polar coordinates, except that
the rho parameter grows logarithmically, rather than linearly, with
the distance to the origin:
rho = log sqrt((x - xc)^2 + (y - yc)^2)
theta = atan2(y - yc, x - xc)
A uniform grid in log-polar coordinates is shown in the image to the
right.
I thought that sampling the image in log-polar rather than Cartesian
coordinates around a pixel might have some benefits. Log-polar
sampling has approximately the same effect as comparing rectangular
patches by using a Gaussian to downweight samples as a function of
their distance from the center of the patch, but can be represented
more compactly. Along a ray shooting out of the center of a patch in
any direction, a scale or shear will move pixels a distance that grows
linearly with the distance along the ray, and log-polar coordinates
seem better equipped (than Cartesian coordinates) to handle this kind
of pixel uncertainty, so I expected that they would be more robust to
perspective image warps than normal grayscale patches.
For the experiments presented below, to create log-polar features
around an image location, I divided the rho and theta dimensions into
eight bins each, giving 64 total bins. The smallest ring had width 1,
and each successive ring was 1.5 times the width of the previous ring.
For each pixel contained in the largest ring, I computed its rho and
theta coordinates, and used bilinear interpolation to divide its
intensity into the nearest four bins. In the end, this resulted in a
feature with 64 coordinates.
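One way to implement the binning just described is sketched below. The names and the boundary handling (clamping in rho, wrapping in theta) are my choices, and the rotation of the patch to the feature orientation is again omitted.

```python
import numpy as np

def log_polar_descriptor(img, yc, xc, n_rho=8, n_theta=8, r0=1.0, growth=1.5):
    # Ring boundaries: the innermost ring has width r0 and each
    # successive ring is `growth` times wider than the previous one.
    widths = r0 * growth ** np.arange(n_rho)
    edges = np.concatenate([[0.0], np.cumsum(widths)])
    rmax = edges[-1]
    sums = np.zeros((n_rho, n_theta))
    weights = np.zeros((n_rho, n_theta))
    for y in range(max(int(yc - rmax), 0), min(int(yc + rmax) + 1, img.shape[0])):
        for x in range(max(int(xc - rmax), 0), min(int(xc + rmax) + 1, img.shape[1])):
            r = np.hypot(x - xc, y - yc)
            if r == 0 or r >= rmax:
                continue
            # Continuous (rho, theta) bin coordinates of this pixel
            k = np.searchsorted(edges, r, side='right') - 1
            rho = k + (r - edges[k]) / widths[k]
            theta = np.arctan2(y - yc, x - xc) % (2 * np.pi)
            t = theta / (2 * np.pi) * n_theta
            # Bilinear interpolation into the four nearest bins
            # (clamped in rho, wrapped around in theta)
            r_lo = int(np.floor(rho - 0.5)); wr = (rho - 0.5) - r_lo
            t_lo = int(np.floor(t - 0.5)); wt = (t - 0.5) - t_lo
            for ri, wi in ((r_lo, 1 - wr), (r_lo + 1, wr)):
                ri = min(max(ri, 0), n_rho - 1)
                for ti, wj in ((t_lo % n_theta, 1 - wt), ((t_lo + 1) % n_theta, wt)):
                    sums[ri, ti] += wi * wj * img[y, x]
                    weights[ri, ti] += wi * wj
    # Average the intensity collected in each bin
    return (sums / np.maximum(weights, 1e-12)).ravel()
```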
4. Distance transforms. A distance transform of a grayscale
image is a function over a 3D grid in (x, y, intensity) space where
each grid cell contains the distance to the closest point in the
image. The closest point can be defined in any number of ways, but
here I found the point with the smallest weighted Euclidean distance:
sqrt((x - x')^2 + (y - y')^2 + lambda * (i - i')^2)
where lambda=0.012 (this number was experimentally determined to give
good results).
Normally, distance transforms are used to compute the chamfer distance
between two point sets efficiently. However, I don't know if there
is a way to accelerate nearest neighbor queries under a chamfer
distance (such as building a kd-tree). Instead of matching image
patches by finding nearest neighbors under the chamfer distance, I
compare the distance transforms of the two patches directly, by
treating them as vectors and computing the Euclidean distance. I do
not have a good intuition about what it means to compare two distance
transforms, but my rough guess is that the distance between two
distance transforms is an approximation to the chamfer distance (or a
scaled version thereof). Actually, I tried the chamfer matching
technique, but it seemed to have worse performance than comparing the
distance transforms themselves. I'm not sure why this would be the
case; further investigation is required.
The reason why I thought to compare distance transforms is that the
chamfer distance between two images is often more robust to small
geometric image warps than the Euclidean distance. Therefore, I
expected that distance transform descriptors would handle small
discrepancies in patch location and orientation better than grayscale
patches.
To compute a distance transform feature around an (oriented) feature
location, I computed a distance transform for a 17x17 patch centered
around that location at twice the scale, after bias and gain
normalizing the patch. I used 3-bit color in the distance transform
(i.e., the distance transform grid was 17x17x8). I then used the
9x9x8 sub-grid in the center of the distance transform as the feature.
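A brute-force sketch of the distance transform computation is below. The quantization scheme and the application of lambda to level indices (rather than raw intensities) are my assumptions, and the function name is mine.

```python
import numpy as np

def distance_transform_3d(patch, n_levels=8, lam=0.012):
    # For every cell of the (x, y, intensity) grid, compute the weighted
    # distance to the nearest image point (x', y', patch[y', x']) under
    #   sqrt((x - x')^2 + (y - y')^2 + lam * (i - i')^2).
    # Intensities are quantized to n_levels levels ("3-bit color");
    # applying lam to level indices is one of several possible conventions.
    h, w = patch.shape
    lo, hi = patch.min(), patch.max()
    levels = np.clip(((patch - lo) / max(hi - lo, 1e-9) * n_levels).astype(int),
                     0, n_levels - 1)
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), levels.ravel()], axis=1).astype(float)
    gy, gx, gi = np.mgrid[0:h, 0:w, 0:n_levels]
    grid = np.stack([gx.ravel(), gy.ravel(), gi.ravel()], axis=1).astype(float)
    # Brute-force nearest point; fine at this size (17*17*8 cells)
    d2 = ((grid[:, None, 0] - pts[None, :, 0]) ** 2
          + (grid[:, None, 1] - pts[None, :, 1]) ** 2
          + lam * (grid[:, None, 2] - pts[None, :, 2]) ** 2)
    return np.sqrt(d2.min(axis=1)).reshape(h, w, n_levels)
```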
Feature matching
Finally, for the feature matching part of the project, I implemented
two methods for detecting correct matches: thresholding the distance
to the closest match, and thresholding the ratio of the distances to
the first and second closest matches (I used the ANN kd-tree
implementation [2] to speed up the nearest neighbor queries). The ratio
threshold test significantly outperformed the distance threshold test.
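The ratio test can be sketched as follows, with a brute-force nearest neighbor search standing in for the ANN kd-tree used in the actual project; the function name and threshold parameter are mine.

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.6):
    # For each descriptor in desc1, find its two nearest neighbors in
    # desc2 and keep the match only if the distance to the closest is
    # less than `ratio` times the distance to the second closest.
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]
        if dists[j1] < ratio * dists[j2]:
            matches.append((i, j1))
    return matches
```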
Benchmark performance
I compared the performance of the four feature descriptors and SIFT
[3] on the four benchmark sets. I ran two basic experiments.
Experiment 1. For each test set, I matched each feature in
image 1 to the closest feature in image k, and computed the percent of
matches whose ground truth error was less than 3 pixels. The results
are shown in the following four graphs. The index of the image the
first image is compared with runs along the x-axis, and the percent of
correct matches runs along the y-axis.
[Four graphs: percent of correct matches versus image index for the
Graf, Leuven, Bikes, and Wall sequences]
Generally, the log-polar patches and distance transforms had more
correct matches than the grayscale patches. This better performance
is likely attributable to their increased robustness to geometric
warps in the image, although further experiments are required to show
this more rigorously (it's unclear, for instance, why the log-polar
descriptor does much better on the wall sequence, but only slightly
better on the graf sequence). No descriptor did well for extreme shifts in
perspective, however (although this may be due to problems with
detection, as well). The color patches also performed better, except
on the Leuven test set. That set has significant intensity variation,
so the relatively poor performance there suggests that my
constant-chrominance assumption was not well-founded (as Rick
mentioned in class, cameras can make automatic
light temperature adjustments). The performance of SIFT varied quite
a bit: in the bikes sequence, it generally had the best performance
(it might be that gradient directions are more robust to blurring than
grayscale values), but in the wall sequence it had the lowest
performance.
Experiment 2. For the second test, I computed the same
percentage as in test 1, but only considered matches where the ratio
of the distances to the first and second closest matches was less than
0.6.
[Four graphs: percent of correct matches versus image index, after the
ratio test, for the Graf, Leuven, Bikes, and Wall sequences]
Using the ratio test to cull bad matches is quite effective, and
increased the percentage of correct matches significantly for all of
the descriptors. However, performance still fell sharply with
changing perspective.
Finally, here is a screenshot showing matching features (using the
distance transform descriptors) between two images of Half Dome in
Yosemite National Park (I tried to match all the features, then kept
the matches that passed the ratio test). The matches deemed correct
are shown in green.
The design space for feature descriptors is extremely large, and I
tried out just a few possibilities in this project. Though features
based on distance transforms performed fairly well, they are quite a
bit larger than intensity windows because of the added dimension, and
take longer to compute (the size could perhaps be reduced by
dimension-reduction techniques, however). The log-polar descriptors
also performed well, and are just as small as the rectangular
grayscale windows and quite quick to compute. However, more rigorous
experiments are needed to characterize the behaviour of these
descriptors more exactly. For panorama stitching, any of these
descriptors will likely work well under normal circumstances.
[1] M. Brown, R. Szeliski, and S. Winder. Multi-Image Matching Using
Multi-Scale Oriented Patches. CVPR 2005 (to appear).
[2] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu.
An Optimal Algorithm for Approximate Nearest Neighbor Searching.
Journal of the ACM, 45 (1998), 891-923.
[3] D. G. Lowe. Distinctive Image Features from Scale-Invariant
Keypoints. International Journal of Computer Vision, 60, 2 (2004),
pp. 91-110.