CSE576 (Spring 2005) Project 1

Harsha V. Madhyastha

Objective: Devise mechanisms to detect, describe and match features in images

We convert the image to grayscale before performing any operations on it.
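A minimal sketch of this step, assuming the standard luminance weights (the report does not say which conversion was used):

    import numpy as np

    def to_grayscale(rgb):
        # Weighted sum of the R, G, B channels using the standard
        # luminance weights (an assumption; the report does not state
        # which conversion it uses).
        return rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114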

Detecting interest points

We use the Harris corner measure to detect interest points. Given an image, we compute the gradient along the X and Y directions at every point. The gradient along a direction is a simple forward difference: the value of the pixel under consideration is subtracted from that of the pixel adjacent to it along that direction. We then compute the 2x2 moments matrix M, whose entries are sums of gradient products over a small window around the point: M(0,0) = sum of Ix * Ix, M(0,1) = M(1,0) = sum of Ix * Iy, and M(1,1) = sum of Iy * Iy, where Ix and Iy are the gradients along the X and Y directions, respectively. (The summation over a window is essential: taken at a single pixel, M is rank one and its determinant is identically zero.) The Harris corner measure is then computed as determinant(M) divided by trace(M). We identify all points for which this measure is greater than some threshold; based on our empirical observations, a threshold of 1.5 works well. Finally, the points of interest are those whose Harris corner measure is above this threshold and constitutes a local maximum in the 21x21 window around the point. This is essentially the idea of non-maximal suppression from Brown and Szeliski's CVPR'05 paper on multi-scale oriented patches.
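The following sketch reflects our reading of the detector stage above; the size of the summation window is an assumption, as the report does not state it:

    import numpy as np
    from scipy.ndimage import maximum_filter, uniform_filter

    def harris_interest_points(gray, threshold=1.5, window=3, nms_size=21):
        gray = gray.astype(float)
        # Forward differences along X and Y, as described in the text.
        Ix = np.zeros_like(gray)
        Iy = np.zeros_like(gray)
        Ix[:, :-1] = gray[:, 1:] - gray[:, :-1]
        Iy[:-1, :] = gray[1:, :] - gray[:-1, :]
        # Entries of the moments matrix M, summed over a window x window
        # neighbourhood (window size is an assumption). uniform_filter
        # averages, so multiply back by the window area to get sums.
        Ixx = uniform_filter(Ix * Ix, size=window) * window ** 2
        Iyy = uniform_filter(Iy * Iy, size=window) * window ** 2
        Ixy = uniform_filter(Ix * Iy, size=window) * window ** 2
        # Harris corner measure: det(M) / trace(M).
        harris = (Ixx * Iyy - Ixy * Ixy) / (Ixx + Iyy + 1e-12)
        # Non-maximal suppression: keep points above the threshold that
        # are also the maximum of their nms_size x nms_size window.
        local_max = harris == maximum_filter(harris, size=nms_size)
        ys, xs = np.nonzero((harris > threshold) & local_max)
        return list(zip(xs.tolist(), ys.tolist()))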

Feature Descriptor

We first implemented an extremely simple feature descriptor: for each point of interest identified as above, we simply store the values of the pixels in the 9x9 window centered on that point. Matching two images with this descriptor can be expected to work reasonably well when one image is a translation of the other. However, it clearly will not cope with changes in intensity, rotation, changes in scale, and so on.
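A sketch of this simple descriptor (skipping points too close to the image border is an assumption; the report does not say how they are handled):

    import numpy as np

    def simple_descriptor(gray, points, radius=4):
        # 9x9 window of raw pixel values centred on each interest point,
        # flattened into an 81-dimensional vector.
        descs = []
        h, w = gray.shape
        for x, y in points:
            if radius <= x < w - radius and radius <= y < h - radius:
                patch = gray[y - radius:y + radius + 1, x - radius:x + radius + 1]
                descs.append(patch.flatten())
        return descs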

The second feature descriptor we implemented aims to address two of the transformations that the simple descriptor cannot handle: changes in intensity and rotation. Before tackling either, we first weed out noise by applying a smoothing filter to the image. The filter kernel is a 3x3 Gaussian, with a weight of 1/16 at each of the four corners, 1/4 at the center, and 1/8 elsewhere. To account for intensity changes, we subtract the mean value from each pixel and divide the result by the standard deviation, which ensures that the image we are handling has mean 0 and standard deviation 1.
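A sketch of the smoothing and normalization steps, using the 3x3 kernel given above:

    import numpy as np
    from scipy.ndimage import convolve

    # 3x3 Gaussian: 1/16 at the corners, 1/4 at the centre, 1/8 elsewhere.
    GAUSS_3x3 = np.array([[1, 2, 1],
                          [2, 4, 2],
                          [1, 2, 1]], dtype=float) / 16.0

    def smooth_and_normalize(gray):
        smoothed = convolve(gray.astype(float), GAUSS_3x3)
        # Zero mean and unit standard deviation, for invariance to
        # affine changes in intensity.
        return (smoothed - smoothed.mean()) / smoothed.std()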

Simply storing the 9x9 window of pixel values around the point of interest would still be susceptible to error under rotation. So, instead, we consider a 9x9 window whose orientation is decided by the direction of the gradient at the point of interest: we compute the gradient direction there and rotate the axes so that the Y axis increases along the direction of the gradient. The 9x9 window is then sampled with respect to the new X and Y axes.
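A sketch of the oriented window sampling; nearest-neighbour sampling of the rotated window is an assumption, as the report does not specify how off-grid pixels are handled:

    import numpy as np

    def oriented_descriptor(img, x, y, radius=4):
        h, w = img.shape
        # Gradient direction at the interest point, via forward
        # differences as in the detector.
        gx = img[y, min(x + 1, w - 1)] - img[y, x]
        gy = img[min(y + 1, h - 1), x] - img[y, x]
        theta = np.arctan2(gy, gx)
        c, s = np.cos(theta), np.sin(theta)
        desc = np.empty((2 * radius + 1) ** 2)
        i = 0
        for v in range(-radius, radius + 1):
            for u in range(-radius, radius + 1):
                # Map (u, v) in the rotated frame to image coordinates;
                # the new Y axis (+v) points along the gradient.
                px = int(round(x + u * s + v * c))
                py = int(round(y - u * c + v * s))
                px = min(max(px, 0), w - 1)
                py = min(max(py, 0), h - 1)
                desc[i] = img[py, px]
                i += 1
        return desc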

Matching features

Given a feature in one image, we employ two algorithms for determining whether a matching feature exists in the other image. In either algorithm, the distance metric used is simply the Euclidean distance between the two descriptor vectors.
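As a sketch only: the two matching algorithms are not detailed above, so the following assumes a common pairing for this kind of assignment, a nearest-neighbour matcher with a distance threshold and a ratio test against the second-best match. The function and parameter names (match_features, max_dist, max_ratio) are hypothetical:

    import numpy as np

    def match_features(descs1, descs2, max_dist=None, max_ratio=None):
        # For each descriptor in image 1, find its nearest and
        # second-nearest neighbours in image 2 by Euclidean distance,
        # then accept the match if the best distance is below max_dist
        # (algorithm 1) or if best/second-best is below max_ratio
        # (algorithm 2).
        matches = []
        for i, d1 in enumerate(descs1):
            dists = np.array([np.linalg.norm(d1 - d2) for d2 in descs2])
            j = int(np.argmin(dists))
            best = dists[j]
            second = np.partition(dists, 1)[1] if len(dists) > 1 else np.inf
            if max_dist is not None and best >= max_dist:
                continue
            if max_ratio is not None and best / (second + 1e-12) >= max_ratio:
                continue
            matches.append((i, j, best))
        return matches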

Benchmark tests

Benchmark set    Simple feature descriptor    Complex feature descriptor
bikes            302                          383
graf             287                          298
leuven           331                          145
wall             223                          219

(Lower scores are better.)

Based on the above benchmark results, it is hard to conclude that the complex feature descriptor is really better than the simple one; it may in fact be worse. The two descriptors perform comparably on the graf and wall datasets, whereas on the leuven dataset (which varies illumination) the complex descriptor performs considerably better. On the other hand, the simple descriptor outdoes the complex one on the bikes dataset (which varies focus). So it appears that our descriptor's handling of intensity changes paid off, but that it does not handle changes in focus or rotational transformations well. It is also questionable whether the results are good on the whole, irrespective of the descriptor used.

Strengths and Weaknesses

The strengths of our feature descriptor, as outlined previously, are:

- Gaussian smoothing reduces the effect of noise in the image.
- Normalizing the image to zero mean and unit standard deviation makes the descriptor invariant to affine changes in intensity.
- Orienting the sampling window along the gradient direction is intended to make the descriptor insensitive to rotation.

Here are some of the weaknesses of our feature descriptor, which could be the cause of the poor performance observed in the benchmark tests:

- It does nothing to handle changes in scale; the 9x9 sampling window has a fixed size.
- The window orientation is derived from the gradient at a single pixel, which is sensitive to noise, so the handling of rotation is fragile.
- Beyond a single 3x3 smoothing pass, nothing is done to cope with blur, which likely explains the result on the bikes dataset.

Test images

[Two photographs of a clock in different orientations, with the matched features overlaid]

We took a picture of a clock in two different orientations to see whether the feature matching manages to match up the corresponding numbers as well as the hands of the clock. It does not look like that turned out all that well! (For this test only, we determined points of interest based on local maxima in a 7x7 neighborhood rather than the 21x21 neighborhood used in our benchmark tests.)