CSE576 - Project 1

Computer Vision (CSE576), Spring 2008

Project 1: Feature Detection and Matching

Rahul Garg

Overview of the Feature Detector:

I used the standard Harris corner detector as described on the project web page and in class. However, instead of using a hard threshold to detect local maxima in the Harris response of the image, I use an adaptive threshold set to µ+5*σ, where µ and σ are the mean and standard deviation of the Harris response of the image. This might adversely affect the repeatability of the detector, but it performed better than the hard threshold on the provided dataset.
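The adaptive threshold can be sketched as follows. This is a minimal illustration, not the project code: it assumes the Harris response is already available as a 2D list of floats, and the function name `adaptive_harris_peaks`, the 3x3 neighbourhood used for the local-maximum test, and the skipped one-pixel border are all assumptions made for the example.

```python
import statistics

def adaptive_harris_peaks(response, k=5.0):
    """Return (row, col) of local maxima of `response` above mean + k * std.

    `response` is a 2D list of Harris response values. The 3x3 neighbourhood
    and the skipped border are illustrative choices, not taken from the report.
    """
    flat = [v for row in response for v in row]
    # Adaptive threshold: mu + k * sigma over the whole response map.
    threshold = statistics.fmean(flat) + k * statistics.pstdev(flat)

    rows, cols = len(response), len(response[0])
    peaks = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            v = response[r][c]
            if v <= threshold:
                continue
            neighbours = [response[r + dr][c + dc]
                          for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                          if (dr, dc) != (0, 0)]
            # Keep the pixel only if it is strictly larger than all 8 neighbours.
            if all(v > n for n in neighbours):
                peaks.append((r, c))
    return peaks
```

Because the threshold depends on the statistics of each image, the same corner can pass in one image and fail in another, which is the repeatability concern mentioned above.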

Overview of the Descriptor:

I use a variant of the SIFT descriptor, the Rotation Invariant Feature Transform (RIFT), which was first proposed for the purpose of texture classification in [1]. However, the descriptor is not as popular in the field of object recognition.

RIFT is invariant to rotation, so there is no need to normalize for rotation first (SIFT normalizes by detecting the dominant gradient orientation).

[Figure: construction of the RIFT descriptor]

The above figure shows the process of building the RIFT descriptor. The region around the feature point is divided into concentric rings of equal width, and a gradient orientation histogram is built for each ring. All gradient orientations are measured relative to the line joining the point to the center. The contribution of a point is weighted by the gradient magnitude at that point and by the weighting factor exp(-3.33*r²/R²), where r is the distance of the point from the center and R is the radius of the outermost disc, which was chosen to be 40 pixels. Finally, all histograms are concatenated and the resulting vector is normalized to yield the final descriptor. In my implementation, I used 5 rings and quantized the orientations into 8 directions to yield a 40-dimensional descriptor. The numerical constants were determined experimentally, but finer tuning of the parameters may be possible.
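The construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project code: the function name `rift_descriptor`, the inputs `gx` and `gy` (2D lists of image gradients indexed as `[y][x]`), the use of orientation bins centered on multiples of 45 degrees, the L2 normalization, and the skipping of the center pixel (where the radial direction is undefined) are all choices made for the example.

```python
import math

def rift_descriptor(gx, gy, cx, cy, radius=40, n_rings=5, n_bins=8):
    """Build a RIFT-style descriptor around (cx, cy) from gradient maps gx, gy."""
    hist = [0.0] * (n_rings * n_bins)
    ring_width = radius / n_rings
    bin_width = 2.0 * math.pi / n_bins

    for y in range(max(0, cy - radius), min(len(gx), cy + radius + 1)):
        for x in range(max(0, cx - radius), min(len(gx[0]), cx + radius + 1)):
            dx, dy = x - cx, y - cy
            r = math.hypot(dx, dy)
            mag = math.hypot(gx[y][x], gy[y][x])
            if r == 0.0 or r >= radius or mag == 0.0:
                continue
            # Gradient orientation measured relative to the outward radial line.
            rel = math.atan2(gy[y][x], gx[y][x]) - math.atan2(dy, dx)
            b = int(round((rel % (2.0 * math.pi)) / bin_width)) % n_bins
            ring = int(r / ring_width)
            # Weight: gradient magnitude times exp(-3.33 * r^2 / R^2).
            w = mag * math.exp(-3.33 * r * r / (radius * radius))
            hist[ring * n_bins + b] += w

    # Concatenated ring histograms, normalized to unit length (assumed L2).
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist
```

Since every orientation is expressed relative to the radial direction at its own pixel, rotating the patch about the feature point moves pixels within a ring but leaves each ring's histogram unchanged, up to sampling effects.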

I tried improving the performance by changing the descriptor and the matching strategy, but none of the changes gave an overall improvement across the datasets. Some of the things I tried:

The gradients around the feature point may be divided into two (or more) sets depending on their magnitude. Descriptors may be built from the two sets independently and concatenated. This reduces the invariance of the descriptor. The idea gives a substantial improvement for the graf dataset but does not perform well across all datasets.
One can try to use the hue information from the image, since that would be orthogonal to the grayscale image used to calculate the original descriptor. For example, one can build a hue histogram for the region around the feature point. Again, it does not lead to an improvement, probably because hue is sensitive to noise.
I used the χ² distance to compare the histograms instead of the Euclidean distance. Again, it does not give any improvement.
I tried other norms for computing the distance between two feature vectors besides the L2 norm. The L1 norm gave performance similar to the L2 norm, but the L3 norm gave poorer results.
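For reference, the distance measures mentioned in the list above can be sketched as follows. The function names are hypothetical, and the χ² form shown (with a factor of 0.5 and empty bins skipped) is one common convention; the report does not say which variant was used.

```python
def chi2_distance(h1, h2):
    """Chi-squared distance between two histograms.

    Uses 0.5 * sum((a - b)^2 / (a + b)), skipping bins where both are zero.
    This convention is an assumption; other normalizations exist.
    """
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total

def lp_distance(u, v, p=2):
    """L_p distance between two vectors (p=1, 2, 3 correspond to L1, L2, L3)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)
```

Larger values of p put more emphasis on the single largest bin difference, which may be one reason the L3 norm behaved differently from L1 and L2.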

Strengths and Weaknesses of the descriptor:

The descriptor is rotation invariant and does not rely on finding the dominant gradient orientation, which can be error prone. However, the rotation invariance reduces the discriminative power of the descriptor: for example, the gradients in each of the rings may be shuffled by a different permutation and the descriptor would remain the same. The descriptor has far fewer dimensions than the SIFT descriptor (40 vs 128), which accelerates matching. However, the descriptor uses a window of size 80x80, so it may be somewhat expensive to compute. In my opinion, this is an abnormally large window size, but the best results were obtained using this value.

Results:

Dataset	Average AUC	Img2	Img3	Img4	Img5	Img6
Graf	0.60	0.90	0.52	0.57	0.49	0.52
Leuven	0.74	0.91	0.79	0.70	0.67	0.65
Bikes	0.65	0.94	0.76	0.59	0.53	0.45
Walls	0.66	0.94	0.75	0.61	0.52	0.49

The above figure shows input images and the corresponding Harris responses.

The ROC curves for the graf dataset (img1.ppm and img2.ppm)

The RIFT descriptor does relatively well but does not beat the SIFT descriptor overall. However, it does outperform the SIFT descriptor at low false positive rates. Also interesting is the fact that, unlike the SIFT descriptor, it does not benefit much from the ratio test. The window descriptor used is simply the intensity values in a window of size 7x7 centered on the feature point.
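For readers unfamiliar with the ratio test, a minimal sketch is given below. It is an illustration rather than the project code: the function name `ratio_test_matches` is hypothetical, and using SSD as the underlying distance is an assumption. A lower ratio means the best match is much closer than the runner-up and is therefore less ambiguous.

```python
def ratio_test_matches(desc_a, desc_b):
    """For each descriptor in desc_a, return (index_a, index_b, ratio).

    ratio = (SSD to best match) / (SSD to second-best match) in desc_b.
    """
    if len(desc_b) < 2:
        raise ValueError("ratio test needs at least two candidate descriptors")

    def ssd(u, v):
        # Sum of squared differences between two descriptors.
        return sum((a - b) ** 2 for a, b in zip(u, v))

    matches = []
    for i, d in enumerate(desc_a):
        dists = sorted((ssd(d, e), j) for j, e in enumerate(desc_b))
        (best, j), (second, _) = dists[0], dists[1]
        # If even the second-best distance is zero, the match is fully ambiguous.
        ratio = best / second if second > 0 else 1.0
        matches.append((i, j, ratio))
    return matches
```

Sweeping a threshold over the ratio (or over the plain SSD when the ratio test is not used) and counting true and false matches at each threshold is what produces an ROC curve.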

The ROC curves for the Yosemite dataset

The ROC curves for the Yosemite dataset show similar trends. It is interesting to note that the ratio test actually decreases the accuracy of the window descriptor. A possible explanation is that these two images have little variation between them, and SSD is a more accurate measure for a highly discriminative descriptor like the window descriptor.

A result on a couple of images I took is shown below (matched features in green). I would consider it one of the harder cases because of the large textured region (grass) and the dynamically changing scene (flowing water).

References:

[1] Lazebnik, S., Schmid, C., and Ponce, J. 2003. A sparse texture representation using affine-invariant regions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, USA, pp. 319-324.