Project 1 Report: Feature Detection and Matching
CSE576, Spring 2008
Xiaoyu Chen
April 16, 2008
In this project, I developed a detector to identify discriminative features in an image, a descriptor to represent each feature of interest, and an approach to match features detected in two images. In particular, my algorithms were designed to detect features that are relatively invariant to translation, rotation, illumination, and scale.
Methods
Feature detector
I first used the Harris operator method. The details of the Harris operator are described on the project requirements webpage. The following is a summary of what I implemented:
- For each pixel in a given image, consider a detector window of size 5x5 around that pixel.
- Compute the Harris matrix H for that pixel. H is a weighted summation over all pixels in the detector window. Because I chose a 5x5 detector window, I used a 5x5 Gaussian mask as the weight matrix.
- Based on H, compute the corner strength function f = det(H) / trace(H).
- After calculating f for all pixels, label a pixel as “interested” if its f is a local maximum in a 3x3 neighborhood.
- Output the interested pixels whose f is above a given threshold as features.
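The steps above can be sketched as follows. This is a minimal Python/NumPy sketch, not my actual project code; the sigma of the Gaussian mask and the function names are my own choices for illustration.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2D Gaussian mask used to weight the detector window."""
    ax = np.arange(size) - size // 2
    g = np.exp(-ax**2 / (2.0 * sigma**2))
    k = np.outer(g, g)
    return k / k.sum()

def harris_strength(image, eps=1e-8):
    """Corner strength f = det(H) / trace(H) at every pixel, where H is
    the Gaussian-weighted sum of [Ix^2, IxIy; IxIy, Iy^2] over a 5x5
    window around the pixel."""
    img = np.asarray(image, dtype=float)
    Iy, Ix = np.gradient(img)
    w = gaussian_kernel(5)

    def window_sum(a):
        # Gaussian-weighted 5x5 sum, written as an explicit correlation.
        pad = np.pad(a, 2, mode="edge")
        out = np.zeros_like(a)
        for dy in range(5):
            for dx in range(5):
                out += w[dy, dx] * pad[dy:dy + a.shape[0], dx:dx + a.shape[1]]
        return out

    Sxx = window_sum(Ix * Ix)
    Syy = window_sum(Iy * Iy)
    Sxy = window_sum(Ix * Iy)
    return (Sxx * Syy - Sxy**2) / (Sxx + Syy + eps)

def detect_features(image, threshold=0.01):
    """Keep pixels whose f is a 3x3 local maximum and above threshold."""
    f = harris_strength(image)
    feats = []
    for y in range(1, f.shape[0] - 1):
        for x in range(1, f.shape[1] - 1):
            if f[y, x] > threshold and f[y, x] >= f[y-1:y+2, x-1:x+2].max():
                feats.append((y, x))
    return feats
```

On a synthetic image with a single step corner, the strength peaks near the corner and the flat regions score zero, which matches the intent of the detector.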
Although the method described above is translation invariant, relatively rotation invariant, and relatively illumination invariant, scale invariance remains an issue. To address this, I also tried automatic scale selection. The idea is simple: let the detector window vary among 5x5, 7x7, and 9x9, and label a pixel as “interested” only if its f is a local maximum both in a 3x3 neighborhood and across all three scales. This allows features to be detected over different scales.
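This selection rule can be sketched as follows, assuming the corner-strength maps for the three window sizes have already been computed; the function name and the map format are my own for illustration.

```python
import numpy as np

def multiscale_maxima(strength_maps, threshold):
    """Given corner-strength maps computed with the 5x5, 7x7, and 9x9
    detector windows, keep pixels whose strength is a 3x3 spatial local
    maximum at some scale AND the maximum over all scales at that pixel."""
    stack = np.stack([np.asarray(m, dtype=float) for m in strength_maps])
    best = stack.max(axis=0)   # best strength over scales, per pixel
    feats = []
    h, w = best.shape
    for s in range(stack.shape[0]):
        f = stack[s]
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                if (f[y, x] > threshold
                        and f[y, x] >= f[y-1:y+2, x-1:x+2].max()
                        and f[y, x] >= best[y, x]):
                    feats.append((y, x, s))
    return feats
```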
Feature descriptor
I first tried a naïve feature descriptor: simply the 5x5 window around a given feature. This naïve descriptor is translation invariant, but it does not account for rotation, illumination, or scale.
To make the feature descriptor rotation invariant, I adapted a method similar to the one introduced in MOPS. This method defines the descriptor according to the dominant orientation of a given feature, and employs subsampling coupled with Gaussian pre-filtering to allow a smooth rotation of the descriptor window. The details are as follows:
- Given a detected feature, estimate its dominant orientation by computing the smoothed local gradient. Specifically, the dominant orientation is estimated as the weighted summation of the gradient over all pixels in a local 5x5 window around the feature. The weight matrix is again the 5x5 Gaussian mask.
- Take a 40x40 patch around the detected feature, oriented along its dominant orientation.
- Apply a 5x5 Gaussian pre-filter to all pixels of the 40x40 patch.
- Scale the 40x40 patch down by a factor of 5, subsampling an 8x8 grid of pixels centered at the feature.
- Rotate the 8x8 window so it is horizontal, aligned with the positive direction of the x axis. The feature is finally represented by those 8x8 pixels in order.
To make the feature descriptor illumination invariant, given an 8x8 descriptor window, all intensities in the window are normalized by subtracting the mean and dividing by the standard deviation of the window.
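Putting the descriptor steps together, the following is a simplified sketch rather than the full method: nearest-neighbor sampling of the rotated grid stands in for the Gaussian pre-filter plus subsampling, the orientation uses the raw gradient at the feature instead of the weighted 5x5 sum, and the function name is my own.

```python
import numpy as np

def mops_like_descriptor(image, y, x, spacing=5, size=8):
    """Sample an 8x8 grid rotated to the feature's dominant orientation,
    5 pixels apart (covering a 40x40 patch), then normalize intensities
    to zero mean and unit standard deviation."""
    img = np.asarray(image, dtype=float)
    # Dominant orientation from the local gradient at the feature.
    gy, gx = np.gradient(img)
    theta = np.arctan2(gy[y, x], gx[y, x])
    c, s = np.cos(theta), np.sin(theta)
    desc = np.empty((size, size))
    half = (size - 1) / 2.0
    for i in range(size):
        for j in range(size):
            # Offset in the rotated patch frame, mapped back to image coords.
            u, v = (j - half) * spacing, (i - half) * spacing
            xs = int(round(x + c * u - s * v))
            ys = int(round(y + s * u + c * v))
            xs = min(max(xs, 0), img.shape[1] - 1)
            ys = min(max(ys, 0), img.shape[0] - 1)
            desc[i, j] = img[ys, xs]
    # Illumination invariance: subtract mean, divide by standard deviation.
    desc -= desc.mean()
    std = desc.std()
    return desc / std if std > 0 else desc
```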
Feature matching
After detecting and describing features, a method is required to compare and match features from two images. Given two feature descriptors, SSD is the sum of squared differences between elements of the two descriptors; this was already implemented in the distributed code. I implemented a second method, the ratio test. Given a feature F from an image A and all features from an image B, I first used SSD to compute the distance between F and each feature in image B. The metric is then the ratio between the distance to the best match for F and the distance to the second-best match. This metric helps reduce ambiguous matches.
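The two metrics can be sketched as follows; this is a minimal sketch in which the function names are mine, descriptors are plain arrays, and at least two candidates are assumed.

```python
import numpy as np

def ssd(d1, d2):
    """Sum of squared differences between two descriptors."""
    diff = np.asarray(d1, dtype=float) - np.asarray(d2, dtype=float)
    return float((diff * diff).sum())

def ratio_test_match(feature, candidates):
    """Match `feature` against all `candidates` (at least two) and return
    (index of best match, ratio of best SSD to second-best SSD).
    A ratio close to 1 means the runner-up is nearly as good, i.e. the
    match is ambiguous and can be rejected by thresholding the ratio."""
    dists = sorted((ssd(feature, c), i) for i, c in enumerate(candidates))
    (best, best_i), (second, _) = dists[0], dists[1]
    return best_i, (best / second if second > 0 else 0.0)
```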
Results
ROC curves
Below are two plots of ROC curves (you can click on the figures to zoom in). Refer to the legend to see which approach each color represents. The left plot was generated from the graf test pairs, and the right one from the Yosemite test pairs. The curves show that my descriptor works better than the naïve one, but worse than SIFT.
Images of the Harris operator
Below are the images for the Harris operator (you can click on the figures to access the .tga files). The first two are for graf, and the next two are for Yosemite.
Average AUC for the benchmark sets
I ran my program on the four provided benchmark sets, and report the average AUC and the pixel error for each set in the table below. In the ROC curves for graf (shown above), the ratio test outperforms SSD. However, in terms of average AUC over the benchmark sets, SSD performs better than the ratio test.
Set    | avgAUC (SSD) | avgAUC (ratio) | PixelErr
Leuven | 0.72         | 0.70           | 173.42
Wall   | 0.77         | 0.61           | 313.80
Bike   | 0.70         | 0.63           | 334.05
Graf   | 0.63         | 0.60           | 193.26
My own images
I also applied my approach to some images I took myself. Because the homographies of those images are unknown, I was not able to evaluate the performance of my algorithm with ROC curves and AUC. However, I ran my program to detect features in my images and loaded them into the GUI. The results suggest that my algorithm successfully detects discriminative features. Here I show two examples (you can click on the figures to zoom in).
Discussion
My approach includes three components that handle rotation invariance, illumination invariance, and scale invariance. I developed them as independent sub-functions that can be called with different options, and I performed experiments to evaluate the effect of each individual component. Rotation invariance and scale invariance in general improved performance. However, illumination invariance comes at the cost of losing some local information; I observed that it actually reduced performance when there were no illumination changes between images.
The feature threshold used to label a feature in the detector may vary from one application to another. When applying my approach to the benchmark sets, a threshold of 0.01 worked well for all sets except the bike set. Because no features were detected at threshold 0.01 for some bike images, I had to decrease the threshold to 0.0006 specifically for the bike set. It would therefore be helpful to have an adaptive threshold instead of a fixed one; for example, the algorithm could learn a threshold based on the number of features detected.
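One minimal sketch of such an adaptive scheme is shown below. It is entirely hypothetical: the halving schedule, the floor value, and the names are illustrative and not part of the project.

```python
def adaptive_threshold(strengths, target_count, t0=0.01, shrink=0.5, min_t=1e-6):
    """Hypothetical adaptive scheme: start from a default threshold and
    repeatedly shrink it until at least target_count corner strengths
    pass, or a floor is reached."""
    t = t0
    while t > min_t and sum(1 for s in strengths if s > t) < target_count:
        t *= shrink
    return t
```

With this scheme, images like those in the bike set, where few pixels exceed the default threshold, would automatically receive a lower threshold instead of requiring a manual override.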
Due to limited time, scale invariance was not considered in my descriptor. A scale-invariant descriptor would make the approach more robust.
Extra credit
I made my feature detector scale invariant by using automatic scale selection. Please refer to the feature detector section for details.