SIFT
Guest Lecture by Jiwon Kim
http://www.cs.washington.edu/homes/jwkim/

SIFT Features and
Its Applications

Autostitch Demo

Autostitch
Fully automatic panorama generation
Input: set of images
Output: panorama(s)
Uses SIFT (Scale-Invariant Feature Transform) to find/align images

1. Solve for homography

2. Find connected sets of images
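As a minimal sketch of this step, assuming the pairwise image matches have already been verified (for example by the homography fit in step 1), connected sets can be found with a union-find structure. The function name `connected_image_sets` and the list-of-pairs input are illustrative choices, not taken from Autostitch.

```python
def connected_image_sets(num_images, matched_pairs):
    """Group image indices into connected sets given verified pairwise matches."""
    parent = list(range(num_images))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in matched_pairs:
        parent[find(a)] = find(b)  # merge the two sets

    groups = {}
    for i in range(num_images):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical input: images 0-1-2 overlap, 3-4 overlap, 5 matches nothing.
sets = connected_image_sets(6, [(0, 1), (1, 2), (3, 4)])
```

Each resulting set would be stitched into its own panorama, which is why the output may contain more than one panorama.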

3. Solve for camera parameters
New images initialised with rotation, focal length of best matching image

4. Blending the panorama
Burt & Adelson 1983
Blend frequency bands over range ∝ λ (blend width proportional to the wavelength of the band)
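The two-band idea can be sketched in one dimension: low frequencies are blended smoothly with the per-pixel weights, while high frequencies are taken from whichever image has the larger weight. This is a simplified illustration, not the Autostitch code. The box filter stands in for a Gaussian low-pass, the names `box_blur` and `two_band_blend` are hypothetical, and the weights are assumed to sum to a positive value at every sample.

```python
def box_blur(signal, radius):
    """Low-pass filter: average over a window, truncated at the borders."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def two_band_blend(a, b, wa, wb, radius=2):
    """Blend two overlapping 1D signals using a low and a high frequency band."""
    low_a, low_b = box_blur(a, radius), box_blur(b, radius)
    out = []
    for i in range(len(a)):
        high_a, high_b = a[i] - low_a[i], b[i] - low_b[i]
        # Low band: smooth weighted average over a wide transition.
        low = (wa[i] * low_a[i] + wb[i] * low_b[i]) / (wa[i] + wb[i])
        # High band: no averaging, take the image with the larger weight.
        high = high_a if wa[i] >= wb[i] else high_b
        out.append(low + high)
    return out


# Hypothetical overlap region: two flat signals with different exposure.
blended = two_band_blend([10.0] * 3, [20.0] * 3, [3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

Averaging only the low band hides exposure differences without ghosting fine detail, which is the visible difference between linear and 2-band blending.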

2-band Blending

Linear Blending

So, what is SIFT?
Scale-Invariant Feature Transform
David Lowe at UBC
Scale/rotation invariant
Currently the best-known feature descriptor
Many real-world applications
Object recognition
Panorama stitching
Robot localization
Video indexing

Example: object recognition

SIFT properties
Locality: features are local, so robust to occlusion and clutter
Distinctiveness: individual features can be matched to a large database of objects
Quantity: many features can be generated for even small objects
Efficiency: close to real-time performance

SIFT algorithm overview
Feature detection
Detect points that can be repeatably selected under location/scale change
Feature description
Assign orientation to detected feature points
Construct a descriptor for image patch around each feature point
Feature matching

1. Feature detection
Detect points stable under location/scale change
Build continuous space (x, y, scale)
Approximated by multi-scale Difference-of-Gaussian pyramid
Select maxima/minima in (x, y, scale)
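The extremum selection can be sketched as a comparison of each sample against its 26 neighbors in (x, y, scale). This sketch assumes the Difference-of-Gaussian stack has already been computed and is stored as nested lists indexed `dog[scale][row][col]`. The function names are illustrative.

```python
def is_scale_space_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly above or strictly below all 26 neighbors."""
    center = dog[s][y][x]
    neighbors = [
        dog[s + ds][y + dy][x + dx]
        for ds in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (ds, dy, dx) != (0, 0, 0)
    ]
    return center > max(neighbors) or center < min(neighbors)


def find_extrema(dog):
    """Scan all interior samples of the stack and collect extrema as (s, y, x)."""
    found = []
    for s in range(1, len(dog) - 1):
        for y in range(1, len(dog[s]) - 1):
            for x in range(1, len(dog[s][y]) - 1):
                if is_scale_space_extremum(dog, s, y, x):
                    found.append((s, y, x))
    return found


# Tiny 3x3x3 stack of zeros with a single peak in the middle.
dog = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
dog[1][1][1] = 1.0
extrema = find_extrema(dog)
```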

1. Feature detection
Localize extrema by fitting a quadratic
Sub-pixel/sub-scale interpolation using Taylor expansion
Take derivative and set to zero
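A one-dimensional version of this fit shows the idea. With the Taylor expansion D(x) ≈ D + D'x + ½D''x², setting the derivative to zero gives the offset x̂ = -D'/D''. The sketch below estimates D' and D'' by finite differences from three neighboring samples. The full algorithm does the same in three dimensions (x, y, scale) with a 3x3 Hessian. The function name is illustrative, and the three samples are assumed to bracket a genuine extremum so that D'' is nonzero.

```python
def refine_extremum_1d(prev, center, nxt):
    """Fit a parabola through three samples at positions -1, 0, +1.

    Returns (offset, value): the sub-sample position of the extremum
    relative to the center sample, and the interpolated value there.
    """
    d1 = (nxt - prev) / 2.0            # first derivative (central difference)
    d2 = prev - 2.0 * center + nxt     # second derivative
    offset = -d1 / d2                  # zero of the derivative of the quadratic
    value = center + 0.5 * d1 * offset
    return offset, value


# Samples of f(x) = -(x - 0.3)^2 at x = -1, 0, 1; the true peak is at x = 0.3.
f = lambda x: -(x - 0.3) ** 2
offset, value = refine_extremum_1d(f(-1.0), f(0.0), f(1.0))
```

The interpolated value is also what the low-contrast test in the next step is applied to.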

1. Feature detection
Discard low-contrast/edge points
Low contrast: discard keypoints with |D(x̂)| < threshold, where D(x̂) is the DoG value at the interpolated extremum
Edge points: high contrast in one direction, low in the other → compute principal curvatures from eigenvalues of 2x2 Hessian matrix, and limit their ratio
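The curvature-ratio test does not need the eigenvalues explicitly: if the eigenvalue ratio is r, then Tr(H)² / Det(H) = (r + 1)² / r, so the ratio can be bounded using only the trace and determinant. A minimal sketch follows, assuming the second derivatives of the DoG image at the keypoint are already available. The function name is illustrative, and the default r = 10 is the value reported in Lowe (2004).

```python
def passes_edge_test(dxx, dyy, dxy, r=10.0):
    """Keep a keypoint only if its principal curvature ratio is below r."""
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Curvatures have opposite signs (or one is zero): reject.
        return False
    return trace * trace / det < (r + 1.0) ** 2 / r


blob_like = passes_edge_test(-2.0, -2.0, 0.0)    # similar curvature both ways
edge_like = passes_edge_test(-20.0, -0.5, 0.0)   # strong in one direction only
```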

1. Feature detection
Example

2. Feature description
Create histogram of local gradient directions computed at selected scale
Assign canonical orientation at peak of smoothed histogram
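A simplified sketch of the orientation assignment: gradient directions around the keypoint are accumulated into a histogram weighted by gradient magnitude, the histogram is smoothed, and the peak gives the canonical orientation. The 36-bin histogram follows Lowe (2004). The Gaussian weighting of samples and the peak interpolation used there are omitted here, the [1, 2, 1] smoothing kernel is an assumption, and the function name is illustrative.

```python
import math


def dominant_orientation(gradients, num_bins=36):
    """gradients: iterable of (dx, dy). Returns the peak bin center in degrees."""
    hist = [0.0] * num_bins
    bin_width = 360.0 / num_bins
    for dx, dy in gradients:
        magnitude = math.hypot(dx, dy)
        angle = math.degrees(math.atan2(dy, dx)) % 360.0
        hist[int(angle // bin_width) % num_bins] += magnitude
    # Circular smoothing with a small [1, 2, 1] / 4 kernel.
    smoothed = [
        (hist[i - 1] + 2.0 * hist[i] + hist[(i + 1) % num_bins]) / 4.0
        for i in range(num_bins)
    ]
    peak = max(range(num_bins), key=lambda i: smoothed[i])
    return (peak + 0.5) * bin_width


# Hypothetical patch: most gradients point at 45 degrees, one at 0 degrees.
theta = dominant_orientation([(1.0, 1.0)] * 5 + [(1.0, 0.0)])
```

Rotating the patch by the negative of this angle before building the descriptor is what makes the descriptor rotation invariant.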

2. Feature description
Construct SIFT descriptor
Create array of orientation histograms
8 orientations x 4x4 histogram array = 128 dimensions
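The 128-dimensional layout can be sketched as follows, assuming a 16x16 grid of gradient samples around the keypoint whose orientations are already expressed relative to the canonical orientation. This is a simplified illustration: the Gaussian weighting, the interpolation between neighboring bins, and the clamping of large values described in Lowe (2004) are left out, and the function name is hypothetical.

```python
import math


def sift_like_descriptor(magnitude, orientation_deg):
    """Build a 4x4 cells x 8 orientations = 128-D vector from 16x16 samples."""
    desc = [0.0] * 128
    for y in range(16):
        for x in range(16):
            cell = (y // 4) * 4 + (x // 4)  # which of the 4x4 cells
            obin = int((orientation_deg[y][x] % 360.0) // 45.0) % 8  # 45-degree bins
            desc[cell * 8 + obin] += magnitude[y][x]
    # Normalize to unit length to reduce the effect of contrast changes.
    norm = math.sqrt(sum(v * v for v in desc))
    return [v / norm for v in desc] if norm > 0 else desc


# Hypothetical patch: unit gradients, all pointing along 0 degrees.
mag = [[1.0] * 16 for _ in range(16)]
ori = [[0.0] * 16 for _ in range(16)]
descriptor = sift_like_descriptor(mag, ori)
```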

2. Feature description
Advantage over simple correlation
Gradients less sensitive to illumination change
Gradients may shift: robust to deformation, viewpoint change

Performance: stability to noise
Match features after random change in image scale & orientation, with differing levels of image noise
Find nearest neighbor in database of 30,000 features

Performance: stability to affine change
Match features after random change in image scale & orientation, with 2% image noise, and affine distortion
Find nearest neighbor in database of 30,000 features

Performance: distinctiveness
Vary size of database of features, with 30 degree affine change, 2% image noise
Measure % correct for single nearest neighbor match

3. Feature matching
For each feature in A, find nearest neighbor in B

3. Feature matching
Exact nearest neighbor search is too slow for a large database of 128-dimensional data
Approximate nearest neighbor search:
Best-bin-first [Beis et al. 97]: modification to k-d tree algorithm
Use heap data structure to identify bins in order by their distance from query point
Result: Can give speedup by factor of 1000 while finding nearest neighbor (of interest) 95% of the time
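The heap-ordered search can be sketched as below. This is a simplified illustration of the best-bin-first idea, not the implementation from Beis et al.: the priority of an unexplored branch is its squared distance to the query along the splitting dimension only, and the search stops after roughly `max_checks` points have been examined. All function names and the dictionary-based tree layout are assumptions.

```python
import heapq
import itertools
import random


def build_kd_tree(points, depth=0):
    """Recursively split points on the median, cycling through dimensions."""
    if not points:
        return None
    dim = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[dim])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "dim": dim,
        "left": build_kd_tree(pts[:mid], depth + 1),
        "right": build_kd_tree(pts[mid + 1:], depth + 1),
    }


def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def bbf_nearest(root, query, max_checks=200):
    """Approximate nearest neighbor: explore the closest branches first."""
    best, best_d2 = None, float("inf")
    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(0.0, next(tie), root)]
    checks = 0
    while heap and checks < max_checks:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_d2:
            break  # no remaining branch can contain a closer point
        while node is not None:
            d2 = squared_distance(node["point"], query)
            checks += 1
            if d2 < best_d2:
                best, best_d2 = node["point"], d2
            diff = query[node["dim"]] - node["point"][node["dim"]]
            near, far = ("left", "right") if diff < 0 else ("right", "left")
            if node[far] is not None:
                # Remember the branch not taken, keyed by its distance bound.
                heapq.heappush(heap, (diff * diff, next(tie), node[far]))
            node = node[near]
    return best, best_d2


# Hypothetical database of 500 random 8-D points.
random.seed(0)
database = [tuple(random.random() for _ in range(8)) for _ in range(500)]
tree = build_kd_tree(database)
query = tuple(random.random() for _ in range(8))
approx_point, approx_d2 = bbf_nearest(tree, query, max_checks=50)
```

With a very large `max_checks` the search is exact. Lowering it trades accuracy for speed, which is the trade-off behind the speedup figure above.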

3. Feature matching
Reject false matches
Compare distance of nearest neighbor to second nearest neighbor
Features that occur commonly are not distinctive, so their matches are unreliable
A distance ratio (nearest / second nearest) threshold of 0.8 provides excellent separation
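A brute-force sketch of matching with the ratio test: for each feature in A, find its two nearest neighbors in B and keep the match only if the nearest is clearly closer than the second nearest. The function name and the small 2-D feature vectors are illustrative. Real SIFT descriptors are 128-dimensional, and a large database would use an approximate search instead of the exhaustive loop.

```python
import math


def match_with_ratio_test(features_a, features_b, ratio=0.8):
    """Return (index_a, index_b) pairs whose nearest neighbor passes the ratio test."""
    matches = []
    for i, fa in enumerate(features_a):
        dists = sorted((math.dist(fa, fb), j) for j, fb in enumerate(features_b))
        if len(dists) < 2:
            continue  # the test needs a second neighbor to compare against
        (d1, j1), (d2, _) = dists[0], dists[1]
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches


# Hypothetical features: the first has one clear match,
# the second is equally close to two candidates and is rejected.
features_a = [(0.0, 0.0), (5.0, 5.0)]
features_b = [(0.1, 0.0), (10.0, 10.0), (4.0, 5.0), (6.0, 5.0)]
matches = match_with_ratio_test(features_a, features_b)
```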

3. Feature matching
Now, given feature matches…
Find an object in the scene
Solve for homography (panorama)

3. Feature matching
Example: 3D object recognition

3. Feature matching
3D object recognition
Assume affine transform: clusters of size >=3
Looking for 3 matches out of 3000 that agree on same object and pose: too many outliers for RANSAC or LMS (least median of squares)
Use Hough Transform
Each match votes for a hypothesis for object ID/pose
Voting for multiple bins & large bin size allow for error due to similarity approximation
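The voting step can be sketched with a dictionary of pose bins. Each match is assumed to have been converted into a pose hypothesis (object ID, rotation, scale ratio, location) beforehand. The 30-degree orientation bins and factor-of-2 scale bins follow Lowe (2004), while the 50-pixel location bin is an arbitrary placeholder. For brevity each match votes for a single bin here, whereas the slides describe voting for multiple neighboring bins. The function name is hypothetical.

```python
import math
from collections import defaultdict


def hough_pose_clusters(matches, loc_bin=50.0, min_votes=3):
    """matches: list of (object_id, rotation_deg, scale_ratio, x, y) hypotheses.

    Returns lists of match indices that fall into the same pose bin.
    """
    bins = defaultdict(list)
    for index, (obj, rot, scale, x, y) in enumerate(matches):
        key = (
            obj,
            int((rot % 360.0) // 30.0) % 12,  # 30-degree orientation bins
            math.floor(math.log2(scale)),     # factor-of-2 scale bins
            math.floor(x / loc_bin),          # coarse location bins
            math.floor(y / loc_bin),
        )
        bins[key].append(index)
    return [members for members in bins.values() if len(members) >= min_votes]


# Hypothetical hypotheses: three consistent votes for "cup", two stray ones.
votes = [
    ("cup", 10.0, 1.1, 12.0, 20.0),
    ("cup", 15.0, 1.3, 30.0, 25.0),
    ("cup", 20.0, 1.0, 40.0, 10.0),
    ("cup", 200.0, 1.2, 20.0, 20.0),
    ("book", 12.0, 1.1, 15.0, 22.0),
]
clusters = hough_pose_clusters(votes)
```

Only clusters with at least 3 votes go on to the affine pose fit.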

3. Feature matching
3D object recognition: solve for pose
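Affine transform of [x,y] to [u,v]: u = m1*x + m2*y + tx, v = m3*x + m4*y + ty
Rewrite to solve for transform parameters: stack the equations for all matches into A p = b with p = [m1 m2 m3 m4 tx ty], and take the least-squares solution p = (A^T A)^-1 A^T b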
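A minimal least-squares sketch for the pose fit, assuming the standard affine model u = m1*x + m2*y + tx, v = m3*x + m4*y + ty. Because the u and v equations share no parameters, the sketch solves two 3x3 normal-equation systems rather than one stacked 6-parameter system. The two forms give the same least-squares solution. The function names are illustrative, and at least 3 non-collinear matches are assumed.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, rhs):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(m)
    solution = []
    for col in range(3):
        replaced = [[rhs[r] if c == col else m[r][c] for c in range(3)]
                    for r in range(3)]
        solution.append(det3(replaced) / d)
    return solution


def fit_affine(model_pts, image_pts):
    """Least-squares affine fit mapping model (x, y) to image (u, v)."""
    rows = [(x, y, 1.0) for x, y in model_pts]
    # Normal equations: (A^T A) p = A^T b, with A rows [x, y, 1].
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atu = [sum(r[i] * u for r, (u, _) in zip(rows, image_pts)) for i in range(3)]
    atv = [sum(r[i] * v for r, (_, v) in zip(rows, image_pts)) for i in range(3)]
    m1, m2, tx = solve3(ata, atu)
    m3, m4, ty = solve3(ata, atv)
    return (m1, m2, m3, m4, tx, ty)


# Hypothetical matches generated from m = [[2, 0.5], [-0.5, 2]], t = (3, 4).
model = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
image = [(3.0, 4.0), (5.0, 3.5), (3.5, 6.0), (5.5, 5.5)]
params = fit_affine(model, image)
```

Matches whose residual under the fitted transform is large can then be discarded as outliers before the verification step.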

3. Feature matching
3D object recognition: verify model
Discard outliers for pose solution in prev step
Perform top-down check for additional features
Evaluate probability that match is correct
Use Bayesian model, with probability that features would arise by chance if object was not present
Takes account of object size in image, textured regions, model feature count in database, accuracy of fit [Lowe 01]

Planar recognition
Training images

3D object recognition
Training images

3D object recognition
Only 3 keys are needed for recognition, so extra keys provide robustness
Affine model is no longer as accurate

Recognition under occlusion

Illumination invariance

Applications of SIFT
Object recognition
Panoramic image stitching
Robot localization
Video indexing
The Office of the Past
Document tracking and recognition

Location recognition

Robot Localization

Map continuously built over time

Locations of map features in 3D

The Office of the Past
Paper everywhere

Unify physical and electronic desktops
Recognize video of paper on physical desktop
Tracking
Recognition
Linking

Unify physical and electronic desktops
Applications
Find lost documents
Browse remote desktop
Find electronic version
History-based queries

Example input video

Demo – Remote desktop

System overview

Assumptions
Document
Corresponding electronic copy exists
No duplicates of same document
Motion
3 event types: move/entry/exit
One document at a time
Only topmost document can move

Non-assumptions
Desk need not be initially empty
Stacks may overlap

Algorithm overview

Document tracking example

Document Recognition
Match against PDF image database

Document Recognition
Performance analysis
Tested 20 pages against database of 162 pages
~200x300 pixels per document for reliable match

Results
Input video
~40 minutes
1024x768 @ 15 fps
22 documents, 49 events
Running time
Video processed offline
No optimization
A few hours for entire video

Demo – Paper tracking

Photo sorting example

Demo – Photo sorting

Future work
Enhance realism
Handle more realistic desktops
Real-time performance
More applications
Support other document tasks
E.g., attach reminder, cluster documents
Beyond documents
Other 3D desktop objects, books/CDs

Summary
SIFT is:
Scale/rotation invariant local feature
Highly distinctive
Robust to occlusion, illumination change, 3D viewpoint change
Efficient (close to real-time performance)
Suitable for many useful applications

References
Distinctive image features from scale-invariant keypoints
David G. Lowe, International Journal of Computer Vision, 60, 2 (2004), pp. 91-110
Recognising panoramas
Matthew Brown and David G. Lowe, International Conference on Computer Vision (ICCV 2003), Nice, France (October 2003), pp. 1218-25.
Video-Based Document Tracking: Unifying Your Physical and Electronic Desktops
Jiwon Kim, Steven M. Seitz and Maneesh Agrawala, ACM Symposium on User Interface Software and Technology (UIST 2004), pp. 99-107.