|
1
|
- Guest Lecture by Jiwon Kim
- http://www.cs.washington.edu/homes/jwkim/
|
|
2
|
|
|
3
|
|
|
4
|
- Fully automatic panorama generation
- Input: set of images
- Output: panorama(s)
- Uses SIFT (Scale-Invariant Feature Transform) to find/align images
|
|
5
|
|
|
6
|
|
|
7
|
|
|
8
|
|
|
9
|
|
|
10
|
|
|
11
|
- New images initialised with rotation, focal length of best matching
image
|
|
12
|
- New images initialised with rotation, focal length of best matching
image
|
|
13
|
- Burt & Adelson 1983
- Blend frequency bands over range ∝ λ
|
|
14
|
|
|
15
|
|
|
16
|
|
|
17
|
- Scale-Invariant Feature Transform
- David Lowe at UBC
- Scale/rotation invariant
- Currently best known feature descriptor
- Many real-world applications
- Object recognition
- Panorama stitching
- Robot localization
- Video indexing
- …
|
|
18
|
|
|
19
|
- Locality: features are local, so robust to occlusion and clutter
- Distinctiveness: individual features can be matched to a large database
of objects
- Quantity: many features can be generated for even small objects
- Efficiency: close to real-time performance
|
|
20
|
- Feature detection
- Detect points that can be repeatably selected under location/scale
change
- Feature description
- Assign orientation to detected feature points
- Construct a descriptor for image patch around each feature point
- Feature matching
|
|
21
|
- Detect points stable under location/scale change
- Build continuous space (x, y, scale)
- Approximated by multi-scale Difference-of-Gaussian pyramid
- Select maxima/minima in (x, y, scale)
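A minimal NumPy sketch of these two steps, assuming `scipy.ndimage.gaussian_filter` for the blurring; the sigma list and the brute-force neighborhood scan are simplifications of the real octave-based pyramid:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_extrema(img, sigmas=(1.0, 1.6, 2.56, 4.1)):
    """Find (x, y, scale) extrema in a Difference-of-Gaussian stack.
    Simplified: one octave, fixed sigma list, exhaustive 3x3x3 scan."""
    blurred = [gaussian_filter(img.astype(float), s) for s in sigmas]
    dog = np.stack([b1 - b0 for b0, b1 in zip(blurred, blurred[1:])])
    keypoints = []
    for s in range(1, dog.shape[0] - 1):           # interior scales only
        for y in range(1, dog.shape[1] - 1):
            for x in range(1, dog.shape[2] - 1):
                patch = dog[s-1:s+2, y-1:y+2, x-1:x+2]
                v = dog[s, y, x]
                if v == patch.max() or v == patch.min():
                    keypoints.append((x, y, s))    # extremum in (x, y, scale)
    return keypoints
```

A blob whose size matches one of the middle scales produces an extremum at that scale, which is exactly the scale-selection behavior the pyramid is built for.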
|
|
22
|
|
|
23
|
- Localize extrema by fitting a quadratic
- Sub-pixel/sub-scale interpolation using Taylor expansion
- Take derivative and set to zero
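The interpolation step can be sketched directly: fit D(x) ≈ D + gᵀx + ½xᵀHx over the 3x3x3 neighborhood, set the derivative to zero, and solve x̂ = -H⁻¹g (central differences for g and H; array layout `d[s, y, x]` is an assumption of this sketch):

```python
import numpy as np

def refine_offset(d):
    """Sub-pixel/sub-scale offset for the center of a 3x3x3 DoG
    neighborhood d[s, y, x], via the Taylor-expansion fit."""
    g = np.array([                     # central-difference gradient
        (d[2,1,1] - d[0,1,1]) / 2,     # dD/ds
        (d[1,2,1] - d[1,0,1]) / 2,     # dD/dy
        (d[1,1,2] - d[1,1,0]) / 2,     # dD/dx
    ])
    H = np.empty((3, 3))               # central-difference Hessian
    H[0,0] = d[2,1,1] - 2*d[1,1,1] + d[0,1,1]
    H[1,1] = d[1,2,1] - 2*d[1,1,1] + d[1,0,1]
    H[2,2] = d[1,1,2] - 2*d[1,1,1] + d[1,1,0]
    H[0,1] = H[1,0] = (d[2,2,1] - d[2,0,1] - d[0,2,1] + d[0,0,1]) / 4
    H[0,2] = H[2,0] = (d[2,1,2] - d[2,1,0] - d[0,1,2] + d[0,1,0]) / 4
    H[1,2] = H[2,1] = (d[1,2,2] - d[1,2,0] - d[1,0,2] + d[1,0,0]) / 4
    return -np.linalg.solve(H, g)      # (ds, dy, dx) offset from center
```

For a sampled quadratic the central differences are exact, so the recovered offset is the true peak location.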
|
|
24
|
- Discard low-contrast/edge points
- Low contrast: discard keypoints with contrast below a threshold
- Edge points: high contrast in one direction, low in the other → compute principal curvatures
from the eigenvalues of the 2x2 Hessian matrix, and limit their ratio
|
|
25
|
|
|
26
|
- Create histogram of local gradient directions computed at selected
scale
- Assign canonical orientation at peak of smoothed histogram
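A sketch of the orientation assignment, without the Gaussian weighting and parabolic peak interpolation the full method adds:

```python
import numpy as np

def dominant_orientation(patch, nbins=36):
    """Histogram of gradient directions over a patch, weighted by
    gradient magnitude; return the angle of the peak bin."""
    dy, dx = np.gradient(patch.astype(float))
    mag = np.hypot(dx, dy)
    ang = np.degrees(np.arctan2(dy, dx)) % 360.0
    hist, edges = np.histogram(ang, bins=nbins, range=(0, 360), weights=mag)
    peak = hist.argmax()
    return (edges[peak] + edges[peak + 1]) / 2.0   # bin-center angle
```

All later descriptor gradients are measured relative to this canonical angle, which is what makes the descriptor rotation invariant.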
|
|
27
|
- Construct SIFT descriptor
- Create array of orientation histograms
- 8 orientations x 4x4 histogram array = 128 dimensions
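The 128-D layout can be sketched as follows (a 16x16 patch split into a 4x4 grid of cells; the Gaussian weighting and trilinear interpolation of the full descriptor are omitted):

```python
import numpy as np

def sift_descriptor(patch):
    """128-D descriptor from a 16x16 patch: 4x4 cells, each an 8-bin
    gradient-orientation histogram, concatenated and normalized."""
    assert patch.shape == (16, 16)
    dy, dx = np.gradient(patch.astype(float))
    mag = np.hypot(dx, dy)
    ang = np.degrees(np.arctan2(dy, dx)) % 360.0
    desc = np.zeros((4, 4, 8))
    for cy in range(4):
        for cx in range(4):
            sl = np.s_[4*cy:4*cy+4, 4*cx:4*cx+4]
            hist, _ = np.histogram(ang[sl], bins=8, range=(0, 360),
                                   weights=mag[sl])
            desc[cy, cx] = hist
    desc = desc.ravel()                  # 4 * 4 * 8 = 128 dimensions
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc   # normalize: illumination invariance
```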
|
|
28
|
- Advantage over simple correlation
- Gradients less sensitive to illumination change
- Gradients may shift: robust to deformation, viewpoint change
|
|
29
|
- Match features after random change in image scale & orientation,
with differing levels of image noise
- Find nearest neighbor in database of 30,000 features
|
|
30
|
- Match features after random change in image scale & orientation,
with 2% image noise, and affine distortion
- Find nearest neighbor in database of 30,000 features
|
|
31
|
- Vary size of database of features, with 30 degree affine change, 2%
image noise
- Measure % correct for single nearest neighbor match
|
|
32
|
- For each feature in A, find nearest neighbor in B
|
|
33
|
- Exact nearest-neighbor search is too slow for a large database of 128-dimensional
data
- Approximate nearest neighbor search:
- Best-bin-first [Beis et al. 97]: modification to k-d tree algorithm
- Use heap data structure to identify bins in order by their distance
from query point
- Result: can give a speedup by a factor of 1000 while finding the true nearest
neighbor 95% of the time
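The idea can be sketched with a small k-d tree whose unexplored branches are kept in a heap ordered by their distance to the query; capping the number of node visits is what makes the search approximate (this toy version uses 2-D points and squared Euclidean distance):

```python
import heapq

class KDNode:
    __slots__ = ("point", "axis", "left", "right")
    def __init__(self, point, axis, left, right):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kdtree(points, depth=0):
    """Standard k-d tree: split on median along cycling axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(points[mid], axis,
                  build_kdtree(points[:mid], depth + 1),
                  build_kdtree(points[mid + 1:], depth + 1))

def bbf_nearest(root, query, max_checks=200):
    """Best-bin-first: pop bins from a heap in order of their distance to
    the query; stop after max_checks node visits (approximate NN)."""
    best, best_d = None, float("inf")
    heap = [(0.0, 0, root)]            # (bin distance, tiebreak, node)
    counter, checks = 1, 0
    while heap and checks < max_checks:
        _, _, node = heapq.heappop(heap)
        while node is not None:
            checks += 1
            d = sum((a - b) ** 2 for a, b in zip(node.point, query))
            if d < best_d:
                best, best_d = node.point, d
            diff = query[node.axis] - node.point[node.axis]
            near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
            if far is not None:        # queue the far bin by its distance
                heapq.heappush(heap, (diff * diff, counter, far))
                counter += 1
            node = near                # follow the near bin immediately
    return best
```

With `max_checks` larger than the tree, the search is exact; shrinking it trades accuracy for the large speedups quoted above.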
|
|
34
|
- Reject false matches
- Compare distance of nearest neighbor to second nearest neighbor
- Common (ambiguous) features are not distinctive: their nearest and second-nearest distances are similar, so the match is unreliable
- Threshold of 0.8 provides excellent separation
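The ratio test in code (brute-force distances for clarity; function name is mine):

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, threshold=0.8):
    """Match each descriptor in desc_a to its nearest neighbor in desc_b,
    keeping the match only if nearest < threshold * second-nearest."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j, k = np.argsort(dists)[:2]          # nearest, second nearest
        if dists[j] < threshold * dists[k]:
            matches.append((i, int(j)))
    return matches
```

A query equidistant from two database features fails the test and is dropped, which is exactly the "common features are bad" criterion.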
|
|
35
|
- Now, given feature matches…
- Find an object in the scene
- Solve for homography (panorama)
- …
|
|
36
|
- Example: 3D object recognition
|
|
37
|
- 3D object recognition
- Assume affine transform: clusters of size >=3
- Looking for 3 matches out of 3000 that agree on same object and pose:
too many outliers for RANSAC or LMS
- Use Hough Transform
- Each match votes for a hypothesis for object ID/pose
- Voting for multiple bins & large bin size allow for error due to
similarity approximation
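A toy sketch of the voting step, assuming each match has already been converted to a pose hypothesis (model id, translation, rotation, scale change); the full scheme also votes for the neighboring bins in each dimension, which this sketch omits:

```python
import numpy as np
from collections import defaultdict

def hough_pose_clusters(matches, loc_bin=0.25, ori_bin=30.0, scale_base=2.0):
    """Coarse Hough voting over pose: each match (model_id, dx, dy,
    dtheta, dscale) votes for one bin in a (model, location,
    orientation, log-scale) grid; broad bins absorb the error of the
    similarity approximation. Returns bins with >= 3 votes."""
    votes = defaultdict(list)
    for m_id, dx, dy, dtheta, dscale in matches:
        key = (m_id,
               round(dx / loc_bin), round(dy / loc_bin),     # location bins
               round(dtheta / ori_bin) % int(360 / ori_bin), # orientation bin
               int(round(np.log(dscale) / np.log(scale_base))))  # scale bin
        votes[key].append((dx, dy, dtheta, dscale))
    return {k: v for k, v in votes.items() if len(v) >= 3}
```

Three mutually consistent matches land in one bin and survive; a lone outlier never reaches the 3-vote minimum, which is why this works where RANSAC would drown in outliers.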
|
|
38
|
- 3D object recognition: solve for pose
- Affine transform of [x,y] to [u,v]:
- Rewrite to solve for transform parameters:
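The rewrite stacks two rows per match into a linear system for the six affine parameters p = (m1, m2, m3, m4, tx, ty): [x y 0 0 1 0]·p = u and [0 0 x y 0 1]·p = v. A least-squares sketch:

```python
import numpy as np

def solve_affine(xy, uv):
    """Least-squares affine transform mapping points xy -> uv:
    [u; v] = [[m1, m2], [m3, m4]] [x; y] + [tx; ty].
    Needs >= 3 matches (each gives 2 equations, 6 unknowns total)."""
    rows, b = [], []
    for (x, y), (u, v) in zip(xy, uv):
        rows.append([x, y, 0, 0, 1, 0]); b.append(u)
        rows.append([0, 0, x, y, 0, 1]); b.append(v)
    p, *_ = np.linalg.lstsq(np.array(rows, float), np.array(b, float),
                            rcond=None)
    return p[:4].reshape(2, 2), p[4:]   # linear part M, translation t
```

With more than 3 matches the extra rows are used in the least-squares sense, which is also what the verification step relies on when discarding outliers.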
|
|
39
|
- 3D object recognition: verify model
- Discard outliers from the pose solution of the previous step
- Perform top-down check for additional features
- Evaluate probability that match is correct
- Use a Bayesian model, with the probability that the features would arise by
chance if the object were not present
- Takes account of object size in image, textured regions, model feature
count in database, accuracy of fit [Lowe 01]
|
|
40
|
|
|
41
|
|
|
42
|
|
|
43
|
- Only 3 keys are needed for recognition, so extra keys provide robustness
- Affine model is no longer as accurate
|
|
44
|
|
|
45
|
|
|
46
|
- Object recognition
- Panoramic image stitching
- Robot localization
- Video indexing
- …
- The Office of the Past
- Document tracking and recognition
|
|
47
|
|
|
48
|
|
|
49
|
|
|
50
|
|
|
51
|
|
|
52
|
|
|
53
|
- Track and recognize paper documents in video of a physical desktop
- Tracking
- Recognition
- Linking
|
|
54
|
- Applications
- Find lost documents
- Browse remote desktop
- Find electronic version
- History-based queries
|
|
55
|
|
|
56
|
|
|
57
|
|
|
58
|
|
|
59
|
|
|
60
|
|
|
61
|
|
|
62
|
|
|
63
|
|
|
64
|
|
|
65
|
- Document
- Corresponding electronic copy exists
- No duplicates of same document
|
|
66
|
- Document
- Corresponding electronic copy exists
- No duplicates of same document
- Motion
- 3 event types: move/entry/exit
- One document at a time
- Only topmost document can move
|
|
67
|
- Desk need not be initially empty
|
|
68
|
- Desk need not be initially empty
- Stacks may overlap
|
|
69
|
|
|
70
|
|
|
71
|
|
|
72
|
|
|
73
|
|
|
74
|
|
|
75
|
|
|
76
|
|
|
77
|
|
|
78
|
|
|
79
|
|
|
80
|
|
|
81
|
|
|
82
|
|
|
83
|
|
|
84
|
|
|
85
|
- Match against PDF image database
|
|
86
|
- Performance analysis
- Tested 20 pages against database of 162 pages
|
|
87
|
- Performance analysis
- Tested 20 pages against database of 162 pages
- ~200x300 pixels per document for reliable match
|
|
88
|
- Performance analysis
- Tested 20 pages against database of 162 pages
- ~200x300 pixels per document for reliable match
|
|
89
|
- Input video
- ~40 minutes
- 1024x768 @ 15 fps
- 22 documents, 49 events
- Running time
- Video processed offline
- No optimization
- A few hours for entire video
|
|
90
|
|
|
91
|
|
|
92
|
|
|
93
|
|
|
94
|
- Enhance realism
- Handle more realistic desktops
- Real-time performance
- More applications
- Support other document tasks
- E.g., attach reminder, cluster documents
- Beyond documents
- Other 3D desktop objects, e.g., books/CDs
|
|
95
|
- SIFT is:
- Scale/rotation invariant local feature
- Highly distinctive
- Robust to occlusion, illumination change, 3D viewpoint change
- Efficient (real-time performance)
- Suitable for many useful applications
|
|
96
|
- Distinctive image features from scale-invariant keypoints
- David G. Lowe, International Journal of Computer Vision, 60, 2 (2004),
pp. 91-110
- Recognising panoramas
- Matthew Brown and David G. Lowe, International Conference on Computer
Vision (ICCV 2003), Nice, France (October 2003), pp. 1218-1225.
- Video-Based Document Tracking: Unifying Your Physical and Electronic
Desktops
- Jiwon Kim, Steven M. Seitz and Maneesh Agrawala, ACM Symposium on User
Interface Software and Technology (UIST 2004), pp. 99-107.
|