Structure from motion solves the following problem:
Given a set of images of a static scene with 2D points in correspondence, shown here as color-coded points, find…
a set of 3D points P and
a rotation R and position t for each camera that explain the observed correspondences.  In other words, when we project a point into any of the cameras, the reprojection error between the projected and observed 2D points is low.
This problem can be formulated as an optimization problem where we want to find the rotations R, positions t, and 3D point locations P that minimize the sum of squared reprojection errors f.  This is a non-linear least squares problem and can be solved with algorithms such as Levenberg-Marquardt.  However, because the problem is non-linear, it can be susceptible to local minima.  Therefore, it's important to initialize the parameters of the system carefully.  In addition, we need to be able to deal with erroneous correspondences.
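To make the objective concrete: if p_ij is the observed 2D position of point j in image i, we minimize f = sum over all observations (i, j) of || project(R_i, t_i, P_j) - p_ij ||^2.  Below is a minimal sketch of this kind of bundle adjustment using SciPy's Levenberg-Marquardt solver; the parameterization (angle-axis rotations, one focal length per camera, no distortion) and the helper names are illustrative, not the system's actual code.

    # Minimal bundle-adjustment sketch (illustrative only): each camera is an
    # angle-axis rotation r, a position t, and a focal length f; each track is a
    # 3D point P.  The residuals are the 2D reprojection errors.
    import numpy as np
    from scipy.optimize import least_squares
    from scipy.spatial.transform import Rotation

    def project(P, r, t, f):
        # rotate the point into the camera frame and apply a simple pinhole model
        p_cam = Rotation.from_rotvec(r).apply(P - t)
        return f * p_cam[:2] / p_cam[2]

    def residuals(params, n_cams, n_pts, observations):
        # observations: list of (camera_index, point_index, observed_xy) tuples
        cams = params[:n_cams * 7].reshape(n_cams, 7)   # [r(3), t(3), f(1)] per camera
        pts = params[n_cams * 7:].reshape(n_pts, 3)
        errs = []
        for cam_i, pt_i, xy in observations:
            r, t, f = cams[cam_i, :3], cams[cam_i, 3:6], cams[cam_i, 6]
            errs.append(project(pts[pt_i], r, t, f) - xy)
        return np.concatenate(errs)

    # x0 stacks the current camera and point estimates into one parameter vector;
    # the solver then minimizes the sum of squared residuals (Levenberg-Marquardt):
    # result = least_squares(residuals, x0, args=(n_cams, n_pts, observations), method="lm")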
Here’s how the same set of photos appear in our photo explorer.  Our system takes the set of photos and automatically determines the relative positions and orientations from which each photo was taken.  We can then load the photos into our immersive 3D browser where the user can visualize and explore the photos using spatial relationships.
Our system takes as input an unordered set of photos, either from an Internet search or from a large personal collection.  We assume the photos are largely from the same static scene. 
The first step of our system is to apply computer vision techniques to reconstruct the geometry of the scene.  The output of this procedure is the relative positions and orientations of the cameras used to take a connected set of the photographs, as well as a point cloud representing the geometry of the scene, and a sparse set of correspondences between the photos.
This information is then loaded into our interactive photo explorer tool.
Again, the goal of the reconstruction procedure is to automatically estimate the relative positions and orientations, as well as the focal length, or zoom, of each of the photographs in a connected component of the scene.  The process consists of four steps: detecting features in each of the input photos; matching features between each pair of photos; grouping the matches into correspondences across multiple photos; and using the correspondences to estimate the geometry of the scene and the cameras in a technique known as structure from motion.
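A rough outline of how these four steps fit together (the function names here are placeholders for the steps sketched in the notes that follow, not the system's actual code):

    # High-level reconstruction pipeline (placeholder function names).
    from itertools import combinations

    def reconstruct(photo_paths):
        features = {p: detect_features(p) for p in photo_paths}       # 1. detect features
        matches = {}
        for a, b in combinations(photo_paths, 2):                     # 2. match each pair of photos
            matches[(a, b)] = match_features(features[a][1], features[b][1])
        tracks = link_into_tracks(matches)                            # 3. group matches into tracks
        return incremental_sfm(photo_paths, tracks)                   # 4. structure from motion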
I’ll now describe each of these steps in more detail.
[Structure from motion techniques for unordered image collections have also been developed by others, such as Brown and Lowe and Schaffalitzky and Zisserman, and the basic pipeline of our system closely follows that of Brown and Lowe.  To our knowledge, our system is the first to have been demonstrated on large collections of photos from the Internet.]
The core of the reconstruction process is the ability to reliably match feature points between images.  Fortunately, over the last few years feature detection and matching techniques have improved dramatically.  Many different techniques exist – we use SIFT, or scale-invariant feature transform, developed by David Lowe.  SIFT is designed to be invariant to image scale and rotation, as well as affine changes in image intensity.
Here is an image from the Trevi Fountain, and here are the features SIFT detects in the image, shown as square patches.  The squares are scaled and rotated to reflect the scale and orientation of the features.
So, we begin by detecting SIFT features in each photo.
[We detect SIFT features in each photo, resulting in a set of feature locations and 128-byte feature descriptors…]
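As a concrete illustration, the detection step could look like the following in Python, using OpenCV's SIFT implementation as a stand-in for Lowe's original SIFT (not the system's actual code):

    # Detect SIFT features in one photo; keypoints carry location, scale, and
    # orientation, and each descriptor is a 128-dimensional vector.
    import cv2

    def detect_features(image_path):
        img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        sift = cv2.SIFT_create()
        keypoints, descriptors = sift.detectAndCompute(img, None)
        return keypoints, descriptors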
Next, we match features across each pair of photos using approximate nearest neighbor matching.
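A sketch of this matching step, using OpenCV's FLANN-based approximate nearest-neighbor matcher with a ratio test; the library and thresholds here are illustrative choices, not the system's actual code.

    # Match descriptors from two photos with an approximate kd-tree search and
    # keep a match only if it is clearly better than the second-best candidate.
    import cv2

    def match_features(desc1, desc2, ratio=0.8):
        index_params = dict(algorithm=1, trees=4)          # 1 = FLANN_INDEX_KDTREE
        matcher = cv2.FlannBasedMatcher(index_params, dict(checks=64))
        knn = matcher.knnMatch(desc1, desc2, k=2)
        good = []
        for pair in knn:
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
                good.append(pair[0])
        return good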
The matches are refined using RANSAC, a robust model-fitting technique, to estimate a fundamental matrix between each pair of matching images; only the matches consistent with that fundamental matrix are kept.
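A sketch of this verification step with OpenCV (the inlier threshold is illustrative):

    # Fit a fundamental matrix to the putative matches with RANSAC and keep
    # only the matches consistent with it (the inliers).
    import numpy as np
    import cv2

    def filter_matches(kps1, kps2, matches):
        pts1 = np.float32([kps1[m.queryIdx].pt for m in matches])
        pts2 = np.float32([kps2[m.trainIdx].pt for m in matches])
        F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
        if F is None:
            return []
        return [m for m, keep in zip(matches, mask.ravel()) if keep]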
We then link connected components of pairwise feature matches together to form correspondences across multiple images.
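One standard way to form these multi-image correspondences is a union-find over (photo, feature) pairs; here is a sketch of that idea (not necessarily the system's exact bookkeeping):

    # Group pairwise feature matches into multi-image tracks with union-find.
    from collections import defaultdict

    def build_tracks(pairwise_matches):
        # pairwise_matches: iterable of ((photo_i, feature_i), (photo_j, feature_j)) pairs
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path compression
                x = parent[x]
            return x

        for a, b in pairwise_matches:
            parent[find(a)] = find(b)           # union the two components

        tracks = defaultdict(list)
        for node in list(parent):
            tracks[find(node)].append(node)
        # drop inconsistent tracks that contain two different features from the same photo
        return [t for t in tracks.values()
                if len({photo for photo, _ in t}) == len(t)]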
Once we have correspondences, we run structure from motion to recover the camera and scene geometry.  Structure from motion solves the following problem:
To help get good initializations for all of the parameters of the system, we reconstruct the scene incrementally, starting from two photographs and the points they observe. 
We then add several photos at a time to the reconstruction, refine the model, and repeat until no more photos match any points in the scene.
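In pseudocode form, the incremental loop looks roughly like this (the helper names are hypothetical; the refinement step is the non-linear bundle adjustment described earlier):

    # Sketch of the incremental reconstruction loop (hypothetical helper names).
    def incremental_sfm(photos, tracks):
        # start from a well-matched initial pair and the points they observe
        cameras, points = initialize_from_two_views(pick_initial_pair(photos, tracks))
        while True:
            # find not-yet-registered photos that observe enough existing 3D points
            candidates = photos_seeing_points(photos, cameras, points, tracks)
            if not candidates:
                break
            for photo in candidates:
                cameras[photo] = estimate_camera_pose(photo, points, tracks)    # pose from known points
                points.update(triangulate_new_tracks(photo, cameras, tracks))   # add newly visible points
            cameras, points = bundle_adjust(cameras, points, tracks)            # refine the whole model
        return cameras, points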
Once we reconstruct a scene, we can view it in our photo explorer.  I'll now describe the explorer, starting with a demo of the system.
There are two main components of the system: the navigation controls and the rendering of the scene.
Another way to navigate a photo collection is by using an overhead map.  Before we can use this tool, however, we need to align the reconstruction with a map.  Here is a video of a user manually aligning a reconstruction, shown in black, with an aerial photograph inside our explorer tool.
Now we’ll move on to another dataset to demonstrate some more of the navigation features of our system. 
This is a reconstruction of the Old Town Square in Prague from about 200 photos, shown inside our photo explorer.  An overhead map is shown in the upper right corner.
For this dataset, we render the scene using 3D line segments and add color to the scene by projecting blurred, partially transparent versions of the photos onto detected planes.
The user can select photos from the map, and the map tracks the movement of the virtual camera.  Here the user selects a building to see a better picture. 
Now here’s another example of relation-based browsing – the user can click on the “move right” button to move right to the next building in the row of façades. 
Now the user clicks on the zoom out button to see more of the scene.