Project #1:  3D Capture from Images

 

For this assignment, you will team up with one other classmate and implement a technique for estimating 3D depth information from digital photographs.  The main components will be (1) choosing a method to implement, (2) coding it up, (3) deploying it in the lab, and (4) applying your method to create depth scans of a few objects.

 

We have purchased equipment and arranged support infrastructure for the following 7 shape scanning methods.  By 5pm on Thursday, March 29, send us (seitz@cs.washington.edu and curless@cs.washington.edu) email with your preferences as a ranked list from 1 (top choice) to 7 (bottom choice).  Please rank all 7 projects.  Based on this information, we will select about 5 projects and assign groups of 2 students to each project.  The assignments will be announced on Friday.

 

We will soon provide more information about equipment, support code, and so forth.  The purpose of this handout is to give you enough information to choose which method you want to implement.  The project will be due Wednesday, April 18.

 

The possible projects are as follows:

 

Voxel Coloring

 

This is a passive vision technique for creating voxel models from several images.  The approach works by sweeping a plane (or other surface) through the scene volume and reconstructing voxels on the plane that are consistently colored in the input images.  In particular, a voxel is deemed consistent if it projects to pixels whose color variance is below some threshold.  Visibility is accounted for by sweeping the plane in a “front-to-back” order.  The scene needs to lie outside the convex hull of the camera viewpoints.
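
For orientation, here is a minimal sketch (Python with NumPy) of the per-voxel consistency test.  The function names, the default color threshold, and the way visibility flags are passed in are illustrative assumptions, not part of the support code we will provide.

import numpy as np

def project(P, X):
    # Project 3D point X (length 3) with a 3x4 camera matrix P; returns (col, row).
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def voxel_consistent(X, images, cameras, visible, threshold=30.0):
    # Color-consistency test for one voxel at 3D position X.
    #   images  : list of HxWx3 float arrays
    #   cameras : list of 3x4 projection matrices, one per image
    #   visible : list of booleans saying whether the voxel is unoccluded in
    #             each view (maintained by the front-to-back plane sweep)
    # Returns (is_consistent, mean_color).
    colors = []
    for img, P, vis in zip(images, cameras, visible):
        if not vis:
            continue
        col, row = project(P, X)
        r, c = int(round(row)), int(round(col))
        if 0 <= r < img.shape[0] and 0 <= c < img.shape[1]:
            colors.append(img[r, c])
    if len(colors) < 2:
        # too few observations to judge; treat as consistent for now
        return True, (colors[0] if colors else None)
    colors = np.array(colors)
    consistent = colors.std(axis=0).mean() < threshold
    return consistent, colors.mean(axis=0)

Voxels that pass the test are colored with the mean and kept; voxels that fail are carved away before the sweep moves on to the next plane.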

 

Image Acquisition:  a simple solution is to put the object on a rotary turntable and rotate it in front of a stationary video or still camera, taking several pictures (20 to 100 is reasonable for a 360 degree rotation).  The camera would have to be positioned slightly above the object, looking down, to satisfy the convex hull constraint.  The camera should be calibrated with respect to the turntable beforehand.  One way to do this is to take a calibration object, such as a plane whose 3D geometry is known, place it on the turntable, and rotate it to three or four known orientations.  From these images, you should be able to obtain the intrinsic and extrinsic parameters using any of a variety of camera calibration routines, and also solve for the rotation axis of the turntable.
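
Once the camera is calibrated and the turntable axis is recovered, the per-view camera matrices follow from pure geometry.  Below is a minimal sketch (Python/NumPy) of that bookkeeping; the function names, the convention that the axis is given as a unit direction plus a point on the axis in world coordinates, and the sign of the turntable angle are illustrative assumptions.

import numpy as np

def axis_angle_rotation(axis, theta):
    # Rodrigues' formula: rotation matrix for angle theta about a unit axis.
    a = axis / np.linalg.norm(axis)
    K = np.array([[0, -a[2], a[1]],
                  [a[2], 0, -a[0]],
                  [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def turntable_cameras(K_intr, R0, t0, axis, p_on_axis, n_views):
    # K_intr    : 3x3 intrinsic matrix
    # R0, t0    : extrinsics of the fixed camera at the reference turntable angle
    # axis      : unit direction of the turntable axis (world coordinates)
    # p_on_axis : any point on the axis (world coordinates)
    # Returns one 3x4 projection matrix per evenly spaced turntable angle.
    cameras = []
    for i in range(n_views):
        theta = 2.0 * np.pi * i / n_views   # sign depends on turntable direction
        Rt = axis_angle_rotation(axis, theta)
        # a point with object coordinates X lands at Rt @ (X - p) + p in the world,
        # so in object coordinates the camera becomes [R0 Rt | R0 (p - Rt p) + t0]
        R = R0 @ Rt
        t = R0 @ (p_on_axis - Rt @ p_on_axis) + t0
        cameras.append(K_intr @ np.hstack([R, t.reshape(3, 1)]))
    return cameras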

 

References:

 

S. M. Seitz and C. R. Dyer, Photorealistic Scene Reconstruction by Voxel Coloring, International Journal of Computer Vision, 35(2), 1999, pp. 151-173.  A shorter conference version appeared in Proc. Computer Vision and Pattern Recognition Conf., 1997, pp. 1067-1073.

 

K. N. Kutulakos and S. M. Seitz, A Theory of Shape by Space Carving, International Journal of Computer Vision, Marr Prize Special Issue, 2000.  An earlier version appeared in Proc. Seventh International Conference on Computer Vision (ICCV), 1999, pp. 307-314.

 

 

Stereo

 

Create a depth map from two views of a scene by correlating pixels between views.  In the standard stereo pipeline, the cameras are first calibrated using any of a variety of camera calibration routines (you probably want to first place a known “calibration object” in the scene, use a standard routine to find the camera parameters, and then remove the object).  The next step is to “rectify” the images by warping them so that the pixel displacement between images is purely horizontal.  This rectification transformation is a planar projective (texture-map) warp and is described in standard vision textbooks such as the Faugeras text (see below).  Next, the pixels in each row of the first image are matched with those of the second to determine disparity.  There are several methods for performing this search.  Since this step is the crucial one, be sure to implement a good technique, such as one of the three methods listed below in the references.  It is straightforward to convert disparity into depth, given the known camera parameters.
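
For orientation only, here is a minimal sketch (Python/NumPy) of the simplest possible matcher, brute-force SSD block matching on an already-rectified pair, together with the disparity-to-depth conversion.  This is not one of the three recommended methods below, and the function names, window size, and disparity limit are illustrative assumptions.

import numpy as np

def block_match_disparity(left, right, max_disp=64, win=5):
    # left, right : HxW float arrays, already rectified so that a match in the
    #               right image lies d pixels to the left on the same row.
    # Returns an HxW disparity map (0 where no match was attempted).
    H, W = left.shape
    half = win // 2
    disp = np.zeros((H, W))
    for r in range(half, H - half):
        for c in range(half + max_disp, W - half):
            patch = left[r - half:r + half + 1, c - half:c + half + 1]
            best, best_d = np.inf, 0
            for d in range(max_disp):
                cand = right[r - half:r + half + 1,
                             c - d - half:c - d + half + 1]
                ssd = np.sum((patch - cand) ** 2)
                if ssd < best:
                    best, best_d = ssd, d
            disp[r, c] = best_d
    return disp

def disparity_to_depth(disp, focal_px, baseline):
    # depth = f * B / d for a rectified pair, with the focal length in pixels
    with np.errstate(divide="ignore"):
        return np.where(disp > 0, focal_px * baseline / disp, 0.0)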

 

References:

Three-Dimensional Computer Vision: A Geometric Viewpoint, by Olivier Faugeras, MIT Press.

 
C. L. Zitnick and T. Kanade, A Cooperative Algorithm for Stereo Matching and Occlusion Detection, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 7, July 2000.

Web page:  http://www.cs.cmu.edu/~clz/stereo.html

D. Scharstein and R. Szeliski, Stereo matching with nonlinear diffusion, International Journal of Computer Vision, 28(2):155-174, July 1998.

Y. Boykov, O. Veksler, and R. Zabih, Fast Approximate Energy Minimization via Graph Cuts, unpublished manuscript, 2000.

 

Swept-stripe triangulation

 

Swept-stripe triangulation is a technique for capturing the shape of an object by sweeping a plane of light over a surface and observing the reflections with an off-axis camera.  This approach has been demonstrated to be one of the most accurate range scanning techniques for applications such as static shape capture for engineering, film, and sculpture.

 

The plane of light can be formed in several ways, including the use of a video projector, a slide projector, or a laser line source.  There are two primary choices for scanning motion:

 

1.  Fix the light source and camera and move the object on a linear or rotational stage.

 

2.  Fix the object and camera and move the light source.

 

The second choice can be implemented with a motion-controlled rotational or translational sweep of the light source or by using a video projector and playing a swept light video sequence.
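
Whichever configuration you use, the core computation per camera pixel is the same: intersect the pixel's line of sight with the known plane of light.  A minimal sketch of that step (Python/NumPy, camera-centered coordinates) is below; the function name and the convention that the plane is given as n . X + d = 0 in camera coordinates are assumptions for illustration.  The plane parameters themselves come from calibrating the light source for each position in the sweep.

import numpy as np

def triangulate_stripe_pixel(K_inv, plane_n, plane_d, pixel):
    # Intersect the camera ray through `pixel` with the light plane
    # n . X + d = 0, everything in camera coordinates (camera center at origin).
    #   K_inv   : inverse of the 3x3 intrinsic matrix
    #   plane_n : plane normal (length 3); plane_d : plane offset
    #   pixel   : (col, row) where the stripe center was detected
    ray = K_inv @ np.array([pixel[0], pixel[1], 1.0])   # direction of the line of sight
    denom = plane_n @ ray
    if abs(denom) < 1e-9:
        return None          # ray is parallel to the plane
    t = -plane_d / denom
    return t * ray if t > 0 else None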

 

Reference:

 

B. Curless and M. Levoy, "Better Optical Triangulation through Spacetime Analysis," Proc. of Intl. Conf. on Computer Vision, pp. 987-994, Boston, June 1995.

Hierarchical stripe scanning

 

Hierarchical stripe scanning is a robust triangulation method for rapidly determining the shape of an object.  The idea is to project a sequence of progressively finer vertical stripes onto the object and observe the reflections with an off-axis camera.

 

In particular, if you illuminate the object with a pattern that is half white and half black, then each pixel of the camera will observe a white reflection or no reflection.  White pixels are known to map to the left half of the object, and black pixels to the right half.  You can now subdivide the projection so that there are four stripes (white, black, white, black) and take another image.  If a pixel was observed to be white the first time and black the second, then that pixel's line of sight must intersect the second stripe, which is half as thick as the stripe used for the first image.  Continuing this refinement process, each pixel will record a bit code (a sequence of white and black observations) that will ultimately correspond to a particular stripe at the finest level.  Triangulating with that stripe gives the desired range point for that pixel.
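
A minimal sketch of the per-pixel decoding (Python/NumPy) is below.  The thresholding against all-white and all-black reference images, and the function names, are assumptions for illustration; the resulting stripe index selects a projector plane, which is then intersected with the pixel's line of sight just as in the swept-stripe method above.

import numpy as np

def binarize(pattern_img, white_ref, black_ref):
    # Classify each pixel as lit (True) or unlit (False) by comparing the
    # observed pattern image to full-white and full-black reference images.
    return pattern_img > 0.5 * (white_ref + black_ref)

def decode_stripe_codes(binary_images):
    # binary_images : list of HxW boolean arrays, one per projected pattern,
    #                 ordered coarse to fine (True where the pixel saw white).
    # Returns an HxW integer array with each pixel's stripe index at the finest
    # level: the observed bits are concatenated, most significant bit first.
    code = np.zeros(binary_images[0].shape, dtype=np.int64)
    for b in binary_images:
        code = (code << 1) | b.astype(np.int64)
    return code

In practice, the bit observed right at a stripe boundary is easy to misclassify; a common remedy is to project a Gray-coded sequence so that adjacent stripes differ in only one bit.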

 

Reference:

 

Sato, K. and Inokuchi, S., Three-dimensional surface measurement by space encoding range imaging, Journal of Robotic Systems 2 (1985), pp. 27-39.

One-shot active triangulation

 

"One-shot" active triangulation is a method for projecting a single grid pattern onto an object and determining its shape.  Making such a technique robust take some effort and a set of assumptions, but the result is a method for capturing real-time deformations, such as a human face making a variety of expressions.

 

The idea is to project a grid pattern, e.g., a rectilinear grid, onto a surface and observe the reflection with an off-axis camera.  Next, you process the image to detect predominantly horizontal and predominantly vertical edge segments.  You then link the edge segments to form a graph that represents a deformed version of the grid.  The deformation (which may include breaks in the grid due to occlusion) corresponds to the shape of the object.  In the end, the positions of the extracted grid lines can be triangulated with the original projector grid lines.
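
As a starting point for the edge detection step, here is a minimal sketch (Python/NumPy) that separates strong gradients into predominantly vertical and predominantly horizontal edge maps.  The function name and threshold are assumptions, and the real work, linking the edges into a grid graph and handling breaks, still follows.

import numpy as np

def oriented_edge_maps(img, grad_thresh=0.1):
    # img : HxW grayscale array scaled to [0, 1].
    # Returns boolean maps of predominantly vertical and predominantly
    # horizontal edges, a first step toward extracting the deformed grid.
    gy, gx = np.gradient(img)            # gradients along rows (y) and columns (x)
    strong = np.hypot(gx, gy) > grad_thresh
    # a vertical grid line produces mostly horizontal gradients, and vice versa
    vertical_edges = strong & (np.abs(gx) > np.abs(gy))
    horizontal_edges = strong & (np.abs(gy) >= np.abs(gx))
    return vertical_edges, horizontal_edges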

 

Note that Steve and Brian have a colored grid projection technique in mind that should substantially reduce the difficulty of determining the edge connectivity of the graph.  Still, for reference, we have included pointers to a technique that uses a single-color (white) grid.

 

References:

 

M. Proesmans, L. Van Gool, and A. Oosterlinck, Active acquisition of 3D shape for moving objects, Proc. International Conference on Image Processing, Lausanne, Switzerland, September 1996.

 

M. Proesmans, L. Van Gool, and A. Oosterlinck, One-shot active 3D shape acquisition, Proc. 13th IAPR International Conference on Pattern Recognition (Applications and Robotic Systems), vol. III C, pp. 336-340, August 25-26, 1996, Vienna, Austria.

 

Photometric stereo

 

Photometric stereo is a technique for determining the normals and reflectances over a surface.  Under certain assumptions, it is a fairly simple technique and yields data that can be used, e.g., for bump-mapping surfaces.

 

Here's how it works.  Consider taking three images of a surface in succession, where a different point light source illuminates the surface in each image.  For a given pixel, you will observe three values.  If you assume the surface is diffuse with reflectance R, then for the i-th image you will observe:

 

  I_i = R * S_i * (l_i . n)

 

where S_i is the "intensity" of the i-th light source, l_i is the unit lighting direction, and n is the unit surface normal.  We can re-write this as:

 

  I_i = L_i . N

 

where L_i = S_i * l_i and N = R * n.  Treating L_i and N as column vectors, the three measurement equations can be combined into a single matrix equation:

 

  | (L_1)^T |
  | (L_2)^T |  N = I
  | (L_3)^T |

 

where I = (I_1, I_2, I_3)^T.  This 3x3 linear system can be solved for N; the reflectance is then R = |N| and the normal is n = N / |N|.
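
A minimal per-pixel solver (Python/NumPy) for this system is sketched below.  The function name is an assumption, and it requires that the three light vectors L_i are known and linearly independent.

import numpy as np

def photometric_stereo(images, lights):
    # images : list of three HxW arrays I_1, I_2, I_3 (linear intensities)
    # lights : 3x3 array whose i-th row is L_i = S_i * l_i
    # Solves L N = I per pixel; returns (reflectance R, unit normals n).
    H, W = images[0].shape
    I = np.stack([im.reshape(-1) for im in images], axis=0)   # 3 x (H*W)
    N = np.linalg.solve(lights, I)                            # 3 x (H*W)
    R = np.linalg.norm(N, axis=0)
    n = N / np.maximum(R, 1e-12)
    return R.reshape(H, W), n.reshape(3, H, W)

With more than three lights, the same recovery can be done in a least-squares sense (e.g., with np.linalg.lstsq).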

 

References:

 

R. J. Woodham, Photometric method for determining surface orientation from multiple images, Optical Engineering, 19(1), pp. 139-144, 1980.

 

H. Rushmeier, G. Taubin, and A. Gueziec, Applying shape from lighting variation to bump map capture, Eurographics Rendering Workshop 1997 (J. Dorsey and P. Slusallek, eds.), Springer Wien, June 1997, pp. 35-44.

 

Shadow Scanning

In this approach, a stick is waved by hand under a light source, casting a shadow stripe onto the surface of the object to be scanned.  As the stick moves, the shadow slides across the object and is imaged by a video camera.  The image of the shadow reveals the shape of the object under the shadow.  In particular, each “edge” of the shadow lies on a plane determined by the light source and the stick.  This plane may be identified by ensuring that part of the shadow hits a known ground plane and calibrating the light source beforehand.  Given the plane and the shadow edge in the image, the 3D profile of the shadow edge is readily computed.

Getting this working requires calibrating the light source, implementing a shadow edge detector, and setting up a good working environment where the shadows will show up well.
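
A minimal sketch of the geometry (Python/NumPy, camera-centered coordinates) is below.  The function names, and the assumption that the light source position and two points where the shadow edge meets the ground plane are already known in camera coordinates, are for illustration only and gloss over the temporal processing in Bouguet and Perona's full method.

import numpy as np

def shadow_plane(light_pos, ground_pt_a, ground_pt_b):
    # The shadow edge lies in the plane through the calibrated light source
    # and the two points where the edge meets the known ground plane.
    # All points are in camera coordinates.  Returns (n, d) with n . X + d = 0.
    n = np.cross(ground_pt_a - light_pos, ground_pt_b - light_pos)
    n = n / np.linalg.norm(n)
    return n, -float(n @ light_pos)

def shadow_edge_point(K_inv, n, d, pixel):
    # Intersect the line of sight through a detected shadow-edge pixel with the
    # shadow plane (camera center at the origin).
    ray = K_inv @ np.array([pixel[0], pixel[1], 1.0])
    denom = n @ ray
    if abs(denom) < 1e-9:
        return None
    return (-d / denom) * ray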

This technique was developed by Jean-Yves Bouguet and Pietro Perona.  Check out Bouguet’s web page at Caltech for more information:

http://www.vision.caltech.edu/bouguetj/ICCV98/index.html