In computer vision, a Manhattan world scene is a real-world scene modeled on a Cartesian coordinate system. The scene is composed of four types of lines: random lines and lines parallel to one of the X, Y, or Z axes.
Introduction
The Manhattan world assumption was first discussed by Coughlan and Yuille.[1] They stated that many visual scenes are based on a three-dimensional Manhattan grid, which imposes regularities on the image. Urban scenes are especially likely to comply with the Manhattan world assumption, because the predominance of straight lines gives them high statistical regularity. Although one might expect Manhattan scenes to be strictly urban, the same authors argue that some rural scenes contain enough structure in the distribution of their edges to provide a natural Cartesian reference frame for the viewer.[2]
Theory
Manhattan world
Most urban and some indoor scenes (usually man-made scenes) are based on a Cartesian coordinate system, which we can refer to as a Manhattan grid. The ground plane is represented by the X and Y axes, while the vertical direction is given by the Z axis. Images are formed by perspective projection (see pinhole camera model) of a three-dimensional scene. Lines that are parallel in the scene (such as the sides of window frames or the edges of buildings) project to straight lines in the image that converge to a vanishing point. The structural edges in Manhattan scenes usually occur in the X, Y, and Z directions. It comes naturally to humans to estimate the three-dimensional structure of a scene from this information. We also interpret the properties of an image in terms of its three-dimensional structure rather than apparent size in the image (e.g. the relative position of objects, the size of people). The Ames room visual illusion exploits this assumption: people appear to change size as they move within the camera's view.
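Under the standard pinhole model, the convergence of parallel scene lines to a common vanishing point can be derived directly. The following short derivation uses notation chosen here for illustration, not taken from the sources:

```latex
% Pinhole projection: a camera-frame point (X, Y, Z), Z > 0, maps to
\[ (u, v) = \left( \frac{fX}{Z},\; \frac{fY}{Z} \right), \]
% where f is the focal length. A 3-D line with base point p_0 and
% direction d = (d_x, d_y, d_z), d_z \neq 0, is p(t) = p_0 + t\,d, and
\[ \lim_{t \to \infty} \big(u(t), v(t)\big)
   = \left( \frac{f\,d_x}{d_z},\; \frac{f\,d_y}{d_z} \right). \]
% The limit is independent of p_0, so every scene line with direction d
% converges to the same vanishing point in the image.
```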
Calculating relative camera position in Manhattan scenes
Knowing the orientation of the viewer relative to the Manhattan frame makes the scene much easier to interpret: the important lines in the scene, such as street/building boundaries or corridor edges, become simpler to determine, and outliers to the Manhattan structure can be detected faster.
It is assumed that the camera direction lies in the horizontal plane, as it does in most images. The Manhattan grid then defines an (i, j, k) coordinate system. Lines in the k direction map to approximately vertical lines in the projected image. There is, however, an ambiguity in the i and j orientations, because the compass heading can only be recovered modulo 90°.
- Projection to 2D
Project 3-D lines from a Manhattan scene onto a UV image plane. We consider i, j, k to be three orthogonal vectors representing the three axes of the Cartesian coordinate system, UV the plane onto which a scene point x is projected, and f the focal length of the camera.
The camera axes are defined by three orthogonal unit vectors, which are specified by three Euler angles Ψ = (α, β, γ): azimuth, elevation, and twist (a projection sketch follows below).
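A minimal sketch of this projection in Python follows. The Euler-angle rotation order, function names, and numeric values are illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

def rotation_from_euler(alpha, beta, gamma):
    """Camera orientation from Euler angles (azimuth, elevation, twist),
    in radians. The rotation order chosen here is an illustrative
    convention; the cited papers may parameterize the angles differently."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rz = np.array([[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]])  # azimuth
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cb, -sb], [0.0, sb, cb]])  # elevation
    Rt = np.array([[cg, -sg, 0.0], [sg, cg, 0.0], [0.0, 0.0, 1.0]])  # twist
    return Rt @ Rx @ Rz

def project(point, R, f):
    """Pinhole projection of a Manhattan-frame point onto the UV plane."""
    x, y, z = R @ point      # Manhattan frame -> camera frame
    return f * x / z, f * y / z

# Example: a point 10 units along the i-axis at eye height (values made up).
R = rotation_from_euler(np.radians(30.0), np.radians(5.0), 0.0)
print(project(np.array([10.0, 0.0, 1.7]), R, f=1.0))
```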
- Outliers in Manhattan world
The Manhattan model can be used to find outliers: pixels whose edges are not aligned with the Manhattan frame. Coughlan and Yuille used the model to identify odd elements in Manhattan images (a bike, a robot).[4] A toy version of this classification is sketched below.
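The following is a highly simplified stand-in for the Bayesian edge model: it labels an edge pixel with the axis whose vanishing point its edge direction points toward, or as an outlier if none fits. The thresholding scheme and the helper name are assumptions made for illustration:

```python
import numpy as np

def classify_edge(pixel_uv, edge_angle, vps, tol_deg=10.0):
    """Label an edge pixel with a Manhattan axis, or 'outlier'.
    pixel_uv: (u, v) image coordinates of the pixel.
    edge_angle: direction of the edge through the pixel, in radians.
    vps: dict mapping axis name -> vanishing point (u, v)."""
    for axis, vp in vps.items():
        # An edge aligned with this axis lies on the line from the pixel
        # to the axis's vanishing point.
        vp_angle = np.arctan2(vp[1] - pixel_uv[1], vp[0] - pixel_uv[0])
        # Angular distance modulo pi (edge directions are unoriented).
        diff = abs((edge_angle - vp_angle + np.pi / 2) % np.pi - np.pi / 2)
        if np.degrees(diff) < tol_deg:
            return axis
    return "outlier"

# Hypothetical vanishing points and a test pixel.
vps = {"x": (500.0, 240.0), "y": (-300.0, 240.0), "z": (320.0, -9000.0)}
print(classify_edge((100.0, 200.0), np.radians(6.0), vps))  # -> 'x'
```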
- Speculation
Manhattan edge classification could simplify stereo matching.[5] The basic idea is that edge classification at each pixel constrains the possible matches along epipolar lines: an X pixel in the left image should match an X pixel in the right image, not a Y or Z pixel, so many candidate matches can be ruled out (see the sketch below).
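As a sketch of this pruning under assumed per-pixel labels (the helper and data layout are hypothetical, not from the cited work):

```python
def prune_matches(left_label, right_labels):
    """Keep only epipolar-line candidates whose Manhattan class matches
    the left pixel's class. right_labels: (column, label) pairs sampled
    along the corresponding epipolar line in the right image."""
    return [col for col, label in right_labels if label == left_label]

# An "x"-classified left pixel can only match "x"-classified right pixels.
candidates = [(10, "x"), (11, "y"), (12, "x"), (13, "z"), (14, "outlier")]
print(prune_matches("x", candidates))   # -> [10, 12]
```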
Applications
[edit]Automatic Camera Calibration from a Single Manhattan Image
- The Manhattan model is used to estimate the camera pose and calibration parameters such as the focal length.
- A stochastic search algorithm is introduced to find the camera parameters that best fit the observed edges (a minimal search loop is sketched below).
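The following minimal random-walk hill climbing loop illustrates the shape of such a stochastic search. The scoring function here is a synthetic stand-in: a real implementation would score how well detected edges align with the vanishing points implied by the candidate parameters, as in the Bayesian edge model:

```python
import numpy as np

# Synthetic stand-in for the edge likelihood: a made-up optimum at TRUE,
# used only to demonstrate the search loop itself.
TRUE = np.array([0.4, 0.1, -0.05, 1.2])   # alpha, beta, gamma, f
def score(params):
    return -np.sum((params - TRUE) ** 2)

def stochastic_search(n_iter=20000, step=0.05, seed=0):
    """Random-walk hill climbing over (alpha, beta, gamma, f); an
    illustrative stand-in for the paper's stochastic search."""
    rng = np.random.default_rng(seed)
    best = np.array([0.0, 0.0, 0.0, 1.0])
    best_s = score(best)
    for _ in range(n_iter):
        cand = best + step * rng.standard_normal(4)
        s = score(cand)
        if s > best_s:          # greedy accept: keep only improvements
            best, best_s = cand, s
    return best

print(stochastic_search())      # converges near TRUE
```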
Bayesian algorithms for autonomous vision systems
- The Manhattan model is used to guide an autonomous robot vehicle, using an artificial retina.[6]
Extracting 3D information from single images
- Exploiting the Manhattan assumption to obtain 3D reconstructions from a single image.
- Extracting ambiguous features, such as image depth, and identifying objects.
- Generating views of interior rooms.[7]
- Automatically generating models of building interiors.[8]
See also
Related research papers
- Building Reconstruction using Manhattan-World Grammars (Vanegas et al., 2010)
- Automatic camera calibration from a single Manhattan image (Deutscher et al., 2002)
- Manhattan-World Stereo (Furukawa et al., 2009)
- Geometric Reasoning for Single Image Structure Recovery (Lee et al., 2009)
Methods that use edge detection and Hough transforms rather than Manhattan models:
- Finding vanishing points in a single image (Brillault-O'Mahony, 1991; Lutton, Maître, and Lopez-Krahe, 1994; Shufelt, 1999)
- Finding vanishing points in the context of a Manhattan-type assumption of dominant directions in a scene (J. Košecká and W. Zhang, "Video Compass", ECCV 2002)
References
[edit]- ^ Manhattan World: Compass Direction from a Single Image by Bayesian Inference., Coughlan and Yuille, 1999.
- ^ The Manhattan World Assumption: Regularities in scene statistics which enable Bayesian inference., Coughlan and Yuille, 2000.
- ^ [1] Manhattan Talk, James M. Coughlan, Smith-Kettlewell Eye Research Institute
- ^ [2] Manhattan Talk, James M. Coughlan, Smith-Kettlewell Eye Research Institute
- ^ Manhattan World: Orientation and Outlier Detection by Bayesian Inference, Coughlan and Yuille, 2003.
- ^ Burgi. “Bayesian algorithms for autonomous vision systems.” Technical Report 1040, Swiss Center for Electronics and Microtechnology, Neufchatel, Switzerland. In preparation. 2003.
- ^ [3]Lee D.C., Hebert M., Kanade T., Geometric reasoning for single image structure recovery. IEEE CVPR, 2009.
- ^ [4]Furukawa, Y., Curless, B., Seitz, S.M., Szeliski, R. Reconstructing Building Interiors from Images. ICCV, 2009.