Depth estimation in images and videos is an important task in many computer vision problems and is often a prerequisite in robotics and autonomous driving. While estimating the depth of a static scene with multiple cameras is widely regarded as solved, estimating depth from monocular images alone remains a significant challenge. The aim of this project is to explicitly model monocular depth cues inspired by human depth perception. Furthermore, we seek to explore the potential of using relative instead of absolute depth information.

The aim of the cooperation project MonoTo3D between the L3S/TIB and the University of Paderborn is to develop new methods for monocular depth estimation. Today, most methods employ artificial neural networks to tackle this problem. These methods are typically trained on datasets acquired with depth sensors such as LiDAR or the Kinect camera.

This is problematic in several respects. First, gathering training data in this way is expensive. Second, the acquired data is often noisy and sometimes outright wrong. Moreover, estimating the absolute distance between camera and scene is unnecessarily difficult, because the images themselves often do not contain all the information required.

Because of that, this project will pursue a different strategy. First, depth cues motivated by the way humans perceive depth will be explicitly modeled and incorporated into the depth estimation task; these include, for example, linear perspective, relative size, and occlusion. Furthermore, depth estimation will not be treated as a regression problem that predicts an absolute depth value for each pixel, but as a learning-to-rank problem: given two image regions, the goal is to estimate their ordinal relationship ("is nearer", "is farther away"). This is advantageous in several ways. For one, it substantially reduces the complexity of the task, since only an ordinal relationship has to be predicted, which is sufficient for many practical applications. In addition, training data is much easier to gather, because humans are very good at judging such relative depth differences in images.
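
To make the learning-to-rank formulation concrete, the sketch below shows what a pairwise ordinal depth loss could look like in PyTorch. It is a minimal illustration under assumed names and shapes (`pred_depth`, `points_a`, `points_b`, and `relation` are all hypothetical), not the project's actual implementation.

```python
import torch


def pairwise_ranking_loss(pred_depth, points_a, points_b, relation):
    """Pairwise ordinal loss for relative depth (illustrative sketch).

    pred_depth: (B, H, W) predicted relative depth values.
    points_a, points_b: (B, K, 2) integer (row, col) coordinates of paired points.
    relation: (B, K) labels: +1 if point a is farther than point b,
              -1 if it is nearer, 0 if both lie at roughly the same depth.
    """
    b_idx = torch.arange(pred_depth.size(0)).unsqueeze(1)            # (B, 1)
    depth_a = pred_depth[b_idx, points_a[..., 0], points_a[..., 1]]  # (B, K)
    depth_b = pred_depth[b_idx, points_b[..., 0], points_b[..., 1]]  # (B, K)
    diff = depth_a - depth_b

    # Ordered pairs: logistic loss on the signed depth difference, so the
    # network is penalised only when its predicted ordering contradicts
    # the annotated one.
    ordered = relation != 0
    loss_ordered = torch.log1p(torch.exp(-relation[ordered] * diff[ordered]))

    # "Equal depth" pairs: pull the two predictions together.
    loss_equal = diff[~ordered].pow(2)

    return torch.cat([loss_ordered, loss_equal]).mean()


# Example: a batch of 2 images with 3 annotated point pairs each.
pred = torch.rand(2, 64, 64, requires_grad=True)
pa = torch.randint(0, 64, (2, 3, 2))
pb = torch.randint(0, 64, (2, 3, 2))
rel = torch.tensor([[1, -1, 0], [0, 1, -1]])
pairwise_ranking_loss(pred, pa, pb, rel).backward()
```

Such a loss only constrains orderings between pairs of points, which matches the observation above: ordinal supervision is cheaper to annotate than metric depth and is sufficient for many downstream uses.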
