The goal of the envisaged project is holistic 3D scene understanding from monocular, stereo, or multi-camera image sequences. Scene understanding involves many sub-tasks, such as scene labeling, object detection and tracking, object reasoning, and the analysis of object behavior. Each of these sub-tasks explains only one aspect of a particular scene, so all of them need to be carried out to understand the scene fully.
In real scenarios, such as dynamic urban scenes, a shopping mall, or a football match, several sub-tasks are naturally coupled. Therefore, the potential of an integrated scheme will be explored. The scheme consists of a low-level scene labeling model, a mid-level object reasoning model, and a high-level behavior analysis model, which together build a context-aware scene description. This integrated 3D model will take advantage of dynamic information as well as static information coming from semantic labels and geometry. The novelty is threefold: (1) The three levels, each having its own appearance model and its own geometric and semantic model of the scene, are meant to interact smoothly through statistically based interfaces to arrive at consistent descriptions of the scene. (2) The discovery and learning of object behavior context for context-aware behavior modeling are investigated for image sequences. (3) The build-up of the scene description integrates bottom-up and top-down reasoning, which allows prior knowledge to be incorporated and performance to be improved.
Challenges & Highlights
Typically, scenes are composed of a static environment and dynamic objects, e.g., vehicles driving on roads, which are located between buildings, and pedestrians walking on sidewalks, which are surrounded by objects. Therefore, objects should not be modeled in isolation but in their 3D scene context, which puts strong constraints on the position and motion of objects. From a high-level reasoning point of view, the visual context is very important. Humans employ the visual context extensively for behavior recognition in dynamic, crowded environments. To understand the behavior of an object (vehicle or pedestrian) in a crowded environment, the most relevant visual context is the set of non-rigid relationships among co-existing objects in the same scene.
Potential applications & future issues
The project will develop methods for building 3D semantic scene representations for scene understanding that are suitable for applications such as surveillance, autonomous driving, and map enhancement.