0.B2.3 Temporal development of Dynamic vision
When ‚computer vision’ or ‚machine vision’ became possible in the last half of the twentieth century due to the advent of powerful digital (micro-) processors, several avenues to its development opened up. The starting point was processing of single digital images for exploration of unknown or not sufficiently well known spaces like the surfaces of the moon or other planets, inaccessible environments in agriculture, geodesy or battle field territory.
With computing power developing at a rate of about one order of magnitude every four to five years since the advent of the microprocessor in the early 1970s, application areas rapidly spread to a wide range; the following figure shows types of vision systems with those components marked by bold letters that were selected as starting point for dynamic vision in the 1980s:
The usual approach to computer vision with the limited computing power available in the early 1980s (typical: clock rate of a few MHz, 8- and 16-bit microprocessors), was to start developing feature extraction and image interpretation from single snapshots or a small sequence of images stored in memory. Simple evaluation of a single small image needed several seconds to minutes. The hope was that with the development of computing power this time would shrink to tens of milliseconds (needed speed-up of ~ 3 orders of magnitude, i.e. ~ 1½ decades). Big development projects were started like DARPA’s ‘On Strategic Computing’ to investigate new computer architectures with massively parallel processors [Klass 1985; Kuan et al. 1986; Thorpe et al. 1987; Scudder and Weems 1990; Hillis 1992; Roland and Shiman 2002]. Spatial interpretation had to be done by inversion of perspective projection of images taken from different view points (sets of two or more cameras). Intermediate interpretation steps like ‘2.5-D vision’ were chosen as a common approach [Waltz 1975; Marr 1982].
Fundamentally in contrast to this generally accepted procedure needing storage of image sequences for real-time vision, ‘dynamic vision’ started from recursive estimation algorithms [Kalman 1960; Luenberger 1964] with spatiotemporal models for motion processes of objects to be observed [Dickmanns 1987]; perspective projection was considered just a nonlinear measurement process with some process states not directly accessible (like depth or range). Since 3-D space was an integral part of the model together with constraints to temporal changes (time as fourth dimension), the approach was dubbed ‘4-D approach’ to real-time vision. One big advantage right from the beginning was that no previous images needed be stored, even for stereo interpretation (motion stereo).
State of recursive estimation in the AI- and Computer Science community (1987)
At that time, Kalman filtering had acquired a bad reputation for image sequence evaluation. The temporal continuity models typically applied were formulated in the image plane (2-D) and not for the physical world where they really hold; feature speed was assumed constant in each direction in the image, usually, and for adaptation a noise term was added. The nonlinear perspective mapping conditions were locally approximated, and the entries in the Jacobian matrices were selected as some reasonably looking constants. The convergence process was claimed to be brittle and not robust. Disappearance of features was dubbed a ‘catastrophic event’ and not considered as a normal event in a more complex situation with self-occlusion by rotation of the body or by motion behind another object. So the performance achieved by our group in the early 1980s with realistic physical models including full perspective mapping (pinhole camera) as nonlinear measurement model in a real-time process came as a surprise.
The 4-D approach to dynamic vision
Space with three orthogonal components and time are the four dimensions (4-D) for motion in six degrees of freedom (dof); this is the sufficiently general formulation for our everyday (meso-scale) world. [Only in both galactic and nano-scale spaces (processes) have higher dimensions shown advantages in modeling.] − In order to have the entire team think in terms of real-time process observation, a lower limit for image interpretation frequency has been set to about 10 Hz right from the beginning (0.1 second for each frame in a sequence analyzed). The vision tasks had to be selected correspondingly, but it soon turned out that the short time intervals available allowed developing a different look at visual perception and process control:
Applications to ground-, air- and space vehicles have been developed at UniBwM over the last quarter of the 20th century. The following visualization of the 4-D approach has emerged over the years
In the meantime, using recursive estimation with sufficiently correct dynamic (physical) models and perspective mapping has become standard and commonplace around the world in real-time vision; the term ‘4-D approach’ is rarely used though the approaches follow the same ideas, essentially. In a stationary environment, where all objects observed except the vehicle carrying the camera remain in fix positions and orientations over time, the general problem reduces to the task domain in the field of geodesy well known under the label ‘photogrammetry’: Given several snapshots of an environment, what are the locations of objects seen in the images, and what are the locations from which the snapshots were taken. For image sequences taken with a camera on a moving vehicle (or carried by a person), some constraints on (vehicle) motion may be exploited in addition; for this field a new name was coined in the early 1990’s.
SLAM and 6D – SLAM: In the early 1990s, the label ‘Simultaneous Localization And Mapping’ (short SLAM) was propagated [Leonard and Durrant-Whyte 1991] for a specialized set of tasks [Ayache and Faugeras 1989]. It addresses the long existing challenge in the field of photogrammetry: Given a sequence of images from a static environment, determine the relative location of the camera and of objects in the images.
Initially, only translational variables and one angle around the vertical axis for (the vehicle carrying) the camera have been used; motion models for camera movements were predominantly kinematic [Guivant and Nebot 2001; Montemerlo 2003; Haehnel et al. 2003; Nieto et al. 2003]. For the objects observed, range and bearing (or corresponding Cartesian coordinates) have been introduced as state variables to be iterated. Changes of aspect conditions on visual appearance have been neglected initially.
Later on, when object orientation around up to three
axes or velocity components were taken into account additionally, the term “6-D
SLAM” has been proposed and widely accepted [Nuechter et al. 2004; Nuechter
et al. 2006]; the capital letter
“D” is not to be confused with ‘Dimension’ (as in 4-D) but stands for ‘Degrees-of-freedom’ (in engineering
terms generally dubbed ‘dof’). But also in
6D-SLAM mechanical motion is not modelled as a second-order process with
position and speed components as state variables (according to
Advantages of the 4-D approach
The essential aspect
When both the body carrying the camera and other objects observed through images are moving, inertial sensing of linear accelerations and rotation rates of the own body allows separating eigenmotion from object motion. Together with recursive visual estimation of motion of the objects relative to the projection center of the camera, one can arrive at reasonable estimates for the absolute motion of objects. However, since egostate estimation by integration of inertial signals over extended periods of time tends to generate drift errors caused by sensor noise, stabilization by vision would be most welcome. If stationary objects can be recognized in the image sequence, they allow this low-frequency stabilization for which the delay time in image understanding (usually 100 milliseconds or more, i.e. several video cycles) is much less serious than for full state estimation at video rate.
In addition, inertial signals contain the acceleration effects of perturbations on the body immediately, while in vision the effects of perturbations show up with relatively large delays. The reason is that perspective mapping is based on position as the second integral of accelerations. – Nature has developed and implemented this principle in vertebrate vision for millions of years.
There is a second beneficial effect to sensing of rotation rates: If these signals measured right at the location of the vision sensor are used to control gaze direction by negative feedback, the angular perturbations on the vision sensors can be reduced by more than an order of magnitude [Schiehlen 1995; Dickmanns 2007, page 382].
Video 16 → VaMoRs BrakingGazeStabilization
Note that the effects of position changes of a camera (translational states) are much less pronounced in an image for objects further away (needed for angular stabilization); therefore, usually the acceleration signals are taken only for speed estimation and for improving relative state estimation to objects nearby.
Once gaze control is available in the vision system, it can also be used to lock gaze onto a moving object by direct feedback of visual features. This allows reducing motion blur for this object (though inducing more blur for other objects moving in opposite direction); in a well developed vision system this leads to active attention control depending on the situation encountered. In this case, multi-focal vision systems with different focal lengths simultaneously in operation are of advantage. The internal representation of speed components for all objects of interest can improve attention control by short-term predictions and longer-term expectations based on typical motion behavior for object classes.
All these aspects
together have led to the development of “Expectation-based, Multi-focal, Saccadic (
References (in temporal order)
Luenberger DG (1964). Observing the state of a linear system. IEEE Trans. on Military Electronics 8: 74-80; 290-293.
(1975). Understanding line drawings of scenes with shadows.
In: The Psychology of Computer Vision,
Winston, PH (ed),
Marr D (1982). Vision. Freeman,
Meissner HG, Dickmanns
ED (1983). Control of an Unstable Plant by Computer Vision.
In T.S. Huang (ed): Image Sequence Processing and
Dynamic Scene Analysis.
Klass P.J. (1985): DARPA Envisions New Generation of Machine Intelligence. Aviation Week & Space Technology, April: 47–54
Kuan D., Phipps G., Hsueh
A.C. (1986): A real time road following vision system for autonomous vehicles.
Proc. SPIE Mobile Robots Conf., 727,
Thorpe C., Hebert M., Kanade T., Shafer S. (1987): Vision and navigation for the CMU Navlab. Annual Review of Computer Science, Vol. 2
Dickmanns ED (1987). 4-D
Dynamic Scene Analysis with Integral Spatio-Temporal Models. 4th Int. Symposium on Robotics Research,
Smith R, Self M, Cheeseman P (1987). A stochastic map for
uncertain spatial relationships. Autonomous
Ayache N, Faugeras O (1989). Maintaining a representation of the environment of a mobile robot. IEEE Transactions on Robotics and Automation, 5(6), pp 804-819.
Scudder M., Weems C.C. (1990): An Apply Compiler for the CAAPP. Tech.
Leonard JJ, Durrant-Whyte HF (1991). Simultaneous map building and localization for an autonomous mobile
robot. In: Proceedings of the IEEE Int. Workshop on Intelligent Robots
Hillis W.D. (1992) (6th
printing): The Connection Machine. MIT Press,
Schiehlen J (1995). Kameraplattformen für aktiv sehende Fahrzeuge. Dissertation, UniBwM / LRT. Also as Fortschritts-berichte VDI Verlag, Reihe 8, Nr. 514
Guivant J, Nebot E (2001). Optimization of the simultaneous localization and map building algorithm for real-time implementation. IEEE Transactions on Robotics and Automation, 17, pp 41-76.
Roland A., Shiman P. (2002): Strategic Computing: DARPA and the Quest for Machine Intelligence, 1983–1993. MIT Press
M (2003). FastSLAM: A Factored Solution to the
Simultaneous Localization and Mapping Problem with Unknown Data Association. PhD thesis, Robotics Institute,
Nieto J, Guivant J, Nebot E, Thrun S (2003). Real time data association for Fast-SLAM. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).
Haehnel D, Fox D, Burgard W, Thrun S (2003). A highly efficient FastSLAM algorithm for generating cyclic maps of large-scale environments from raw laser range measurements. In Proc. of the Conference on Intelligent Robots and Systems (IROS).
Nuechter A, Surmann
H, Lingemann K, Hertzberg J, Thrun
S (2004). 6D SLAM with an Application in Autonomous Mine
Mapping. In Proceedings of the IEEE International Conference on Robotics and
Automation (ICRA ’04), pages 1998 – 2003,
Nüchter A, Lingemann
K, Hertzberg J (2006). Extracting Drivable
Surfaces in Outdoor 6D SLAM. In Proceedings of the
37th International Symposium on Robotics (ISR ’06) and Robotik
Dickmanns ED (2007) Dynamic
Vision for Perception and Control of Motion. Springer-Verlag,