More than just active control of gaze and attention (‘active vision’):
Expectations of visual appearance are computed in real time and used for more efficient feature extraction and for understanding fast image sequences.
This requires
internal representation of models for motion in 3-D space and time (including control activities), and
modeling of perspective mapping as a nonlinear measurement process, approximated by a linearization over short time intervals to arrive at a recursive least-squares model fit for all states and parameters involved.
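A minimal sketch of this prediction-error scheme (not the original implementation) can be written as an extended Kalman filter: a 3-D motion model predicts the state, the perspective projection is linearized around the prediction, and the predicted image position is where feature search starts. The focal length, noise levels, and constant-velocity model below are illustrative assumptions.

```python
import numpy as np

f = 500.0  # assumed focal length in pixels

def project(state):
    """Pinhole projection of the point part of state [x, y, z, vx, vy, vz]."""
    x, y, z = state[:3]
    return np.array([f * x / z, f * y / z])

def jacobian(state):
    """Linearization of the perspective mapping w.r.t. the 6-D state."""
    x, y, z = state[:3]
    H = np.zeros((2, 6))
    H[0, 0] = f / z
    H[0, 2] = -f * x / z**2
    H[1, 1] = f / z
    H[1, 2] = -f * y / z**2
    return H

def ekf_step(mu, P, z_meas, dt, R, Q):
    # Prediction with a constant-velocity motion model in 3-D space.
    F = np.eye(6)
    F[:3, 3:] = dt * np.eye(3)
    mu = F @ mu
    P = F @ P @ F.T + Q
    # The predicted image position is the 'expectation of visual appearance':
    # feature extraction searches only a small window around it.
    z_pred = project(mu)
    H = jacobian(mu)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    mu = mu + K @ (z_meas - z_pred)
    P = (np.eye(6) - K @ H) @ P
    return mu, P

# Illustrative use: one video cycle of 40 ms.
mu = np.array([1.0, 0.5, 10.0, 0.0, 0.0, 0.0])
P = np.eye(6)
mu, P = ekf_step(mu, P, np.array([52.0, 26.0]), dt=0.04,
                 R=4.0 * np.eye(2), Q=1e-3 * np.eye(6))
```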
Time delays between the different sensory modalities and control output are taken into account; interpretation is synchronized.
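One hedged way to realize this synchronization: every result carries a time stamp, and each modality's (delayed) value is propagated forward to a common fusion instant with the motion model before interpretation. The delay values and signals below are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    value: float      # e.g. an estimated yaw angle in rad
    timestamp: float  # time at which the quantity was actually valid

def propagate(sample: Sample, rate: float, t_fusion: float) -> float:
    """First-order forward prediction over the sample's latency."""
    latency = t_fusion - sample.timestamp
    return sample.value + rate * latency

# Vision results are ~100 ms old, inertial data ~5 ms old (assumed delays);
# both are referred to the same fusion instant before being compared.
t_now = 10.0
visual = Sample(value=0.200, timestamp=t_now - 0.100)
inertial = Sample(value=0.209, timestamp=t_now - 0.005)
yaw_rate = 0.1  # current rate estimate used for the prediction
print(propagate(visual, yaw_rate, t_now))    # 0.210
print(propagate(inertial, yaw_rate, t_now))  # 0.2095 -- now consistent
```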
For the different elements of perception (inertial, visual, auditory, odometry, …), the best available signals have to be used! For example, because of delay times and motion blur in vision, fast (short-term) egomotion estimation and gaze stabilization should be derived from inertial signals, while long-term stability of interpretation is better served by visual feedback (integration of inertial signals suffers from drift).
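A complementary filter is the simplest illustration of this split (gains, noise levels, and the assumption that vision is available every cycle are assumptions, not values from the original system): inertial rates are integrated for fast short-term estimates, while the slower but drift-free visual estimate pulls the result back over long time scales.

```python
import numpy as np

dt = 0.01      # inertial sample period (100 Hz, assumed)
alpha = 0.98   # heavy weight on inertial integration per step
rng = np.random.default_rng(0)

true_angle = 0.0
angle_gyro = 0.0   # pure integration of the inertial rate
angle_fused = 0.0  # complementary fusion with the visual estimate
for k in range(1000):  # 10 s of simulated motion
    true_rate = 0.5 * np.sin(0.01 * k)
    true_angle += true_rate * dt
    gyro = true_rate + 0.02 + rng.normal(0.0, 0.01)  # biased, noisy rate
    vision = true_angle + rng.normal(0.0, 0.005)     # slow but unbiased
    angle_gyro += gyro * dt
    # Inertial integration dominates short-term; vision removes the drift.
    angle_fused = alpha * (angle_fused + gyro * dt) + (1 - alpha) * vision

print(f"gyro-only error: {abs(angle_gyro - true_angle):.3f} rad")   # drifts
print(f"fused error:     {abs(angle_fused - true_angle):.3f} rad")  # bounded
```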
Temporally extended maneuver elements and maneuvers are part of the knowledge base for understanding motion processes and for situation assessment (very important, often neglected in CS or AI approaches).
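Purely as an illustration of such a knowledge-base entry, a maneuver element can be stored as a named, temporally extended unit with a stereotypical control time history and entry/exit conditions (all fields and values below are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class ManeuverElement:
    name: str
    duration: float                     # nominal duration in seconds
    control_profile: str                # symbolic control time history
    entry_conditions: list = field(default_factory=list)
    exit_conditions: list = field(default_factory=list)

lane_change = ManeuverElement(
    name="lane_change_left",
    duration=4.0,
    control_profile="doublet in steering rate",
    entry_conditions=["adjacent lane free", "speed above minimum"],
    exit_conditions=["lateral offset ~ lane width", "heading realigned"],
)
```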
For animals and humans, their behavioral capabilities are essential parts of the knowledge base; behaviors should be recognizable in certain situations from small fractions of temporal action elements (intent recognition).
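A sketch of intent recognition under these assumptions: a short observed fragment of a state time history is matched against stored maneuver templates, and the best-scoring template becomes a hypothesis about the agent's intent. The templates and the distance measure are illustrative.

```python
import numpy as np

# Hypothetical lateral-offset templates, 40 samples each.
templates = {
    "lane_change_left":  np.cumsum(np.full(40, +0.02)),
    "lane_change_right": np.cumsum(np.full(40, -0.02)),
    "lane_keeping":      np.zeros(40),
}

def recognize(observed: np.ndarray) -> str:
    """Match an observed prefix against the first samples of each template."""
    n = len(observed)
    scores = {name: float(np.mean((tpl[:n] - observed) ** 2))
              for name, tpl in templates.items()}
    return min(scores, key=scores.get)

# Only a small fraction of the maneuver has been observed so far.
observed_offset = np.cumsum(np.full(8, 0.019))
print(recognize(observed_offset))  # -> "lane_change_left"
```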
Perceptual and behavioral capabilities are represented in corresponding networks that show the interdependencies across levels, down to the actual hardware components needed.
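One illustrative encoding of such a capability network: nodes are perceptual or behavioral capabilities, edges point to the capabilities or hardware components they depend on, so the availability of a high-level behavior can be traced down to individual devices (all node names are hypothetical):

```python
dependencies = {
    "lane_change":      ["lane_keeping", "lane_perception"],
    "lane_keeping":     ["lateral_control", "lane_perception"],
    "lane_perception":  ["gaze_control", "camera"],
    "gaze_control":     ["pan_tilt_head"],
    "lateral_control":  ["steering_actuator"],
    "camera": [], "pan_tilt_head": [], "steering_actuator": [],
}

def available(capability: str, failed: set) -> bool:
    """A capability is usable only if its whole dependency subtree is."""
    if capability in failed:
        return False
    return all(available(d, failed) for d in dependencies[capability])

print(available("lane_change", failed=set()))              # True
print(available("lane_change", failed={"pan_tilt_head"}))  # False
```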