MEi:CogSci Conferences, MEi:CogSci Conference 2010, Dubrovnik

Modeling joint visual attention using neural networks.
Martin Bezak

Last modified: 2010-06-11


Joint attention (JA) can be defined simply as the ability to share attention with someone else. It is also understood as redirecting one's attention to match another's focus of attention, based on the other's behaviour. Joint visual attention is defined as looking at the same object that someone else is looking at. Joint visual attention with a caregiver is one of the abilities that help infants develop their social cognitive functions. It enables infants to acquire various social capabilities, e.g. language and communication skills, and it is considered to be the first step towards the ability to share experience with others and to negotiate shared meanings. Used with a conventional symbol system, attention-sharing skills help children learn to communicate what they are perceiving or thinking about, to predict others' future behaviours from their current behaviours, and to form representations of what others are perceiving or thinking about.

There is now much experimental and observational evidence on the development of JA skills, not only in typically developing infants and children, but also in children with developmental disabilities and in non-human primates. These data show the complexity of this social skill, which is necessary for almost all social interactions. However, there is still a lack of refined theoretical models that can explain this developmental behaviour. New frameworks in theoretical neuroscience, and computational tools that formalize these frameworks, allow more powerful modelling of the learning processes and information-processing traits that allow normal human infants to learn JA behaviours, and of the distortions of these learning processes that cause the JA deficits associated with certain developmental disabilities. As new models become more sophisticated, they might offer clues to how attention-sharing behaviours emerge alongside, and interwoven with, developing language abilities [1].

JA is also a challenge for social robots, which need to perceive the world as humans do and to learn from interactions with the environment and with humans. To do that, a social robot must be able to interpret human activity and behaviour. The vision system of a social robot is responsible for tasks such as identifying faces, measuring head and hand poses, and recognizing gestures in order to emulate human social perception.

This work presents a new model in which an agent learns the JA ability in a simulated environment. As the learning process for our model we selected a biologically relevant actor-critic form of reinforcement learning (RL), implemented with artificial neural networks. RL focuses on goal-directed learning, a kind of learning from reward and punishment in which the agent seeks to perform the actions in its environment that maximize its long-term reward. RL is about learning from interaction how to behave in order to achieve a goal. The reinforcement learning agent and its environment interact over a sequence of discrete time steps. The specification of their interface defines a particular task: the actions are the choices made by the agent, the states are the basis for making the choices, and the rewards are the basis for evaluating the choices. Everything inside the agent is completely known and controllable by the agent; everything outside is incompletely controllable but may or may not be completely known [2].
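The notion of long-term reward can be made concrete with a small sketch. It assumes the standard discounted-return formulation from [2]; the abstract itself does not state how future rewards are weighted, so the function name `discounted_return` and the discount factor `gamma` are illustrative assumptions rather than part of the model.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards over discrete time steps, each weighted by gamma**k.

    `gamma` is a hypothetical discount factor (0 <= gamma <= 1); the
    abstract does not say which value, if any, the model uses.
    """
    total = 0.0
    # Work backwards so that each step applies one more factor of gamma
    # to everything that comes after it.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With `gamma` below 1, a reward received sooner counts for more than the same reward received later, which is one way an agent can be pushed to reach the correct object in fewer steps.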

To model the JA problem we chose the Continuous Actor-Critic Learning Automaton (CACLA) RL algorithm [3], which allows the model to handle a fully continuous three-dimensional space. For learning, the model uses input information based on the caregiver's gaze and performs actions in order to find the correct object in the caregiver's view direction. The CACLA module is composed of two neural networks used as function approximators. The actor network selects an action at each time step, given the current state of the environment. The purpose of the critic network is to evaluate the desirability of these actions. When the actor selects an action, the critic uses the reward from the environment and its previously learned experience to decide whether the action was good or not; in CACLA [3], an action counts as good when its outcome is better than the critic predicted, i.e. when the temporal-difference error is positive. If the action was good, the actor network is updated towards this action by error backpropagation. The critic network is updated every time an action is chosen, using the information about the reward.
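The actor and critic updates described above can be sketched as follows. This is a minimal illustration of the CACLA rule from [3], not the model's implementation: it swaps the two neural networks for linear function approximators to stay short, and the class name `LinearCacla`, the learning rates `alpha` and `beta`, the discount `gamma` and the exploration width `sigma` are all assumed values.

```python
import random


class LinearCacla:
    """CACLA with linear approximators (a simplification; the model
    described in the text uses neural networks instead)."""

    def __init__(self, n_features, n_actions,
                 alpha=0.1, beta=0.1, gamma=0.9, sigma=0.1, seed=0):
        self.critic_w = [0.0] * n_features
        self.actor_w = [[0.0] * n_features for _ in range(n_actions)]
        self.alpha = alpha    # critic learning rate (assumed)
        self.beta = beta      # actor learning rate (assumed)
        self.gamma = gamma    # discount factor (assumed)
        self.sigma = sigma    # std. dev. of Gaussian exploration (assumed)
        self.rng = random.Random(seed)

    def value(self, state):
        """Critic: estimated long-term reward of a state."""
        return sum(w * x for w, x in zip(self.critic_w, state))

    def actor_output(self, state):
        """Actor: the action currently preferred in a state."""
        return [sum(w * x for w, x in zip(row, state)) for row in self.actor_w]

    def explore(self, state):
        """Actor output plus Gaussian noise, so new actions get tried."""
        return [m + self.rng.gauss(0.0, self.sigma)
                for m in self.actor_output(state)]

    def update(self, state, action, reward, next_state, terminal=False):
        """One CACLA step; returns the temporal-difference error."""
        target = reward if terminal else reward + self.gamma * self.value(next_state)
        td_error = target - self.value(state)
        # The critic is updated after every action.
        self.critic_w = [w + self.alpha * td_error * x
                         for w, x in zip(self.critic_w, state)]
        # The actor moves towards the action only if it turned out
        # better than the critic expected (positive TD error).
        if td_error > 0:
            output = self.actor_output(state)
            for k, row in enumerate(self.actor_w):
                err = action[k] - output[k]
                self.actor_w[k] = [w + self.beta * err * x
                                   for w, x in zip(row, state)]
        return td_error
```

Here the actor is pulled towards the action that was taken, not along a value gradient. This is what lets the scheme work directly with continuous actions such as changes in view angle.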

Using this learning method, we created a simplified model of JA in which an agent (the infant) is placed in a virtual 3D environment containing dozens of virtual objects, represented as points in space. The agent has a limited field of view, so it perceives only part of the environment, but it can rotate its view to look at any position in space. The agent's task is to learn how to find the salient object that corresponds to the caregiver's gaze. The direction of the caregiver's gaze is presented to the agent as an angle value in a 2D representation of abstract information from the caregiver's face. The agent learns to rotate towards objects by selecting actions, which are represented as changes in the horizontal and vertical angles of the agent's view. The actor's neural network therefore has to learn to return correct motor commands using the input information about the caregiver's gaze, with the help of the critic network's reward evaluation. The reward is computed from the distance between the object and the center of the agent's view, but only if the object lies within the agent's view; otherwise the reward is a very small constant. The agent tries to maximize the acquired reward and thus to decrease the distance between the object and the center of its view. When the distance is sufficiently small, the agent has reached the correct object and receives a larger reward. The caregiver then looks at another object, so the agent gets new information about the caregiver's gaze direction and again has to rotate to the correct object. In this way the agent develops its sensorimotor coordination without any other external task evaluation. The model was trained on objects distributed over the whole environment for one thousand trials, so the agent had to look at a different object one thousand times; after that it was tested on novel objects at random positions.
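The reward scheme described above can be sketched as follows, under several assumptions that the abstract does not specify: the agent sits at the origin, the distance to the view center is measured as an angle, and the field-of-view half-angle, the hit threshold and the reward magnitudes are made-up values. The function names are hypothetical.

```python
import math


def view_direction(yaw, pitch):
    """Unit vector of the agent's gaze for horizontal and vertical angles (radians)."""
    return (math.cos(pitch) * math.cos(yaw),
            math.cos(pitch) * math.sin(yaw),
            math.sin(pitch))


def angle_to_object(yaw, pitch, obj):
    """Angle between the view center and an object point.

    Assumes the agent is at the origin and `obj` is not at the origin.
    """
    d = view_direction(yaw, pitch)
    norm = math.sqrt(sum(c * c for c in obj))
    cos_angle = sum(a * b for a, b in zip(d, obj)) / norm
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_angle)))


def reward(yaw, pitch, obj,
           half_fov=math.radians(30),   # assumed field-of-view half-angle
           hit_angle=math.radians(3),   # assumed "sufficiently small" distance
           outside_reward=0.01,         # assumed small constant
           hit_reward=10.0):            # assumed larger reward on success
    angle = angle_to_object(yaw, pitch, obj)
    if angle > half_fov:
        # Object is not visible: very small constant reward.
        return outside_reward
    if angle <= hit_angle:
        # Object reached: larger reward.
        return hit_reward
    # Object visible: reward grows as it nears the view center.
    return outside_reward + (1.0 - angle / half_fov)
```

For example, `reward(0.0, 0.0, (1.0, 0.0, 0.0))` gives the hit reward because the object lies exactly at the view center, while an object directly behind the agent gives only the small constant.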

Experimental results show that the model achieves about ninety percent successful JAs, both for objects inside and for objects outside the agent's field of view. The trained model is therefore able to turn to face objects in continuous space using only information about the caregiver's gaze direction, without any supervised or external evaluation.

[1] Deak, G. & Triesch, J. (2006). Origins of shared attention in human infants.
In K. Fujita & S. Itakura (Eds.), Diversity of cognition. University of Kyoto Press.

[2] Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An Introduction.
A Bradford Book, MIT Press, Cambridge, MA.

[3] van Hasselt, H. & Wiering, M. A. (2007). Reinforcement learning in continuous
action spaces. In Proceedings of the IEEE International Symposium on Adaptive
Dynamic Programming and Reinforcement Learning, pp. 272-279.