Vision-Language Integration: a Double-Grounding Case

Title: Vision-Language Integration: a Double-Grounding Case
Publication Type: PhD Thesis
Year of Publication: 2005
Authors: Pastra, K
Publisher: University of Sheffield
Type of Work: PhD
Abstract

This thesis explores the issue of vision-language integration from the Artificial Intelligence perspective of building intentional artificial agents able to combine their visual and linguistic abilities automatically. While such computational vision-language integration is a sine qua non requirement for developing a wide range of intelligent multimedia systems, the deeper issue still remains in the research background: what does integration actually mean? Why is it needed in Artificial Intelligence systems, how is it currently achieved, and how far can we go in developing fully automatic vision-language integration prototypes?

Through a parallel theoretical investigation of visual and linguistic representational systems, the nature and characteristics of the subjects of this integration study, vision and language, are determined. The notion of their computational integration itself is then explored. An extensive review of the integration resources and mechanisms used in a wide range of vision-language integration prototypes leads to a descriptive definition of this integration as a process of establishing associations between images and language. The review points to the fact that state-of-the-art prototypes fail to perform real integration, because they rely on human intervention at key integration stages in order to overcome difficulties related to features that vision and language inherently lack.

An examination of these features, undertaken to discover the real need for integrating vision and language in multimodal situations, shows that intentionality-related issues play a central role in justifying integration. These features are correlated with Searle's theory of intentionality and the Symbol Grounding problem. This leads to a view of the traditionally advocated grounding of language in visual perceptions as a bi-directional, rather than one-directional, process. It is argued that vision-language integration is rather a case of double-grounding, in which linguistic representations are grounded in visual ones to gain direct access to the physical world, while visual representations, in their turn, are grounded in linguistic ones to acquire controlled access to mental aspects of the world.

Finally, the feasibility of developing a prototype able to achieve this double-grounding with minimal human intervention is explored. VLEMA is presented, a prototype that is fed automatically reconstructed building-interior scenes, which it subsequently describes in natural language. The prototype includes a number of unique features that point to new directions in building agents endowed with real vision-language integration abilities.