Session AV Speech:

Audiovisual Speech: Analysis, Synthesis, Perception and Recognition

Type: special
Chair: Sascha Fagel
Date: Thursday - August 09, 2007
Time: 16:00
Room: 3 (Yellow)

 

AV Speech-1 AUDIOVISUAL SPEECH: ANALYSIS, SYNTHESIS, PERCEPTION, AND RECOGNITION
Sascha Fagel, Berlin University of Technology
  In many cases, research in the fields of audiovisual speech analysis, synthesis, perception and (automatic) recognition is carried out separately, with only limited attention paid to the neighboring areas. The author argues that these neighboring areas hold a large, currently untapped potential to improve and better understand the field under investigation, and that human speech as a phenomenon should be viewed from a more holistic point of view. This paper briefly surveys the fields of audiovisual speech research and tries to identify existing links between them, as well as opportunities for future collaboration of mutual benefit.
AV Speech-2 AUDITORY-VISUAL SPEECH ANALYSIS: IN SEARCH OF A THEORY
Christian Kroos, MARCS Auditory Laboratories, University of Western Sydney
  In the last decade, auditory-visual speech analysis has benefited greatly from advances in face motion measurement technology. Devices and systems have become more widespread, more versatile, easier to use and cheaper. Statistical methods for handling the multichannel data returned by face motion measurements are readily available. However, no comprehensive theory, or at the very least a common framework, to guide auditory-visual speech analysis has emerged. This paper proposes that Articulatory Phonology [3], developed by Browman and Goldstein for auditory-articulatory speech production, is capable of filling this gap. Benefits and problems are discussed.
AV Speech-3 AUDIOVISUAL SPEECH SYNTHESIS
Barry-John Theobald, School of Computing Sciences, University of East Anglia
  The ultimate goal of audiovisual speech synthesis is to create a machine that is able to articulate human-like audiovisual speech from text. There has been much interest in producing such a system over the last few decades and current state-of-the-art systems can generate very realistic synthesised speech. This paper presents a broad overview of audiovisual speech synthesis and considers possible future directions.
AV Speech-4 SPEECH STRUCTURE DECISIONS FROM SPEECH MOTION COORDINATIONS
Marie-Agnès Cathiard, Université Stendhal
Christian Abry, Université Stendhal
  Supporters of speech as an essentially motion phenomenon have overlooked the evidence, coming from static phases in the “elastic speech” flow, that these phases can give direct access to speech structures at their best. In fact, the very name of the Structure-from-Motion (SfM) problem implies that motion serves only to recover structures when they are undersampled. By combining SfM with Multistable Perception, we reinforce the claim that the changes in the perceiver's mind when viewing stationary or repetitive audiovisual speech motion displays are perceptual decisions about changes in structure, rather than simple low-level decisions about changes in motion direction. The outcome of this quest for speech structure recovery is that, contrary to other perception domains, where scientists are still searching for stabilizing biases, the very temporal unfolding of speech coordinations (and of non-human primate calls) provides neural control biases for free, within their natural integrative time windows.
AV Speech-5 AUDIOVISUAL SPEECH RECOGNITION WITH ARTICULATOR POSITIONS AS HIDDEN VARIABLES
Mark Hasegawa-Johnson, University of Illinois at Urbana-Champaign
Karen Livescu, Massachusetts Institute of Technology
Partha Lal, University of Edinburgh
Kate Saenko, Massachusetts Institute of Technology
  Speech recognition, by both humans and machines, benefits from visual observation of the face. It has often been noticed, however, that the audible and visible correlates of a phoneme may be asynchronous; perhaps for this reason, automatic speech recognition structures that allow asynchrony between the audible phoneme and the visible viseme outperform recognizers that allow no such asynchrony. This paper proposes, and tests using experimental speech recognition systems, a new explanation for audiovisual asynchrony. We propose that audiovisual asynchrony may be the result of asynchrony between the gestures implemented by different articulators. The proposed model of audiovisual asynchrony is tested by implementing an "articulatory-feature model" audiovisual speech recognizer with multiple hidden state variables, each representing the gestures of one articulator. The proposed system performs as well as a standard audiovisual recognizer on a digit recognition task; the best results are achieved by combining the outputs of the two systems.
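
  As a rough illustration of the modelling idea in this abstract (not the authors' actual recognizer), the Python sketch below performs Viterbi decoding over a factored state space in which two hypothetical articulator streams, a visible "lips" stream scored against video and a largely audible "tongue" stream scored against audio, each advance through their own gesture sequence while their indices are kept within a bounded asynchrony. The two-stream factorization, the gesture labels and the observation scorers are all assumptions made for this example; it only illustrates the notion of per-articulator hidden state variables with limited desynchronization.

# Illustrative sketch only: two per-articulator hidden streams with bounded asynchrony.
# Assumes one gesture target per phone for each articulator, so both sequences have equal length.
from typing import Callable, List, Sequence, Tuple

def decode_async(
    n_frames: int,
    lip_targets: Sequence[str],                # hypothetical lip gestures, one per phone
    tongue_targets: Sequence[str],             # hypothetical tongue gestures, one per phone
    score_video: Callable[[int, str], float],  # log p(video frame t | lip gesture)
    score_audio: Callable[[int, str], float],  # log p(audio frame t | tongue gesture)
    max_async: int = 1,                        # max phone-index lag between the two streams
) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the best path log-score and the per-frame (lip_index, tongue_index) path."""
    NEG = -1e30
    L, T = len(lip_targets), len(tongue_targets)

    def states():
        # valid joint states: each stream points at a phone, lag bounded by max_async
        for i in range(L):
            for j in range(T):
                if abs(i - j) <= max_async:
                    yield (i, j)

    delta = {s: NEG for s in states()}         # best log-score ending in each state
    back = [dict() for _ in range(n_frames)]   # backpointers per frame
    delta[(0, 0)] = score_video(0, lip_targets[0]) + score_audio(0, tongue_targets[0])

    for t in range(1, n_frames):
        new_delta = {}
        for (i, j) in states():
            best, arg = NEG, None
            # each stream independently holds its gesture or advances by one phone
            for pi in (i, i - 1):
                for pj in (j, j - 1):
                    prev = delta.get((pi, pj), NEG)
                    if prev > best:
                        best, arg = prev, (pi, pj)
            emit = score_video(t, lip_targets[i]) + score_audio(t, tongue_targets[j])
            new_delta[(i, j)] = best + emit
            back[t][(i, j)] = arg
        delta = new_delta

    end = (L - 1, T - 1)                       # both streams must reach their final gesture
    path = [end]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return delta[end], path

if __name__ == "__main__":
    import random
    random.seed(0)
    lips = ["closed", "open", "rounded"]       # made-up gesture labels
    tongue = ["low", "high", "back"]
    noisy = lambda t, g: -random.random()      # placeholder log-likelihoods
    print(decode_async(8, lips, tongue, noisy, noisy, max_async=1))

  Setting max_async to 0 forces the two streams to move in lock-step, which corresponds to the recognizers the abstract describes as allowing no asynchrony; n_frames must be at least as long as the gesture sequences so that both streams can reach their final targets.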
