Prosody based audiovisual coanalysis for coverbal gesture recognition


Despite recent advances in vision-based gesture recognition, its applications remain largely limited to artificially defined and well-articulated gesture signs used for human-computer interaction. A key reason for this is the low recognition rates for "natural" gesticulation. Previous attempts of using speech cues to reduce error-proneness of visual classification have been mostly limited to keyword-gesture coanalysis. Such scheme inherits complexity and delays associated with natural language processing. This paper offers a novel "signal-level" perspective, where prosodic manifestations in speech and hand kinematics are considered as a basis for coanalyzing loosely coupled modalities. We present a computational framework for improving continuous gesture recognition based on two phenomena that capture voluntary (coarticulation) and involuntary (physiological) contributions of prosodic synchronization. Physiological constraints, manifested as signal interruptions during multimodal production, are exploited in an audiovisual feature integration framework using hidden Markov models. Coarticulation is analyzed using a Bayesian network of naïve classifiers to explore alignment of internationally prominent speech segments and hand kinematics. The efficacy of the proposed approach was demonstrated on a multimodal corpus created from the Weather Channel broadcast. Both schemas were found to contribute uniquely by reducing different error types, which subsequently improves the performance of continuous gesture recognition. © 2005 IEEE.

Publication Title

IEEE Transactions on Multimedia