Multi-modal speech sensing based on 2D and 3D optical and acoustic signals for identity recognition and authentication
Description
Human speech perception draws on both acoustic and visual speech information, yet most automatic recognition systems rely on only one of the two modes. In this project we will combine the strong between-speaker differences in acoustic dynamics and facial movement dynamics to build more reliable person identification systems. Facial movements are usually extracted from 2D frontal facial images. Visual 3D facial features obtained with 3D cameras improve the accuracy of facial feature extraction, especially for non-frontal facial images. One goal of this project is to develop a sensing platform that collects both the 2D and 3D optical speech characteristics and the acoustic signals. Dynamic information, including 3D facial movements and speech rhythm, will be used to improve speaker recognition and authentication. Compared to previous identity recognition methods, which usually use either the 2D face or the voice of an individual, the proposed scheme will be more robust because it is based on simultaneous visual (2D and 3D) and acoustic dynamic speech sensing. Another goal is to investigate the relationship between the acoustic dynamics of speech and 3D facial dynamics, and thus pave the way to predicting voices from faces and faces from voices. The resulting deeper understanding of the acoustic and facial dynamics of speech will have an impact on various speech technologies, including automatic speech recognition, lip modelling for speaking-face synthesis, real-time human-computer interface applications, and speech-based password authentication.
Key Data
Project lead
Project status
completed, 02/2020 - 04/2021
Funding partner
Spark / Project No. 190424