Audiovisual speech recognition based on a deep convolutional neural network

  • Shashidhar Rudregowda
  • Sudarshan Patilkulkarni
  • Vinayakumar Ravi*
  • Gururaj H.L.*
  • Moez Krichen

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

Abstract

Audiovisual speech recognition is an emerging research topic. Lipreading is the recognition of what someone is saying from visual information, primarily lip movements. In this study, we created a custom dataset of Indian English speech and divided the recognition task into three parts: (1) audio recognition, (2) visual feature extraction, and (3) combined audio and visual recognition. Audio features were extracted as mel-frequency cepstral coefficients (MFCCs), and classification was performed using a one-dimensional convolutional neural network. Visual features were extracted using Dlib, and visual speech was classified using a long short-term memory (LSTM) recurrent neural network. Finally, the audio and visual streams were integrated using a deep convolutional network. Audio speech in Indian English was recognized with training and testing accuracies of 93.67% and 91.53%, respectively, after 200 epochs. For visual speech recognition on the Indian English dataset, the training accuracy was 77.48% and the test accuracy was 76.19% after 60 epochs. After integration, the training and testing accuracies of audiovisual speech recognition on the Indian English dataset were 94.67% and 91.75%, respectively.
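
To make the audio branch more concrete, the following is a minimal sketch of two building blocks the abstract names: the mel scale that underlies mel-frequency cepstral coefficients, and the sliding-window operation at the core of a one-dimensional convolutional layer. It is an illustration under stated assumptions, not the authors' implementation. The HTK-style mel formula, the function names, and the toy kernel values are all assumptions, since the abstract does not specify them.

```python
import math


def hz_to_mel(hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def conv1d_valid(sequence, kernel):
    """Slide a kernel along a 1-D sequence (stride 1, no padding).

    This is the cross-correlation that CNN layers conventionally call
    "convolution"; the output has len(sequence) - len(kernel) + 1 values.
    """
    n, k = len(sequence), len(kernel)
    if k > n:
        raise ValueError("kernel must not be longer than the sequence")
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(n - k + 1)
    ]


def relu(values):
    """Rectified linear activation, applied element-wise."""
    return [max(0.0, v) for v in values]


if __name__ == "__main__":
    # Hypothetical track of one cepstral coefficient over six frames.
    coefficient_track = [0.0, 1.0, 3.0, 6.0, 6.0, 2.0]
    # Hypothetical kernel that responds to rising values over time.
    rising_edge_kernel = [-1.0, 0.0, 1.0]
    print(relu(conv1d_valid(coefficient_track, rising_edge_kernel)))
```

In a trained network the kernel values are learned rather than fixed by hand, and many kernels run in parallel over all coefficient tracks. The sketch only shows how a single filter responds to a temporal pattern in the features.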

Original language: English
Pages (from-to): 25-34
Number of pages: 10
Journal: Data Science and Management
Volume: 7
Issue number: 1
Publication status: Published - 03-2024

All Science Journal Classification (ASJC) codes

  • Management Information Systems
  • Information Systems
  • Computer Science Applications
  • Management Science and Operations Research
  • Information Systems and Management
  • Artificial Intelligence
