Abstract
Audiovisual speech recognition is an emerging research topic. Lipreading is the recognition of speech from visual information, primarily lip movements. In this study, we created a custom dataset of Indian English speech and divided the recognition task into three parts: (1) audio recognition, (2) visual feature extraction, and (3) combined audio and visual recognition. Audio features were extracted as mel-frequency cepstral coefficients and classified using a one-dimensional convolutional neural network. Visual features were extracted using Dlib, and visual speech was classified using a long short-term memory recurrent neural network. Finally, the audio and visual streams were integrated using a deep convolutional network. Audio speech in Indian English was recognized with training and testing accuracies of 93.67% and 91.53%, respectively, after 200 epochs. For visual speech recognition on the Indian English dataset, the training accuracy was 77.48% and the testing accuracy was 76.19% after 60 epochs. After integration, the training and testing accuracies of audiovisual speech recognition on the Indian English dataset were 94.67% and 91.75%, respectively.
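To make the audio front end concrete, the following is a minimal NumPy sketch of how mel-frequency cepstral coefficients can be computed from a waveform. It is illustrative only: the abstract does not state the frame length, hop size, FFT size, number of mel filters, or number of coefficients, so every default below (16 kHz audio, 25 ms frames, 10 ms hop, 26 filters, 13 coefficients) is an assumption, and the study may well have used a library implementation instead. The function names `hz_to_mel`, `mel_to_hz`, `mel_filterbank`, and `mfcc` are hypothetical helpers, not taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale formula (O'Shaughnessy variant).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose edges are evenly spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising slope
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling slope
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_coeffs=13):
    # All defaults are assumed values; the signal must hold at least frame_len samples.
    signal = np.asarray(signal, dtype=float)
    # Pre-emphasis boosts high frequencies before analysis.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # Power spectrum of each frame (zero-padded to n_fft).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel filterbank energies, then log compression.
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)
    # Unnormalized DCT-II decorrelates the log energies; keep the first n_coeffs.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_coeffs)[:, None] * (2 * n + 1) / (2 * n_filters))
    return log_energies @ basis.T

# Usage: one second of a 440 Hz tone sampled at 16 kHz.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = mfcc(tone)
```

With these assumed settings, `features` has one row per frame and one column per coefficient, here shape `(98, 13)`. A matrix of this kind is the sort of input a one-dimensional convolutional classifier would consume; the classifier itself, the Dlib-based visual stream, and the fusion network are not sketched here because the abstract gives no architectural details.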
| Original language | English |
|---|---|
| Pages (from-to) | 25-34 |
| Number of pages | 10 |
| Journal | Data Science and Management |
| Volume | 7 |
| Issue number | 1 |
| DOIs | |
| Publication status | Published - 03-2024 |
All Science Journal Classification (ASJC) codes
- Management Information Systems
- Information Systems
- Computer Science Applications
- Management Science and Operations Research
- Information Systems and Management
- Artificial Intelligence
Fingerprint
Dive into the research topics of 'Audiovisual speech recognition based on a deep convolutional neural network'. Together they form a unique fingerprint.