Abstract
Audiovisual speech recognition is an emerging research topic. Lipreading is the recognition of what someone is saying from visual information, primarily lip movements. In this study, we created a custom dataset for Indian English and organized the work into three main tasks: (1) audio recognition, (2) visual feature extraction, and (3) combined audiovisual recognition. Audio features were extracted as mel-frequency cepstral coefficients (MFCCs) and classified with a one-dimensional convolutional neural network (1D CNN). Visual features were extracted with Dlib, and visual speech was classified with a long short-term memory (LSTM) recurrent neural network. Finally, the two modalities were integrated using a deep convolutional network. Audio speech in Indian English was recognized with training and testing accuracies of 93.67% and 91.53%, respectively, after 200 epochs. Visual speech recognition on the Indian English dataset reached a training accuracy of 77.48% and a testing accuracy of 76.19% after 60 epochs. After integration, audiovisual speech recognition achieved training and testing accuracies of 94.67% and 91.75%, respectively.
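The abstract does not specify libraries or hyperparameters, so the following is a minimal sketch of the audio branch it describes (MFCC features fed to a 1D CNN), assuming librosa for feature extraction and Keras/TensorFlow for the classifier; the sample rate, frame count, and layer sizes are illustrative assumptions, not the authors' configuration.

```python
# Hedged sketch of the audio branch: MFCCs -> 1D CNN classifier.
# Library choices and hyperparameters are assumptions for illustration.
import librosa
import numpy as np
from tensorflow.keras import layers, models

def extract_mfcc(wav_path, n_mfcc=13, max_frames=100):
    """Load a clip and return a fixed-size (max_frames, n_mfcc) MFCC matrix."""
    signal, sr = librosa.load(wav_path, sr=16000)  # assumed sample rate
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)
    # Pad or truncate so every clip has the same time dimension.
    if mfcc.shape[0] < max_frames:
        mfcc = np.pad(mfcc, ((0, max_frames - mfcc.shape[0]), (0, 0)))
    return mfcc[:max_frames]

def build_audio_cnn(num_classes, max_frames=100, n_mfcc=13):
    """1D convolutions over the time axis of the MFCC matrix."""
    return models.Sequential([
        layers.Input(shape=(max_frames, n_mfcc)),
        layers.Conv1D(64, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, kernel_size=3, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_audio_cnn(num_classes=10)  # class count is a placeholder
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```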
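For the visual branch, a plausible reading is per-frame lip landmarks from Dlib's 68-point face model (points 48-67 cover the mouth) fed as a sequence to an LSTM. The predictor file name, sequence length, and layer sizes below are assumptions; the paper only states that Dlib and an LSTM were used.

```python
# Hedged sketch of the visual branch: Dlib lip landmarks -> LSTM classifier.
import cv2
import dlib
import numpy as np
from tensorflow.keras import layers, models

detector = dlib.get_frontal_face_detector()
# Standard pre-trained 68-point predictor; file name is an assumption.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_landmarks(frame):
    """Return the 20 mouth landmarks (x, y) of the first detected face, flattened."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Points 48-67 are the mouth region in the 68-point annotation scheme.
    return np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                    dtype=np.float32).flatten()  # 40 values per frame

def build_visual_lstm(num_classes, seq_len=30, feat_dim=40):
    """LSTM over the per-frame lip-landmark vectors of a clip."""
    return models.Sequential([
        layers.Input(shape=(seq_len, feat_dim)),
        layers.LSTM(128),
        layers.Dense(num_classes, activation="softmax"),
    ])
```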
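The abstract says integration was performed with a deep convolutional network but does not describe the fusion strategy. The sketch below assumes feature-level fusion: each modality is convolved and pooled, the resulting vectors are concatenated, and dense layers produce the final prediction. This is one common design, not necessarily the authors' architecture.

```python
# Hedged sketch of audiovisual fusion: convolve each branch, concatenate,
# then classify. The fusion point and layer sizes are assumptions.
from tensorflow.keras import layers, models

def build_av_fusion(num_classes, audio_shape=(100, 13), visual_shape=(30, 40)):
    audio_in = layers.Input(shape=audio_shape)       # MFCC sequence
    a = layers.Conv1D(64, 3, activation="relu")(audio_in)
    a = layers.GlobalMaxPooling1D()(a)

    visual_in = layers.Input(shape=visual_shape)     # lip-landmark sequence
    v = layers.Conv1D(64, 3, activation="relu")(visual_in)
    v = layers.GlobalMaxPooling1D()(v)

    fused = layers.Concatenate()([a, v])             # feature-level fusion
    out = layers.Dense(128, activation="relu")(fused)
    out = layers.Dense(num_classes, activation="softmax")(out)
    return models.Model([audio_in, visual_in], out)
```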
Original language | English |
---|---|
Pages (from-to) | 25-34 |
Number of pages | 10 |
Journal | Data Science and Management |
Volume | 7 |
Issue number | 1 |
DOIs | |
Publication status | Published - 03-2024 |
All Science Journal Classification (ASJC) codes
- Management Information Systems
- Information Systems
- Computer Science Applications
- Management Science and Operations Research
- Information Systems and Management
- Artificial Intelligence