TY - GEN
T1 - Performance Evaluation of Pre-Trained Models for Classification of Vocal Cord Paralysis over Vowels
AU - Jayashree Hegde, K.
AU - Manjula Shenoy, K.
AU - Devaraja, K.
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - This paper explores the use of various Deep Convolutional Neural Networks (DCNNs) and Vision Transformers (ViTs) for the classification of Vocal Cord Paralysis (VCP). Classifying VCP from spectrograms offers a non-invasive approach that merits investigation. Audio samples of the vowels /a/, /i/, and /u/ at various pitches are taken from the Saarbruecken Voice Database (SVD) and sampled at different frequencies. Using Transfer Learning (TL) with various models, an accuracy of 93% for ViT-B/16 and 88% for MobileNet is observed for the vowel /a/ at 44.1 kHz. The vowels /u/ and /i/ also performed well, with accuracies of 90% at 50 kHz and 87% at 16 kHz, respectively, demonstrating the discriminative capability of the models on an imbalanced and limited dataset and countering the notion that very large amounts of input data are needed. As a result, tuning the sampling frequency and the choice of vowel can help automate the analysis of the voice-associated pathology VCP.
AB - This paper explores the use of various Deep Convolutional Neural Networks (DCNNs) and Vision Transformers (ViTs) for the classification of Vocal Cord Paralysis (VCP). Classifying VCP from spectrograms offers a non-invasive approach that merits investigation. Audio samples of the vowels /a/, /i/, and /u/ at various pitches are taken from the Saarbruecken Voice Database (SVD) and sampled at different frequencies. Using Transfer Learning (TL) with various models, an accuracy of 93% for ViT-B/16 and 88% for MobileNet is observed for the vowel /a/ at 44.1 kHz. The vowels /u/ and /i/ also performed well, with accuracies of 90% at 50 kHz and 87% at 16 kHz, respectively, demonstrating the discriminative capability of the models on an imbalanced and limited dataset and countering the notion that very large amounts of input data are needed. As a result, tuning the sampling frequency and the choice of vowel can help automate the analysis of the voice-associated pathology VCP.
UR - https://www.scopus.com/pages/publications/85207405652
UR - https://www.scopus.com/pages/publications/85207405652#tab=citedBy
U2 - 10.1109/NMITCON62075.2024.10699034
DO - 10.1109/NMITCON62075.2024.10699034
M3 - Conference contribution
AN - SCOPUS:85207405652
T3 - 2nd IEEE International Conference on Networks, Multimedia and Information Technology, NMITCON 2024
BT - 2nd IEEE International Conference on Networks, Multimedia and Information Technology, NMITCON 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2nd IEEE International Conference on Networks, Multimedia and Information Technology, NMITCON 2024
Y2 - 9 August 2024 through 10 August 2024
ER -