TY - GEN
T1 - ADDRESSING DATA SCARCITY IN VOICE DISORDER DETECTION WITH SELF-SUPERVISED MODELS
AU - Gupta, Rijul
AU - Madill, Catherine
AU - Gunjawate, Dhanshree R.
AU - Nguyen, Duy Duong
AU - Jin, Craig T.
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Machine learning (ML) has shown promising results in the field of voice disorder detection over the past decade. However, the diversity of recording conditions, audio content, and languages, and the scarcity of examples for each of these combinations, pose a challenge in building ML models that can reliably detect voice disorders. Recent advancements in Self-Supervised Learning (SSL) offer hope by leveraging large datasets to pretrain models and extract audio features that are highly resilient for downstream tasks. In this paper, we exhaustively explore commonly used SSL model representations to assess their suitability for the downstream task of voice disorder detection. Using a combination of Support Vector Machines (SVM) and feedforward Deep Neural Networks (DNN), we show: i) that the combination of the vowels /a/, /i/, and /u/ performs better than individual vowels; ii) that SSL-based features generalize well to out-of-domain databases; and iii) that while spectral features such as MFCCs perform comparably to SSL-based features when trained and tested on the same database, their performance deteriorates when training and testing across different databases.
AB - Machine learning (ML) has shown promising results in the field of voice disorder detection over the past decade. However, the diversity of recording conditions, audio content, and languages, and the scarcity of examples for each of these combinations, pose a challenge in building ML models that can reliably detect voice disorders. Recent advancements in Self-Supervised Learning (SSL) offer hope by leveraging large datasets to pretrain models and extract audio features that are highly resilient for downstream tasks. In this paper, we exhaustively explore commonly used SSL model representations to assess their suitability for the downstream task of voice disorder detection. Using a combination of Support Vector Machines (SVM) and feedforward Deep Neural Networks (DNN), we show: i) that the combination of the vowels /a/, /i/, and /u/ performs better than individual vowels; ii) that SSL-based features generalize well to out-of-domain databases; and iii) that while spectral features such as MFCCs perform comparably to SSL-based features when trained and tested on the same database, their performance deteriorates when training and testing across different databases.
UR - https://www.scopus.com/pages/publications/85195388238
U2 - 10.1109/ICASSP48485.2024.10446075
DO - 10.1109/ICASSP48485.2024.10446075
M3 - Conference contribution
AN - SCOPUS:85195388238
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 11866
EP - 11870
BT - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
Y2 - 14 April 2024 through 19 April 2024
ER -