TY - GEN
T1 - Fine Tuning and Comparing Tacotron 2, Deep Voice 3, and FastSpeech 2 TTS Models in a Low Resource Environment
AU - Gopalakrishnan, T.
AU - Imam, Syed Ayaz
AU - Aggarwal, Archit
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Text-to-speech (TTS) models generate speech from an input sequence of characters. Existing TTS systems require large, high-quality datasets and vast computational resources for training. However, most publicly available datasets do not meet such standards, and access to powerful GPUs may not always be possible. Hence, in our work, we trained and compared three TTS models, Tacotron 2, FastSpeech 2, and Deep Voice 3, on a Tesla T4 GPU using a subset of the LJSpeech-1.1 dataset. We then conducted a listening survey to analyze the performance of the models when trained on small datasets, and found that the Tacotron 2 TTS model synthesized the most realistic-sounding speech. The survey revealed that the Tacotron 2 TTS model achieved a mean opinion score (MOS) of 4.25 ± 0.17 at a 95% confidence interval, and sounded the most natural to our listeners when compared to the ground truth.
AB - Text-to-speech (TTS) models generate speech from an input sequence of characters. Existing TTS systems require large, high-quality datasets and vast computational resources for training. However, most publicly available datasets do not meet such standards, and access to powerful GPUs may not always be possible. Hence, in our work, we trained and compared three TTS models, Tacotron 2, FastSpeech 2, and Deep Voice 3, on a Tesla T4 GPU using a subset of the LJSpeech-1.1 dataset. We then conducted a listening survey to analyze the performance of the models when trained on small datasets, and found that the Tacotron 2 TTS model synthesized the most realistic-sounding speech. The survey revealed that the Tacotron 2 TTS model achieved a mean opinion score (MOS) of 4.25 ± 0.17 at a 95% confidence interval, and sounded the most natural to our listeners when compared to the ground truth.
UR - http://www.scopus.com/inward/record.url?scp=85141533190&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85141533190&partnerID=8YFLogxK
U2 - 10.1109/ICDSIS55133.2022.9915932
DO - 10.1109/ICDSIS55133.2022.9915932
M3 - Conference contribution
AN - SCOPUS:85141533190
T3 - IEEE International Conference on Data Science and Information System, ICDSIS 2022
BT - IEEE International Conference on Data Science and Information System, ICDSIS 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 IEEE International Conference on Data Science and Information System, ICDSIS 2022
Y2 - 29 July 2022 through 30 July 2022
ER -