TY - GEN
T1 - Chain-of-Thought Reasoning Evaluation Framework for Question Answering System
AU - Aithal, Shivani G.
AU - Rao, Abishek B.
AU - Chandrakala, C. B.
AU - Singh, Sanjay
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Question-answering (QA) systems have become crucial in healthcare, decision-making, finance, and research domains. While transformer-based language models have greatly enhanced the generation of contextually relevant answers, they often struggle to provide reliable answers, leading to hallucinations and incorrect predictions. Traditional evaluation metrics, such as exact match, precision, recall, and the F1 score, demonstrate significant limitations in comprehensively assessing the performance and capabilities of QA systems. This research introduces a commonsense reasoning-based evaluation framework for language models, leveraging the Chain-of-Thought (CoT) reasoning approach. Our proposed method integrates CoT reasoning with a GPT o1-mini model to evaluate the predictions from language models. Experiments conducted on the SQuAD 2.0 dataset demonstrate significant improvements in handling the answers generated by language models and increased reliability of the QA system. The framework offers robust and interpretable solutions, addressing critical gaps in current QA systems and ensuring reliability and transparency for practical, real-world applications.
AB - Question-answering (QA) systems have become crucial in healthcare, decision-making, finance, and research domains. While transformer-based language models have greatly enhanced the generation of contextually relevant answers, they often struggle to provide reliable answers, leading to hallucinations and incorrect predictions. Traditional evaluation metrics, such as exact match, precision, recall, and the F1 score, demonstrate significant limitations in comprehensively assessing the performance and capabilities of QA systems. This research introduces a commonsense reasoning-based evaluation framework for language models, leveraging the Chain-of-Thought (CoT) reasoning approach. Our proposed method integrates CoT reasoning with a GPT o1-mini model to evaluate the predictions from language models. Experiments conducted on the SQuAD 2.0 dataset demonstrate significant improvements in handling the answers generated by language models and increased reliability of the QA system. The framework offers robust and interpretable solutions, addressing critical gaps in current QA systems and ensuring reliability and transparency for practical, real-world applications.
UR - https://www.scopus.com/pages/publications/105006607815
U2 - 10.1109/AIDE64228.2025.10987492
DO - 10.1109/AIDE64228.2025.10987492
M3 - Conference contribution
AN - SCOPUS:105006607815
T3 - 2025 International Conference on Artificial Intelligence and Data Engineering, AIDE 2025 - Proceedings
SP - 725
EP - 730
BT - 2025 International Conference on Artificial Intelligence and Data Engineering, AIDE 2025 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 International Conference on Artificial Intelligence and Data Engineering, AIDE 2025
Y2 - 6 February 2025 through 7 February 2025
ER -