Empirical Study of Feature Selection Methods in Regression for Large-Scale Healthcare Data: A Case Study on Estimating Dental Expenditures

  • Veena Mayya
  • Christian King*
  • Giang T. Vu
  • Varadraj Gurupur

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

The complexity and high dimensionality of healthcare data present substantial challenges in building machine learning (ML) models, given the large number of variables such as patient demographics and medical history. Effective feature selection is crucial to address issues such as increased computational resource requirements, longer training times, overfitting, and reduced model interpretability. This study evaluates a range of feature selection methods to identify the most impactful features for predicting dental expenditures using publicly available Medical Expenditure Panel Survey (MEPS) data. Sixteen ML models are assessed to determine the top-performing model, after which state-of-the-art filter, wrapper, embedded, and hybrid feature selection techniques are applied to optimize the feature set. The highest performance, in terms of the coefficient of determination (R²), is achieved using a hybrid feature selection method that combines the mutual information filter with the embedded features from the CatBoost regressor. The results indicate that the proposed system is suitable for real-time deployment even with a reduced feature set, which offers potential benefits such as minimizing the need for irrelevant and difficult-to-obtain features. Moreover, automated feature selection significantly enhances model performance, yielding an R² score of 0.86, compared with 0.73 achieved using carefully selected manual features. Additionally, to enhance the interpretability of the top-performing ML model, explanatory visualizations are employed to examine the influence of key features on predicting dental expenditures.
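
To make the hybrid idea concrete, the sketch below shows one plausible way a mutual information filter could be chained with model-derived (embedded) importances, together with the R² metric used for scoring. It is not the authors' implementation: the abstract does not say how the two stages are combined, so the two-stage ordering, the function names (`mutual_information`, `hybrid_select`, `r2_score`), the parameters `k_filter` and `k_final`, and the histogram-based mutual information estimate are all assumptions. The `importances` argument is a hypothetical stand-in for scores that would come from a fitted model such as a CatBoost regressor.

```python
import math
from collections import Counter


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def _bin(values, n_bins):
    """Map each value to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # constant column: avoid dividing by zero
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=4):
    """Crude histogram estimate of I(X; Y) in nats (assumed estimator)."""
    bx, by = _bin(x, n_bins), _bin(y, n_bins)
    n = len(x)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over occupied cells.
    return sum(
        (c / n) * math.log(c * n / (px[i] * py[j]))
        for (i, j), c in pxy.items()
    )


def hybrid_select(features, target, importances, k_filter, k_final):
    """Two-stage selection (an assumed arrangement, not the paper's).

    features:    dict mapping feature name -> list of numeric values
    target:      list of numeric target values
    importances: dict mapping feature name -> embedded importance from
                 some already-fitted model (hypothetical input)
    """
    # Stage 1 (filter): keep the k_filter features with the highest MI.
    mi = {name: mutual_information(col, target) for name, col in features.items()}
    survivors = sorted(features, key=lambda name: mi[name], reverse=True)[:k_filter]
    # Stage 2 (embedded): rank the survivors by model importance.
    ranked = sorted(survivors, key=lambda name: importances[name], reverse=True)
    return ranked[:k_final]


# Toy data, invented purely for illustration.
signal = list(range(20))
toy_features = {
    "signal": signal,
    "weak": [i % 2 for i in range(20)],
    "const": [1] * 20,
}
toy_target = [2 * v for v in signal]
toy_importances = {"signal": 0.7, "weak": 0.1, "const": 0.9}
selected = hybrid_select(toy_features, toy_target, toy_importances, 2, 1)
```

In this toy run the constant column carries no information about the target, so the filter stage discards it even though its made-up importance score is the largest. In practice a library estimator of mutual information and the importances reported by the fitted regressor would replace the simplified pieces here.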

Original language: English
Pages (from-to): 153564-153579
Number of pages: 16
Journal: IEEE Access
Volume: 12
Publication status: Published - 2024

All Science Journal Classification (ASJC) codes

  • General Computer Science
  • General Materials Science
  • General Engineering
