Public health care systems routinely collect health-related data from the population. These data can be analyzed using data mining techniques to find novel, interesting patterns that could help formulate effective public health policies and interventions. Chronic illness is rare in the population, and the effect of this class imbalance on the performance of various classifiers was studied. The objective of this work is to identify the best classifiers for class-imbalanced health datasets through a cost-based comparison of classifier performance. The popular open-source data mining tool WEKA was used to build a variety of core classifiers as well as classifier ensembles, and to evaluate the classifiers' performance. The unequal misclassification costs were represented in a cost matrix, and a cost-benefit analysis was also performed. In another experiment, sampling methods such as under-sampling, over-sampling, and SMOTE were applied to balance the class distribution in the dataset, and the resulting costs were compared. The Bayesian classifiers performed well, with high recall and a low number of false negatives, and were not affected by the class imbalance. Results confirm that the total cost of Bayesian classifiers can be further reduced using cost-sensitive learning methods. Classifiers built using the randomly under-sampled dataset showed a dramatic drop in costs and high classification accuracy.
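
The cost-based comparison and the random under-sampling step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' WEKA workflow: the function names (`confusion_counts`, `total_cost`, `random_undersample`) are hypothetical, and the 10:1 false-negative to false-positive cost ratio is an assumed value chosen only for illustration, since the abstract does not state the cost matrix used.

```python
import random


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1  # a missed case of the rare (positive) class
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn}


def total_cost(counts, cost_fn=10.0, cost_fp=1.0):
    """Total misclassification cost under an assumed cost matrix.

    Correct predictions cost nothing; cost_fn and cost_fp are
    illustrative values, not figures taken from the paper.
    """
    return counts["fn"] * cost_fn + counts["fp"] * cost_fp


def random_undersample(X, y, positive=1, seed=0):
    """Randomly drop majority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == positive]
    neg = [i for i, label in enumerate(y) if label != positive]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]
```

With unequal costs, two classifiers with the same accuracy can differ sharply in total cost, because each false negative is weighted more heavily than each false positive. This is why a classifier with high recall can come out ahead in a cost-based comparison.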

Original language: English
Pages (from-to): 2215-2222
Number of pages: 8
Journal: International Journal of Electrical and Computer Engineering
Issue number: 4
Publication status: Published - 2017

All Science Journal Classification (ASJC) codes

  • Computer Science(all)
  • Hardware and Architecture
  • Computer Networks and Communications
  • Electrical and Electronic Engineering

