Comprehensive Performance Analysis of PySpark and Pandas for Classification and Clustering Task

  • Rojer Tufani
  • Snigdha Sen*
  • Pavan Chakraborty
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This article presents a comprehensive performance evaluation of PySpark and Pandas using machine learning algorithms for clustering and classification. We also implement a data pipeline in PySpark. For the clustering task, we apply K-Means and the Gaussian Mixture Model (GMM) to the SDSS15 dataset using both PySpark and Pandas, and compare training time and Silhouette scores to assess the effectiveness and performance of each approach. For the classification task, Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting Tree (GBT) algorithms are applied to the Credit Card Fraud Detection dataset using both libraries; here, training time and classification accuracy are compared to identify the strengths and weaknesses of each implementation. The findings of this study highlight the trade-offs between PySpark and Pandas and detail the benefits and drawbacks of each tool for handling challenging machine learning tasks. Overall, this work offers guidance to practitioners searching for optimal algorithms for clustering and classification tasks in a single-node environment.
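The two clustering metrics named in the abstract, training time and Silhouette score, can be illustrated with a minimal, library-free sketch. This is not the authors' implementation: the abstract does not say how either metric was computed, and the helper names `time_call` and `silhouette_score`, the Euclidean distance, and the toy data below are all assumptions made for illustration.

```python
import math
import time
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")
Point = Sequence[float]


def time_call(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def silhouette_score(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient using Euclidean distance.

    Hypothetical helper: O(n^2) pairwise distances, so small data only.
    """
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, p in enumerate(points):
        # Accumulate distances from point i to the members of each cluster.
        dist_sum = {c: 0.0 for c in clusters}
        count = {c: 0 for c in clusters}
        for j, q in enumerate(points):
            if i != j:
                dist_sum[labels[j]] += math.dist(p, q)
                count[labels[j]] += 1
        own = labels[i]
        if count[own] == 0:
            continue  # singleton cluster contributes a silhouette of 0
        a = dist_sum[own] / count[own]  # mean distance within own cluster
        b = min(dist_sum[c] / count[c] for c in clusters if c != own)  # nearest other cluster
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


# Toy data (assumed): two well-separated pairs of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score, seconds = time_call(lambda: silhouette_score(points, labels))
print(f"silhouette={score:.4f} elapsed={seconds:.6f}s")
```

In a comparison like the one the abstract describes, the same timing wrapper would presumably be placed around each library's model-fitting call, so that both tools are measured in the same way.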

Original language: English
Title of host publication: Advances in Health Informatics, Intelligent Systems, and Networking Technologies - Proceedings of HINT 2024
Editors: Andrew Jeyabose, Valentina Emilia Balas, Steven L. Fernandes
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 251-262
Number of pages: 12
ISBN (Print): 9789819640072
Publication status: Published - 2025
Event: International Conference on Health Informatics, Intelligent Systems, and Networking Technologies, HINT 2024 - Manipal, India
Duration: 13-03-2024 → 14-03-2024

Publication series

Name: Lecture Notes in Networks and Systems
Volume: 1286 LNNS
ISSN (Print): 2367-3370
ISSN (Electronic): 2367-3389

Conference

Conference: International Conference on Health Informatics, Intelligent Systems, and Networking Technologies, HINT 2024
Country/Territory: India
City: Manipal
Period: 13-03-24 → 14-03-24

All Science Journal Classification (ASJC) codes

  • Control and Systems Engineering
  • Signal Processing
  • Computer Networks and Communications
