TY - GEN
T1 - Comprehensive Performance Analysis of PySpark and Pandas for Classification and Clustering Task
AU - Tufani, Rojer
AU - Sen, Snigdha
AU - Chakraborty, Pavan
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
N2 - This article presents a comprehensive performance evaluation of PySpark and Pandas, employing machine learning algorithms for clustering and classification. We have also implemented a data pipeline in PySpark. For the clustering task, we applied K-Means and Gaussian Mixture Model (GMM) to the SDSS15 dataset using both PySpark and Pandas. The work compares training time and Silhouette scores, shedding light on the effectiveness and performance of each approach. In the classification domain, Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting Tree (GBT) algorithms are applied to the Credit Card Fraud Detection dataset using the PySpark and Pandas libraries. Here, training time and classification accuracy are compared, revealing the strengths and weaknesses of each implementation. The findings of this study highlight the trade-offs between the PySpark and Pandas libraries and offer useful details regarding the benefits and drawbacks of each tool for handling challenging machine learning tasks. Overall, this work offers comprehensive guidance to practitioners searching for optimal algorithms for clustering and classification tasks in a single-node environment.
UR - https://www.scopus.com/pages/publications/105018802649
U2 - 10.1007/978-981-96-4008-9_19
DO - 10.1007/978-981-96-4008-9_19
M3 - Conference contribution
AN - SCOPUS:105018802649
SN - 9789819640072
T3 - Lecture Notes in Networks and Systems
SP - 251
EP - 262
BT - Advances in Health Informatics, Intelligent Systems, and Networking Technologies - Proceedings of HINT 2024
A2 - Jeyabose, Andrew
A2 - Balas, Valentina Emilia
A2 - Fernandes, Steven L.
PB - Springer Science and Business Media Deutschland GmbH
T2 - International Conference on Health Informatics, Intelligent Systems, and Networking Technologies, HINT 2024
Y2 - 13 March 2024 through 14 March 2024
ER -