TY - GEN
T1 - Implementation of Cascade Learning using Apache Spark
AU - Mayank, Kumar
AU - Sen, Snigdha
AU - Chakraborty, Pavan
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
AB - With the exponential growth of data, many technologies have been developed to process large datasets and extract meaningful information from them. Several frameworks were developed for this purpose, and Apache Hadoop and Apache Spark are among the best in that category, having proved very useful in dealing with such large datasets. In this paper we present two approaches, both demonstrating cascade learning, which is one of the best ways to improve the accuracy of a machine learning model such as ours. The dataset considered here is cosmological redshift data. In the first approach we use Elephas, which helps us develop an end-to-end deep learning pipeline in the Apache Spark environment; once the model produces its results, we apply cascading to them. In the second approach we transform the redshift attribute into eight classes labeled 0 to 7, then build a framework using Apache Spark, apply cascade learning to it, and run our models on the modified dataset. The goal of both approaches is to improve accuracy with the help of cascading. Among the five classifiers we experimented with (Decision Tree, Random Forest, Naive Bayes, Logistic Regression, and Multilayer Perceptron), the best result came from the Decision Tree, where training accuracy improved by 0.98%, test accuracy improved by 1.24%, and test precision improved by 0.31% after cascading.
UR - https://www.scopus.com/pages/publications/85138260004
UR - https://www.scopus.com/pages/publications/85138260004#tab=citedBy
U2 - 10.1109/CONECCT55679.2022.9865798
DO - 10.1109/CONECCT55679.2022.9865798
M3 - Conference contribution
AN - SCOPUS:85138260004
T3 - 2022 IEEE International Conference on Electronics, Computing and Communication Technologies, CONECCT 2022
BT - 2022 IEEE International Conference on Electronics, Computing and Communication Technologies, CONECCT 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 IEEE International Conference on Electronics, Computing and Communication Technologies, CONECCT 2022
Y2 - 8 July 2022 through 10 July 2022
ER -