Performance Evaluation of Classification Methods on Big Data: Decision Trees, Naive Bayes, K-Nearest Neighbors, and Support Vector Machines

Justin Eduardo Simarmata; Gerhard-Wilhelm Weber; Debora Chrisinta

doi:10.20956/j.v20i3.32970

Authors

Justin Eduardo Simarmata Faculty of Teacher Training & Education, University of Timor, East Nusa Tenggara, Indonesia
Gerhard-Wilhelm Weber Faculty of Engineering Management, Poznan University of Technology (PUT), Poznań, Poland
Debora Chrisinta Faculty of Agriculture, Science and Health, University of Timor, East Nusa Tenggara, Indonesia

Keywords:

Performance Evaluation, Classification Method, Big Data

Abstract

Performance evaluation of classification methods on big data is increasingly important for addressing the challenges of data analysis at scale. This study compares four classification methods, Decision Trees (DT), Naive Bayes (NB), k-Nearest Neighbors (KNN), and Support Vector Machines (SVM), on big data, using both simulated data and real data from the ISLR package for R. The simulated data comprise two datasets: the predictor variables are normally distributed with different means and variances, and the response variable is generated in classes matched to the characteristics of the predictors, with different class proportions. The real data are drawn from two types of numeric variables and from predictor variables available in the package. Each method is evaluated at sample sizes of n = 500, n = 1000, and n = 5000. For the real data, samples are drawn at random to keep them representative. Performance is measured by accuracy. In the simulation on Dataset 1, DT and KNN are the methods whose classification quality is affected when they are applied to Big Data. In Dataset 2, however, the result for DT changes, owing to the number of classes and the class proportions in the data. The real data confirm the simulation results: the same methods show an effect on classification quality when applied to Big Data, whereas NB and SVM show no consistent effect. These observations indicate that DT and KNN have several advantages that make them suitable for Big Data.
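The evaluation design described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the study works in R, whereas this example uses only the Python standard library, and it implements just two of the four methods (KNN and a Gaussian Naive Bayes) to stay dependency-free; DT and SVM would slot into the same loop. The class proportions, means, standard deviations, the 70/30 train/test split, and k = 5 are all assumed values chosen for illustration, and every function name is hypothetical.

```python
import heapq
import math
import random
import statistics
from collections import Counter


def simulate(n, class_params, rng):
    """Draw n labelled points. class_params maps label -> (proportion, mean, sd);
    both predictors of a class are independent normals with that mean and sd."""
    labels = list(class_params)
    weights = [class_params[c][0] for c in labels]
    rows = []
    for _ in range(n):
        c = rng.choices(labels, weights=weights)[0]
        _, mu, sd = class_params[c]
        rows.append(((rng.gauss(mu, sd), rng.gauss(mu, sd)), c))
    return rows


def knn_predict(train, x, k=5):
    """Majority vote among the k training points closest to x."""
    nearest = heapq.nsmallest(k, train, key=lambda row: math.dist(row[0], x))
    return Counter(c for _, c in nearest).most_common(1)[0][0]


def nb_fit(train):
    """Gaussian Naive Bayes: class prior plus per-feature mean and variance."""
    model = {}
    for c in {label for _, label in train}:
        pts = [x for x, label in train if label == c]
        stats = [(statistics.fmean(col), statistics.pvariance(col))
                 for col in zip(*pts)]
        model[c] = (len(pts) / len(train), stats)
    return model


def nb_predict(model, x):
    """Return the class with the highest log posterior for point x."""
    def log_post(c):
        prior, stats = model[c]
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
            for v, (mu, var) in zip(x, stats))
    return max(model, key=log_post)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate(n, seed=0):
    """Simulate n points, split 70/30, and return test accuracy per method."""
    rng = random.Random(seed)
    # Assumed two-class design: (proportion, mean, sd) per class.
    params = {0: (0.6, 0.0, 1.0), 1: (0.4, 2.0, 1.5)}
    rows = simulate(n, params, rng)
    cut = int(0.7 * n)
    train, test = rows[:cut], rows[cut:]
    y_true = [c for _, c in test]
    model = nb_fit(train)
    return {
        "KNN": accuracy(y_true, [knn_predict(train, x) for x, _ in test]),
        "NB": accuracy(y_true, [nb_predict(model, x) for x, _ in test]),
    }


# Sample sizes taken from the abstract.
for n in (500, 1000, 5000):
    print(n, evaluate(n))
```

Comparing the printed accuracies across the three sample sizes mirrors the paper's question of whether a method's classification quality changes as n grows; a second parameter set with more classes or different proportions would play the role of Dataset 2.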

Author Biography

Gerhard-Wilhelm Weber, Faculty of Engineering Management, Poznan University of Technology (PUT), Poznań, Poland

Faculty of Engineering Management
