Comparison of Basic Statistics and Machine Learning Classification Algorithms in Kalimantan Poverty Prediction with Handling Missing Data
DOI:
https://doi.org/10.20956/j.v22i1.44488
Keywords:
Binary Logistic Regression, Extra Trees, Kalimantan, Poverty, Random Forest
Abstract
Poverty remains a major development challenge in Indonesia, including in the regencies and cities of Kalimantan, which require particular attention. Poverty is influenced by many factors. This research therefore compares the accuracy of a basic statistical model and machine learning models in predicting poverty rates, and identifies the factors that affect those rates. The contribution of this research is that the performance comparison is combined with the handling of missing data. The three models considered are binary logistic regression with backward stepwise selection, random forest, and extremely randomized trees (extra trees). The data are secondary data from Statistics Indonesia (BPS) for the five provinces of Kalimantan; in pre-processing, missing values are imputed with the k-nearest neighbor (KNN) method. The results show that binary logistic regression is the most accurate of the three models, with a balanced accuracy of 75%, ahead of random forest and extra trees. Based on this best-performing model, the study also identifies the significant predictors of the poverty rate of regencies/cities in Kalimantan: population density, average years of schooling, and per capita expenditure on food.
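The abstract names KNN imputation and balanced accuracy without giving their details. The following is a minimal illustrative sketch in plain Python, not the authors' implementation. The toy values, the function names `knn_impute` and `balanced_accuracy`, the choice `k=2`, and the distance convention (Euclidean over the columns both rows have observed) are all assumptions made for illustration only.

```python
from math import sqrt


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Assumption: distance is Euclidean over the columns both rows observed,
    normalized by the number of shared columns.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that have an observed value in column j.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap if no donor exists
                completed[i][j] = sum(nearest) / len(nearest)
    return completed


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on class 1) and specificity (recall on class 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical toy data: two predictors, one missing value in the second row.
rows = [
    [1.0, 2.0],
    [1.2, None],
    [5.0, 8.0],
    [1.1, 2.2],
]
filled = knn_impute(rows, k=2)

# Hypothetical labels: 1 = poor, 0 = not poor.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)
```

Balanced accuracy averages the per-class recall, so it is less flattering than plain accuracy when one class dominates, which is a plausible reason to report it for a poverty indicator with unequal class sizes.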
Copyright (c) 2025 Jurnal Matematika, Statistika dan Komputasi

This work is licensed under a Creative Commons Attribution 4.0 International License.

Jurnal Matematika, Statistika dan Komputasi is an Open Access journal. All articles are distributed under the terms of the Creative Commons Attribution License, which allows third parties to copy and redistribute the material in any medium or format, and to transform and build upon it, provided the original work is properly cited and its license is stated. This license allows authors and readers to use all articles, data sets, graphics, and appendices in data mining applications, search engines, websites, blogs, and other platforms, provided appropriate reference is given.




