Effect of Random Undersampling, Oversampling, and SMOTE on the Performance of Cardiovascular Disease Prediction Models

Uswatun Hasanah; Agus Mohamad Soleh; Kusman Sadik

doi:10.20956/j.v21i1.35552

Authors

Uswatun Hasanah IPB University
Agus Mohamad Soleh
Kusman Sadik IPB University

Keywords:

Cardiovascular Disease, Machine Learning, Resampling Techniques

Abstract

Cardiovascular disease (CVD), commonly known as heart disease, is a leading cause of mortality globally, prompting extensive research into predictive models that assess individual risk and support the planning of preventive measures. Machine learning approaches such as Random Forest, Support Vector Machine (SVM), and LASSO Logistic Regression have shown promise. Recent studies have indicated that traditional resampling methods such as Random Oversampling, Random Undersampling, and SMOTE may not significantly improve model discrimination. This study evaluates the impact of these techniques on the performance of CVD prediction models using data from the UCI Machine Learning Heart Disease database, applying LASSO Logistic Regression, Random Forest, and SVM in combination with each of the three resampling techniques. The research seeks to improve understanding of model performance under class imbalance in the dataset. The results indicate that SMOTE significantly enhances the performance of CVD prediction models; in particular, SMOTE combined with the Random Forest algorithm achieves the best performance in terms of accuracy, sensitivity, and specificity. This highlights the importance of selecting an appropriate resampling technique to handle class imbalance. Consequently, this research contributes to refining CVD prediction strategies and provides new insights into improving prediction accuracy on imbalanced medical data.
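To make the three resampling techniques named in the abstract concrete, the following is a minimal pure-Python sketch of how each one rebalances a binary dataset. It is an illustration only and not the authors' implementation: the function names (random_undersample, random_oversample, smote), the toy data, and the choice of k = 3 neighbours are all assumptions made for this example, and the SMOTE function is a simplified version of the interpolation idea described by Chawla et al. (2002).

```python
import random
from collections import Counter


def random_undersample(X, y, rng):
    """Randomly drop rows until every class is as small as the smallest class."""
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, rng):
    """Randomly duplicate rows until every class is as large as the largest class."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            X_out.append(X[rng.choice(idx)])
            y_out.append(label)
    return X_out, y_out


def smote(X, y, minority, rng, k=3):
    """Simplified SMOTE: add synthetic minority rows by interpolating between
    a minority row and one of its k nearest minority neighbours."""
    counts = Counter(y)
    n_new = max(counts.values()) - counts[minority]
    pts = [X[i] for i, lab in enumerate(y) if lab == minority]
    X_out, y_out = list(X), list(y)
    for _ in range(n_new):
        p = rng.choice(pts)
        others = [q for q in pts if q is not p]
        # Rank the other minority rows by squared Euclidean distance to p.
        others.sort(key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)))
        q = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from p to q
        X_out.append([a + gap * (b - a) for a, b in zip(p, q)])
        y_out.append(minority)
    return X_out, y_out


# Hypothetical toy data: 5 minority rows (label 1) and 15 majority rows (label 0).
rng = random.Random(0)
X = [[float(i), float(i % 5)] for i in range(20)]
y = [1 if i < 5 else 0 for i in range(20)]

X_u, y_u = random_undersample(X, y, rng)
X_o, y_o = random_oversample(X, y, rng)
X_s, y_s = smote(X, y, minority=1, rng=rng)

print("undersampled:", Counter(y_u))
print("oversampled: ", Counter(y_o))
print("SMOTE:       ", Counter(y_s))
```

All three functions return a class-balanced dataset, but they differ in what they do to the data: undersampling discards majority rows, oversampling repeats existing minority rows, and SMOTE adds new rows that lie between existing minority rows. In a modelling workflow, resampling of this kind is normally applied to the training split only, so that the evaluation data keep the original class distribution.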
