Optimizing Credit Scoring Performance Using Ensemble Feature Selection with Random Forest

Authors

  • Ana Fauziah, Mathematics Education Study Program, Faculty of Teacher Training and Education, Universitas Bakti Indonesia

DOI:

https://doi.org/10.20956/j.v21i2.42032

Keywords:

Classification, credit scoring, feature selection, ensemble method, random forest

Abstract

Credit scoring plays a critical role in the financial industry, assessing the eligibility of loan applicants and mitigating credit risk. A central challenge in credit scoring modeling, however, is the large number of candidate features, which makes feature selection an essential step for improving model performance. This research proposes a hybrid ensemble feature-selection approach: feature importances are obtained from three boosting methods (XGBoost, LightGBM, and CatBoost) and then aggregated, and the resulting feature subset is used to build a predictive model with Random Forest. Experimental results show that aggregating by the intersection of the features selected by the three methods yields the best model with the fewest features, only about 11% of the total. Using fewer features not only increases the computational speed and efficiency of the model but also improves its generalization, allowing it to perform better on new data. In addition, this model shows the smallest gap between training accuracy and mean cross-validation score, indicating high stability and reliability.
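The pipeline described above can be sketched in a few lines. The paper uses XGBoost, LightGBM, and CatBoost to rank features; in this self-contained illustration, scikit-learn tree ensembles stand in for those three boosters, the dataset is synthetic, and the choice of k = 8 top features per model is an arbitrary assumption for demonstration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a credit-scoring dataset: 30 features, few informative.
X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

def top_k_features(model, X, y, k=8):
    """Fit a tree ensemble and return the indices of its k most important features."""
    model.fit(X, y)
    return set(np.argsort(model.feature_importances_)[::-1][:k])

# Step 1: each ensemble ranks the features by importance.
selections = [
    top_k_features(GradientBoostingClassifier(random_state=0), X, y),
    top_k_features(ExtraTreesClassifier(random_state=0), X, y),
    top_k_features(RandomForestClassifier(random_state=0), X, y),
]

# Step 2: aggregate by intersection -- keep only features all three agree on.
selected = sorted(set.intersection(*selections))

# Step 3: train the final Random Forest on the reduced feature set.
rf = RandomForestClassifier(random_state=0)
scores = cross_val_score(rf, X[:, selected], y, cv=5)
print(f"{len(selected)} of {X.shape[1]} features kept; "
      f"mean CV accuracy = {scores.mean():.3f}")
```

The intersection is the strictest aggregation rule, which is why it tends to produce the smallest feature subset; looser alternatives (union, majority vote) trade compactness for coverage.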

References

[1] Abellán, J., & Castellano, J. G., 2017. A comparative study on base classifiers in ensemble methods for credit scoring. Expert Systems with Applications, Vol. 73, 1–10.

[2] Arora, N., & Kaur, P. D., 2020. A Bolasso based consistent feature selection enabled random forest classification algorithm: An application to credit risk assessment. Applied Soft Computing Journal, Vol. 86, No. 105936.

[3] Bashir, S., Khattak, I. U., Khan, A., Khan, F. H., Gani, A., & Shiraz, M., 2022. A Novel Feature Selection Method for Classification of Medical Data Using Filters, Wrappers, and Embedded Approaches. Complexity, Vol. 2022. https://doi.org/10.1155/2022/8190814. [08 November 2024]

[4] Caruana, R., & Niculescu-Mizil, A., 2004. Data mining in metric space: An empirical analysis of supervised learning performance criteria. KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 69–78.

[5] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P., 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, Vol. 16, 321–357.

[6] Chen, T., & Guestrin, C., 2016. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. Association for Computing Machinery, New York.

[7] Chuang, L. Y., Yang, C. H., Wu, K. C., & Yang, C. H., 2011. A hybrid feature selection method for DNA microarray data. Computers in Biology and Medicine, Vol. 41, No. 4, 228–237.

[8] Ha, V., & Nguyen, H., 2016. Credit scoring with a feature selection approach based deep learning. 7th International Conference on Mechanical, Industrial, and Manufacturing Technologies, Vol. 54, 2016, 1–5. EDP Sciences, Les Ulis.

[9] Hand, D. J., & Till, R. J., 2001. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, Vol. 45, No. 2, 171–186.

[10] Hastie, T., Tibshirani, R., & Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer, New York.

[11] Ibragimov, B., & Gusev, G., 2019. Minimal variance sampling in stochastic gradient boosting. Advances in Neural Information Processing Systems 32: Proceedings of Annual Conference on Neural Information Processing Systems 2019. NeurIPS, Vancouver.

[12] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T. Y., 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems 30: Proceedings of Annual Conference on Neural Information Processing Systems 2017, 3147–3155. NIPS, California.

[13] Laborda, J., & Ryoo, S., 2021. Feature selection in a credit scoring model. Mathematics, Vol. 9, No. 7, 746-769.

[14] Liang, D., Tsai, C. F., & Wu, H. T., 2015. The effect of feature selection on financial distress prediction. Knowledge-Based Systems, Vol. 73, No. 1, 289–297.

[15] Rerung, R. R., 2018. Penerapan Data Mining dengan Memanfaatkan Metode Association Rule untuk Promosi Produk. Jurnal Teknologi Rekayasa, Vol. 3, No. 1, 89-98.

[16] Zhou, Z. H., 2012. Ensemble Methods: Foundations and Algorithms. CRC Press, Boca Raton.

[17] Zhou, Y., Uddin, M. S., Habib, T., & Chi, G., 2021. Feature selection in credit risk modeling: an international evidence. Economic Research-Ekonomska Istraživanja, Vol. 34, No. 2, 1–31.

[18] Zhu, T., Lin, Y., & Liu, Y., 2017. Synthetic minority oversampling technique for multiclass imbalance problems. Pattern Recognition, Vol. 72, 327-340.

Published

2025-01-12

How to Cite

Fauziah, A. (2025). Optimizing Credit Scoring Performance Using Ensemble Feature Selection with Random Forest. Jurnal Matematika, Statistika dan Komputasi, 21(2), 560–572. https://doi.org/10.20956/j.v21i2.42032

Section

Research Articles