Beram, R., El-Kotory, A. (2024). Simulation-Based Assessment of Classification Methods: Statistical Models vs. Machine Learning Algorithms. The Egyptian Statistical Journal, 68(1), 91-124. doi: 10.21608/esju.2024.260404.1025
Reham Beram; Ahmed El-Kotory. "Simulation-Based Assessment of Classification Methods: Statistical Models vs. Machine Learning Algorithms". The Egyptian Statistical Journal, 68, 1, 2024, 91-124. doi: 10.21608/esju.2024.260404.1025
Simulation-Based Assessment of Classification Methods: Statistical Models vs. Machine Learning Algorithms
Department of Statistics, Mathematics and Insurance, Faculty of Business, Alexandria University, Alexandria, Egypt.
Abstract
Existing studies have evaluated the effectiveness of classification techniques primarily on real datasets whose statistical features are unreported or unknown. This simulation-based study compares the performance of statistical models (logistic regression, probit regression, and discriminant analysis) with that of machine learning algorithms (support vector machines, classification and regression trees, and k-nearest neighbors) to assess their suitability for classification tasks. Simulated datasets are used so that their statistical characteristics can be controlled, and the real Pima Indian Diabetes dataset is used to verify the study findings. The outcomes of this study can guide practitioners and researchers in selecting the most appropriate modeling technique for their specific needs, ultimately enhancing the accuracy and reliability of classification across various domains. The results revealed that the two statistical models, probit and logit, outperformed the other methods in most simulation scenarios. The logit and probit regression models, both well grounded in theory, yielded the most accurate predictions in 78.5% and 83.6% of the simulated scenarios, respectively. The probit model performed best when the binary response variable was balanced (τ=0.50) and when it was highly imbalanced (τ=0.90). On the real dataset, the performance metrics indicate that the logit model, followed by the probit model, gave the best predictions, which agrees with the outcome of the simulation study.
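As an illustration only, a simulation-based comparison of this kind can be sketched in plain Python. The sketch is not the authors' design. The data-generating process (a single Gaussian predictor with class means of -1 and +1), the reading of τ as the proportion of observations in class 1, the sample sizes, and all function names are assumptions made for this example. Only one statistical model (logit) and one machine learning algorithm (k-nearest neighbors) are included.

```python
import math
import random


def simulate(n, tau, rng):
    """Draw n (x, y) pairs. ASSUMPTION: y = 1 with probability tau,
    and x ~ Normal(+1, 1) when y = 1, Normal(-1, 1) when y = 0."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < tau else 0
        x = rng.gauss(1.0 if y == 1 else -1.0, 1.0)
        data.append((x, y))
    return data


def fit_logit(train, steps=500, lr=0.1):
    """Fit a one-predictor logistic regression by gradient ascent
    on the average log-likelihood. Returns (intercept, slope)."""
    b0 = b1 = 0.0
    n = len(train)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in train:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def predict_logit(coef, x):
    """Predict class 1 when the fitted log-odds are positive."""
    b0, b1 = coef
    return 1 if b0 + b1 * x > 0 else 0


def predict_knn(train, x, k=5):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    votes = sum(y for _, y in nearest)
    return 1 if 2 * votes > k else 0


def accuracy(predict, test):
    """Share of test points whose predicted class matches the true class."""
    hits = sum(1 for x, y in test if predict(x) == y)
    return hits / len(test)


def run_scenario(tau, n_train=200, n_test=200, seed=0):
    """Simulate one scenario and return test accuracy for both methods."""
    rng = random.Random(seed)
    train = simulate(n_train, tau, rng)
    test = simulate(n_test, tau, rng)
    coef = fit_logit(train)
    return {
        "logit": accuracy(lambda x: predict_logit(coef, x), test),
        "knn": accuracy(lambda x: predict_knn(train, x), test),
    }


if __name__ == "__main__":
    # The two response proportions mentioned in the abstract.
    for tau in (0.50, 0.90):
        print(tau, run_scenario(tau))
```

A single run per scenario says little by itself. A fuller study would repeat each scenario many times, vary further data characteristics, and report several performance metrics, since accuracy alone can be misleading when the response is highly imbalanced.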