Having worked at a financial institution writing consolidation loans for individuals who could not repay their debts, I have seen that credit risk is an inherently unbalanced classification problem: good loans easily outnumber risky loans. Different techniques are therefore needed to train and evaluate models with unbalanced classes. In this analysis, credit card data is oversampled using the RandomOverSampler and SMOTE algorithms and undersampled using the ClusterCentroids algorithm. A combined over- and undersampling approach is then applied using the SMOTEENN algorithm. Next, two ensemble models, BalancedRandomForestClassifier and EasyEnsembleClassifier, are used to predict credit risk. Finally, the performance of these models is evaluated, followed by a written recommendation on whether they should be used to predict credit risk.
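To make the idea of naive random oversampling concrete, here is a minimal standard-library sketch of the concept. It is not the RandomOverSampler implementation used in the analysis; the function name `random_oversample`, the toy rows, and the seed are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Sample with replacement to fill the gap (k is 0 for the majority class).
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * (target - n))
    return out_rows, out_labels


# Toy data (made up): 8 low-risk loans and 2 high-risk loans.
rows = [[i] for i in range(10)]
labels = ["low_risk"] * 8 + ["high_risk"] * 2

X_res, y_res = random_oversample(rows, labels)
print(Counter(y_res))
```

After resampling, both classes contain 8 rows, and every added row is a copy of one of the two original high-risk rows. SMOTE differs in that it synthesizes new minority points by interpolating between neighbors rather than copying existing ones.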
- Naive Random Oversampling Results: The balanced accuracy score is 65.72%. The precision score for high risk is very low at 1%, and the recall is 62%.
- SMOTE Oversampling Results: The balanced accuracy score is 64.78%. The precision score for high risk is very low at 1%, and the recall is 68%.
- Undersampling Results: The balanced accuracy score is 54.43%. The precision score is 99%, and the recall is 40%.
- Combination (Undersampling and Oversampling) Results: The balanced accuracy score is 64.47%. The precision score is 99%, and the recall is 57%.
- Balanced Random Forest Classifier Results: The balanced accuracy score is 77.38%. The precision score is 99%, and the recall is 87%.
- Easy Ensemble AdaBoost Classifier Results: The balanced accuracy score is 93.17%. The precision score is 99%, and the recall is 94%.
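The three metrics reported in this list can all be derived from a binary confusion matrix. The sketch below shows the standard formulas, treating high risk as the positive class. The counts are invented for illustration and are not taken from this analysis; they were chosen only to show how a very low high-risk precision can coexist with a moderate recall when low-risk loans vastly outnumber high-risk ones. The function name `binary_metrics` is likewise an assumption.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute recall, precision, and balanced accuracy for the positive class.

    tp/fn: positives predicted correctly / missed
    fp/tn: negatives flagged wrongly / predicted correctly
    """
    recall = tp / (tp + fn)          # share of actual positives that were caught
    precision = tp / (tp + fp)       # share of positive predictions that were right
    specificity = tn / (tn + fp)     # recall of the negative class
    balanced_accuracy = (recall + specificity) / 2
    return {
        "recall": recall,
        "precision": precision,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical counts (NOT the project's results): 100 high-risk loans
# and 17,000 low-risk loans in the test set.
metrics = binary_metrics(tp=60, fn=40, fp=6000, tn=11000)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With these made-up counts, the model catches 60% of the high-risk loans, but fewer than 1 in 100 of the loans it flags are actually high risk, because the 6,000 false positives swamp the 60 true positives. Balanced accuracy averages the recall of the two classes, so it is not inflated by the size of the majority class the way plain accuracy would be.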
The first four models use oversampling, undersampling, or a combination of the two to rebalance the credit card data before predicting which loans are high risk. The last two are ensemble classifiers that predict whether a loan is high risk or low risk. The four resampling models have lower balanced accuracy scores (54.43% to 65.72%) than the ensemble classifiers (77.38% and 93.17%), and their recall is lower as well (40% to 68%, versus 87% and 94%). The ensemble classifiers offer the best balance of precision and recall, which is preferable in a model. Therefore, I recommend the Easy Ensemble Classifier, which has the highest balanced accuracy and recall of the six models.