The purpose of this analysis is to predict credit risk with machine learning models, using different techniques to train and evaluate models on imbalanced classes.
Resampling Models
- Over-sampling method: using the RandomOverSampler & SMOTE algorithms
- Under-sampling method: using the ClusterCentroids algorithm
- Combination sampling method (a combination of over-sampling and under-sampling): using the SMOTEENN algorithm
Ensemble Classifier Methods
- Using the BalancedRandomForestClassifier & EasyEnsembleClassifier algorithms
Software:
- Python
- Jupyter Notebook 6.4.6
- scikit-learn library
- imbalanced-learn library
Data source:
- Credit card credit dataset from LendingClub
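A minimal sketch of how the LendingClub data might be loaded and split before resampling (the file name, target column, and encoding step are assumptions for illustration, not the notebook's exact code):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name; the actual LendingClub export may differ.
df = pd.read_csv("LoanStats_2019Q1.csv")

# Assumed target column: loan_status with "high_risk" / "low_risk" labels.
y = df["loan_status"]
X = pd.get_dummies(df.drop(columns="loan_status"))  # one-hot encode categorical features

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
```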
Results:
- Random Oversampling
- Balanced accuracy score: 66%
- Precision and recall scores (high_risk):
- The sensitivity/recall (71%) is far higher than the precision (1%)
- There are many false positives (loans predicted high risk that are actually low risk)
- This makes it a poor model for this dataset
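A minimal sketch of the random oversampling workflow with imbalanced-learn, assuming a logistic regression classifier and the train/test split from the sketch above (the classifier choice and random_state are assumptions; the report only states the resulting scores):

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.metrics import classification_report_imbalanced
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Duplicate minority-class (high_risk) rows until the classes are balanced.
ros = RandomOverSampler(random_state=1)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

model = LogisticRegression(solver="lbfgs", random_state=1)
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)

print(balanced_accuracy_score(y_test, y_pred))            # ~0.66 in this analysis
print(classification_report_imbalanced(y_test, y_pred))   # per-class precision/recall
```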
- Synthetic Minority Oversampling Technique (SMOTE)
- Balanced accuracy score: 66%
- Precision and recall scores (high_risk):
- The sensitivity/recall (63%) is far higher than the precision (1%)
- There are many false positives (loans predicted high risk that are actually low risk)
- This makes it a poor model for this dataset
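The same workflow with SMOTE swapped in; instead of duplicating minority rows, SMOTE interpolates synthetic high_risk samples from nearest neighbours (again a sketch, reusing the assumed split and logistic regression from above):

```python
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Generate synthetic minority samples instead of exact duplicates.
X_resampled, y_resampled = SMOTE(random_state=1).fit_resample(X_train, y_train)

model = LogisticRegression(solver="lbfgs", random_state=1).fit(X_resampled, y_resampled)
print(balanced_accuracy_score(y_test, model.predict(X_test)))  # ~0.66 in this analysis
```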
- Cluster Centroid Undersampling
- Balanced accuracy score: 54%
- Precision and recall scores (high_risk):
- The sensitivity/recall (69%) is far higher than the precision (1%)
- There are many false positives (loans predicted high risk that are actually low risk)
- This makes it a poor model for this dataset
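A sketch of the under-sampling run: ClusterCentroids shrinks the majority (low_risk) class by replacing it with k-means cluster centroids (same assumed split and classifier as above):

```python
from imblearn.under_sampling import ClusterCentroids
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Reduce the majority class to synthetic centroid points.
cc = ClusterCentroids(random_state=1)
X_resampled, y_resampled = cc.fit_resample(X_train, y_train)

model = LogisticRegression(solver="lbfgs", random_state=1).fit(X_resampled, y_resampled)
print(balanced_accuracy_score(y_test, model.predict(X_test)))  # ~0.54 in this analysis
```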
- Combination Sampling with SMOTEENN
- Balanced accuracy score: 64%
- Precision and recall scores (high_risk):
- The sensitivity/recall (72%) is far higher than the precision (1%)
- There are many false positives (loans predicted high risk that are actually low risk)
- This makes it a poor model for this dataset
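A sketch of the combination approach: SMOTEENN first over-samples with SMOTE, then cleans noisy points with Edited Nearest Neighbours (same assumed split and classifier as above):

```python
from imblearn.combine import SMOTEENN
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# SMOTE over-sampling followed by Edited Nearest Neighbours cleaning.
smote_enn = SMOTEENN(random_state=1)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)

model = LogisticRegression(solver="lbfgs", random_state=1).fit(X_resampled, y_resampled)
print(balanced_accuracy_score(y_test, model.predict(X_test)))  # ~0.64 in this analysis
```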
- BalancedRandomForestClassifier
- Balanced accuracy score: 79%
- Precision and recall scores (high_risk):
- The sensitivity/recall (70%) is far higher than the precision (3%)
- There are many false positives (loans predicted high risk that are actually low risk)
- This makes it a poor model for this dataset
- total_rec_prncp and total_pymnt are the most important features (columns) in the credit dataset
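A sketch of the BalancedRandomForestClassifier run, which handles the imbalance internally by training each tree on a balanced bootstrap sample, plus the feature-importance ranking that surfaced total_rec_prncp and total_pymnt (n_estimators and random_state are assumptions):

```python
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import balanced_accuracy_score

# No separate resampling step: each tree is fit on a balanced bootstrap sample.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=1)
brf.fit(X_train, y_train)
y_pred = brf.predict(X_test)

print(balanced_accuracy_score(y_test, y_pred))            # ~0.79 in this analysis
print(classification_report_imbalanced(y_test, y_pred))

# Rank features by importance; total_rec_prncp and total_pymnt ranked highest here.
top_features = sorted(zip(brf.feature_importances_, X_train.columns), reverse=True)
print(top_features[:5])
```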
- EasyEnsembleClassifier
- Balanced accuracy score: 93%
- Precision and recall scores (high_risk):
- The sensitivity/recall (92%) is far higher than the precision (9%)
- There are many false positives (loans predicted high risk that are actually low risk)
- This makes it a poor model for this dataset
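A sketch of the EasyEnsembleClassifier run, an ensemble of AdaBoost learners each trained on a balanced resample of the training data (n_estimators and random_state are assumptions):

```python
from imblearn.ensemble import EasyEnsembleClassifier
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import balanced_accuracy_score

# Bag of AdaBoost learners, each fit on a balanced bootstrap sample.
eec = EasyEnsembleClassifier(n_estimators=100, random_state=1)
eec.fit(X_train, y_train)
y_pred = eec.predict(X_test)

print(balanced_accuracy_score(y_test, y_pred))            # ~0.93 in this analysis
print(classification_report_imbalanced(y_test, y_pred))
```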
Even though the EasyEnsembleClassifier algorithm has the highest balanced accuracy score (93%), neither it nor any of the other algorithms is good enough to determine whether a credit is high risk: in every case the sensitivity/recall is high while the precision is very low, which means there are many false positives (loans predicted high risk that are actually low risk). None of these models is useful for this dataset as trained here, so I would not recommend using them to predict credit risk. A dataset with more observations might produce better results.