ML-Based Customer Risk Analysis for Banking: Achieving 80% Recall on Imbalanced Data

Project Overview

This project focuses on customer risk prediction in a banking environment using machine learning models. The primary challenge was dealing with imbalanced data where the minority class represents customers with a higher risk of default. Our goal was to achieve high recall, specifically targeting 80%, while maintaining acceptable levels of precision and accuracy.

Objectives

Goal: Achieve 80% recall on high-risk customers.
Dataset: Imbalanced banking dataset.
Models Used: CART, Random Forest (RF), Gradient Boosting Machine (GBM), LightGBM, and BalancedRandomClassifier.
Evaluation Metrics: Accuracy, Precision, Recall, F1-Score.

Approach

Data Preprocessing:
- Handled missing data.
- Performed feature scaling and encoding for categorical variables.
- Addressed the class imbalance by resampling the data with RandomUnderSampler and TomekLinks, as described under Imbalanced Data Handling.
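The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration built on scikit-learn, not the notebook's actual code: the toy DataFrame, the column names (`age`, `balance`, `occupation`), and the imputation strategies are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; the column names are illustrative, not the dataset's real schema.
df = pd.DataFrame({
    "age": [25, 40, np.nan, 58],
    "balance": [1200.0, np.nan, 560.0, 9800.0],
    "occupation": ["clerk", "engineer", np.nan, "clerk"],
})
numeric_cols = ["age", "balance"]
categorical_cols = ["occupation"]

# Numeric features: fill gaps with the median, then standardize.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical features: fill gaps with the most frequent value, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense NumPy array as output.
preprocessor = ColumnTransformer(
    [("num", numeric_pipe, numeric_cols), ("cat", categorical_pipe, categorical_cols)],
    sparse_threshold=0,
)
X = preprocessor.fit_transform(df)
print(X.shape)
```

Fitting the transformer on the training split only, and reusing it on the test split, keeps test-set statistics out of the imputation and scaling steps.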
Imbalanced Data Handling:
- Implemented RandomUnderSampler to reduce majority class size.
- Used TomekLinks to remove overlapping data points and further refine the dataset.
Modeling:
- Tried several machine learning models: CART, RF, GBM, and LightGBM.
- The most effective model for handling imbalance was BalancedRandomClassifier.
Hyperparameter Optimization:
- Applied hyperparameter tuning to the BalancedRandomClassifier using grid search to optimize performance.
Model Evaluation:
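- Compared the models on accuracy, precision, recall, and F1-score; the BalancedRandomClassifier provided the best results and reached the 80% recall target.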

Dataset

The dataset used in this project can be downloaded from Kaggle:

Credit Card Approval Prediction Dataset

Customer Demographics: Age, gender, occupation, etc.
Financial Indicators: Credit history, balance, transaction patterns.
Target Variable: Customer risk level.

For privacy reasons, the dataset is not included in this repository.

Results

Best Model: BalancedRandomClassifier
Final Performance Metrics:
- Accuracy: 74%
- Precision: 72%
- Recall: 80%
- F1-Score: 76%
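As a quick consistency check, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two figures above. The helper name below is hypothetical and only serves this illustration.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported values: precision 72%, recall 80%.
f1 = f1_from_precision_recall(0.72, 0.80)
print(f"F1 = {f1:.4f}")  # F1 = 0.7579
```

The result rounds to 76%, which matches the reported F1-score.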

How to Run

Prerequisites

After cloning the repository, install the necessary dependencies from its root directory by running:

pip install -r requirements.txt

Running the Code

Clone the repository:

git clone https://github.com/aysecnkci/banking-risk-analysis-imbalanced-data.git

From the repository root, run the Jupyter notebook to preprocess the data, train the model, and evaluate it:

jupyter notebook notebooks/risk_analysis_banking_imbalanced.ipynb

Repository Structure

├── README.md
├── requirements.txt
├── notebooks/
│   └── risk_analysis_banking_imbalanced.ipynb

Future Work

- Experiment with deep learning models to improve recall.
- Further tune hyperparameters to explore better performance.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Thanks to the contributors and the machine learning community for resources and support.
