Python pipeline for MNIST digit classification with PCA for dimensionality reduction and K-Nearest Neighbours with inverse-distance weighting. Includes training, evaluation, and model-saving capabilities, optimised for noisy and occluded datasets.

Shohail-Ismail/MNIST-PCA-KNN-Classifier

MNIST PCA-KNN Classifier

Background information

A PCA-KNN classifier for a modified MNIST dataset, built for my COM2104 (Data Driven Computing) module. The goal is to accurately predict labels (digits 0-9) from 28x28 grayscale images of handwritten digits. The classifier is trained on 3,000 images and has been tested on two datasets: 1,000 noisy images simulated using Gaussian noise, and 1,000 masked images with 15x15 block occlusions.

In the current implementation, the model achieves a classification accuracy of 90.60% on the noisy dataset and 77.50% on the masked dataset.
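The two kinds of corruption used in the test sets can be illustrated with a minimal NumPy sketch. This is not the code that produced the actual datasets: the function names, the noise standard deviation (`sigma=50.0`), the `[0, 255]` pixel range, the random block position, and the fill value of 0 are all assumptions made for illustration.

```python
import numpy as np


def add_gaussian_noise(image, sigma=50.0, rng=None):
    """Add zero-mean Gaussian noise to an image with pixel values in [0, 255].

    sigma is an assumed value; the noise level of the real test set is not stated.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)
    # Keep the result inside the valid pixel range.
    return np.clip(noisy, 0, 255)


def add_block_mask(image, block=15, fill=0.0, rng=None):
    """Occlude one randomly placed block x block square (position and fill are assumptions)."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape
    # The upper bound is exclusive, so the block always fits inside the image.
    top = rng.integers(0, height - block + 1)
    left = rng.integers(0, width - block + 1)
    masked = image.astype(np.float64).copy()
    masked[top:top + block, left:left + block] = fill
    return masked
```

A 15x15 block covers 225 of the 784 pixels, which is roughly 29% of each image and may partly explain why accuracy is lower on the masked set than on the noisy one.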

Core features

  • Principal Component Analysis (PCA) is used to reduce the dimensionality of the 784-dimensional (28x28) images by projecting them onto the top 55 principal components. This retains most of the variance while discarding much of the noise, which lowers computational cost and helps limit overfitting.

  • A K-Nearest Neighbours (KNN) classifier with a K value of 5 is used, along with inverse-distance weighting to prioritise closer neighbours in the feature space, which reduces misclassification caused by outliers.

  • Tools for training the model, evaluating it on the test datasets, and saving/loading trained models are included for ease of testing and deployment.
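The PCA projection and the inverse-distance-weighted KNN vote described above can be sketched in plain NumPy as follows. This is an illustrative sketch, not the repository's implementation: the function names (`fit_pca`, `project`, `knn_predict`), the use of SVD to obtain the principal axes, and the small `eps` added to avoid division by zero are assumptions. Only the 55 components and K = 5 come from the description above.

```python
import numpy as np


def fit_pca(X, n_components=55):
    """Return the training mean and the top principal axes (as columns)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T


def project(X, mean, components):
    """Centre the data with the training mean and project onto the principal axes."""
    return (X - mean) @ components


def knn_predict(train_Z, train_y, test_Z, k=5, n_classes=10, eps=1e-12):
    """Classify each test point by an inverse-distance-weighted vote of its k nearest neighbours."""
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = (
        (test_Z ** 2).sum(axis=1)[:, None]
        + (train_Z ** 2).sum(axis=1)[None, :]
        - 2.0 * test_Z @ train_Z.T
    )
    dist = np.sqrt(np.maximum(d2, 0.0))  # clamp tiny negatives from rounding
    nearest = np.argsort(dist, axis=1)[:, :k]
    # Closer neighbours receive larger weights; eps guards against zero distance.
    weights = 1.0 / (np.take_along_axis(dist, nearest, axis=1) + eps)
    votes = np.zeros((test_Z.shape[0], n_classes))
    rows = np.arange(test_Z.shape[0])[:, None]
    # Accumulate each neighbour's weight into the bin of its (integer) label.
    np.add.at(votes, (rows, train_y[nearest]), weights)
    return votes.argmax(axis=1)
```

With flattened images in an array `X_train` of shape `(3000, 784)` and integer labels `y_train`, usage would look like `mean, comps = fit_pca(X_train)` followed by `knn_predict(project(X_train, mean, comps), y_train, project(X_test, mean, comps))`. Note that the test data is centred with the training mean, not its own.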

Limitations

  • The KNN classifier must keep the entire training dataset in memory and perform computationally expensive distance calculations at inference time, which limits scalability to larger datasets.
  • PCA parameters are manually tuned, so future iterations could benefit from automated optimisation methods.
  • Euclidean distance is used as the metric for KNN, which may not be optimal for datasets with heavily overlapping classes (for these, a metric such as cosine similarity may be a better choice, as it compares the direction of feature vectors rather than their magnitude).
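To make the last point concrete, the short sketch below compares the two metrics on a pair of vectors that point in the same direction but differ in magnitude. The helper names and the `eps` guard are illustrative assumptions and are not part of the repository.

```python
import numpy as np


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to differences in magnitude."""
    return float(np.linalg.norm(a - b))


def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity; close to 0 when a and b point in the same direction."""
    denom = np.linalg.norm(a) * np.linalg.norm(b) + eps  # eps avoids division by zero
    return float(1.0 - (a @ b) / denom)


a = np.array([1.0, 2.0, 3.0])
b = 3.0 * a  # same direction, three times the magnitude

print(euclidean_distance(a, b))  # large: the vectors are far apart in space
print(cosine_distance(a, b))     # approximately 0: the directions are identical
```

Whether cosine distance would actually improve accuracy on the noisy or masked test sets is an open question that would need to be measured.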

Environment setup

To run the project, ensure you have the following environment set up:

  • Required Python Version: Python 3.12.8

  • Libraries:

    • numpy
    • scipy
    • scikit-learn
    • pandas
    • Pillow
    • joblib
  • Structure:

    • Ensure that the MNIST dataset and its subsets (train, noise_test, mask_test) are placed in the correct directories, as expected by the get_dataset function in utils.py.
  • Model:

    • A trained model can be saved as trained_model.pkl and loaded for inference.
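Saving and reloading a model with joblib can be sketched as below. The contents of the `model` dictionary are hypothetical: what `train.py` actually stores in `trained_model.pkl` is not shown here. A temporary directory is used so the example does not overwrite a real model file.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical model contents: a PCA mean, 55 principal axes, and the K value.
model = {
    "mean": np.zeros(784),
    "components": np.zeros((784, 55)),
    "k": 5,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "trained_model.pkl")
    joblib.dump(model, path)       # serialise the object to disk
    restored = joblib.load(path)   # read it back for inference
```

As with pickle, joblib files should only be loaded from sources you trust, since loading can execute arbitrary code.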

Run the training and evaluation scripts in the following order:

# To train and save the model
python train.py

# To evaluate the model on the test datasets
python evaluate.py 
