PCA-KNN classifier for a modified MNIST dataset, made for my COM2104 (Data Driven Computing) module. The goal is to accurately predict labels (digits 0-9) from 28x28 grayscale images of handwritten digits. The classifier is trained on 3,000 images and has been tested on two datasets: 1,000 noisy images simulated using Gaussian noise, and 1,000 masked images with 15x15 block occlusions.
In the current implementation, the model achieves a classification accuracy of 90.60% on the noisy dataset and 77.50% on the masked dataset.
- Principal Component Analysis (PCA) is used to reduce the dimensionality of the 784-dimensional (28x28) images by projecting them onto the top 55 principal components. This retains most of the variance while discarding noise, which lowers the computational cost and helps reduce overfitting.
- A K-Nearest Neighbours (KNN) classifier with K = 5 is used, along with inverse-distance weighting to prioritise closer neighbours in the feature space, which reduces misclassification caused by outliers.
- Tools for training the model, evaluating it on the test datasets, and saving/loading trained models are included for ease of testing and deployment.
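The PCA projection and inverse-distance-weighted KNN vote described above can be sketched with NumPy alone. This is a minimal illustration, not the project's actual code: the function names (`fit_pca`, `project`, `knn_predict`) are hypothetical, and random synthetic clusters stand in for the MNIST images.

```python
import numpy as np


def fit_pca(X, n_components=55):
    """Return the training mean and the top principal axes (as columns)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T


def project(X, mean, components):
    """Centre the data with the training mean and project onto the axes."""
    return (X - mean) @ components


def knn_predict(train_feats, train_labels, test_feats, k=5, n_classes=10, eps=1e-12):
    """Classify each test vector by an inverse-distance-weighted vote."""
    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    d2 = (
        (test_feats ** 2).sum(axis=1)[:, None]
        - 2.0 * test_feats @ train_feats.T
        + (train_feats ** 2).sum(axis=1)[None, :]
    )
    dists = np.sqrt(np.maximum(d2, 0.0))
    nearest = np.argsort(dists, axis=1)[:, :k]  # indices of the k closest
    weights = 1.0 / (np.take_along_axis(dists, nearest, axis=1) + eps)
    votes = np.zeros((test_feats.shape[0], n_classes))
    rows = np.arange(test_feats.shape[0])[:, None]
    # Accumulate each neighbour's weight into the column of its label.
    np.add.at(votes, (rows, train_labels[nearest]), weights)
    return votes.argmax(axis=1)


# Synthetic stand-in for the image data (assumption: 10 well-separated classes).
rng = np.random.default_rng(0)
centres = 3.0 * rng.normal(size=(10, 784))
y_train = rng.integers(0, 10, size=300)
y_test = rng.integers(0, 10, size=100)
X_train = centres[y_train] + rng.normal(size=(300, 784))
X_test = centres[y_test] + rng.normal(size=(100, 784))

mean, components = fit_pca(X_train, n_components=55)
train_feats = project(X_train, mean, components)
test_feats = project(X_test, mean, components)
preds = knn_predict(train_feats, y_train, test_feats, k=5)
accuracy = float((preds == y_test).mean())
print(f"accuracy on synthetic data: {accuracy:.2%}")
```

Note that the test data is centred with the training mean, so both sets are projected into the same feature space.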
- The KNN classifier requires storing the entire training dataset in memory and performing computationally expensive distance calculations during inference, which limits scalability to larger datasets.
- PCA parameters are manually tuned, so future iterations could benefit from automated optimisation methods.
- Euclidean distance is used as the metric for KNN, which may not be optimal for datasets with a lot of class overlap (for these, a measure like cosine similarity may be a better choice, as it compares the direction of feature vectors rather than their magnitude).
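To illustrate the last point, the sketch below compares Euclidean distance with cosine distance (1 minus cosine similarity) on three toy vectors. The helper names are hypothetical and the vectors are made up for illustration; they are not taken from the dataset.

```python
import numpy as np


def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))


def cosine_distance(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    return float(1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 3.0])
b = 2.0 * a                    # same direction as a, twice the magnitude
c = np.array([3.0, 2.0, 1.0])  # same magnitude as a, different direction

print("a vs b:", euclidean_distance(a, b), cosine_distance(a, b))
print("a vs c:", euclidean_distance(a, c), cosine_distance(a, c))
```

Euclidean distance ranks `c` as closer to `a` than `b` is, while cosine distance treats `a` and `b` as essentially identical because they point in the same direction.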
To run the project, ensure you have the following environment set up:
- Required Python Version: Python 3.12.8
- Libraries:
  - numpy
  - scipy
  - scikit-learn
  - pandas
  - Pillow
  - joblib
- Structure:
  - Ensure that the MNIST dataset and its subsets (`train`, `noise_test`, `mask_test`) are placed in the correct directories, as expected by the `get_dataset` function in `utils.py`.
- Model:
  - A trained model can be saved as `trained_model.pkl` and loaded for inference.
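Saving and reloading a model file could look roughly like the sketch below, which uses `joblib.dump` and `joblib.load`. The contents of the `model` dictionary are an assumption made for illustration (the project may store a different object), and the file is written to a temporary directory so that a real `trained_model.pkl` is not overwritten.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical model contents; the real project may store different objects.
model = {
    "mean": np.zeros(784),
    "components": np.eye(784)[:, :55],
    "k": 5,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "trained_model.pkl")
    joblib.dump(model, model_path)      # serialise the model to disk
    restored = joblib.load(model_path)  # read it back for inference

print(restored["components"].shape)
```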
Run the training and evaluation scripts in the following order:
# To train and save the model
python train.py
# To evaluate the model on the test datasets
python evaluate.py