ML4Science: The Impact of Whole-Slide Image Resolution on Foundation Model Performance in Computational Pathology
This project extracts, analyses, and uses embeddings from histopathology datasets to train and evaluate models for several classification tasks. The goal is to apply deep learning methods, including ViT-based feature extraction, to tissue subtyping, pathology classification, and related tasks, and to assess how the magnification level of the dataset affects classification results.
CS-433-ml-project-2-ml-ts-4science
├── preprocessing
│ ├── preprocess.py # Patch generation and metadata creation
│ ├── generate_metadata.py # Metadata generation for slide images
│ └── generate_bracs_labels.py # Label extraction for BRACS dataset
├── analysis
│ ├── tiles_analysis.py # Analysis and visualization of extracted tiles
│ └── output # Directory to store output analysis
├── UNI
│ ├── infere_uni.py # Embedding inference using UNI model
│ ├── download_uni.py # Script to download UNI model
│ ├── generate.sh # Batch script for embedding generation
│ ├── generate_uni_embeddings.sh # Batch script for distributed embedding inference
│ └── infer_uni_regions.py # Region-based inference for slides
├── downstream
│ ├── main.py # Training script for NN and k-NN methods
│ ├── models.py # Implementation of various neural network models
│ ├── analyzing_uni_embeddings.py # Analysis of extracted embeddings
│ └── dataset.py # Dataset loader for embeddings and labels
├── requirements.txt # Python dependencies
├── config.yaml # Configuration file for datasets and paths
├── ML4Science.csv # Dataset details
└── README.md # Project documentation
- Patch Extraction: The preprocess.py script generates image patches based on the specified resolution (MPP) and patch size. It creates metadata containing tile coordinates and details.
- Metadata Generation: generate_metadata.py collects slide-level information and saves it as structured metadata.
- Label Mapping: generate_bracs_labels.py extracts class labels from filenames for the BRACS dataset.
- Embedding Generation: Scripts like infere_uni.py and infer_uni_regions.py use the UNI model to compute embeddings for extracted tiles or regions.
- Batch Processing: Batch scripts (generate.sh, generate_uni_embeddings.sh) are provided for distributed inference across GPUs.
- Model Download: download_uni.py downloads the required pretrained UNI model from HuggingFace Hub.
- Tile Analysis: tiles_analysis.py visualizes tiles, checks overlaps, and validates tile extraction.
- Embedding Analysis: analyzing_uni_embeddings.py performs statistical analysis and dimensionality reduction on embeddings.
- Classification Models: The main.py script supports training neural networks and k-NN classifiers using embeddings.
- Model Architectures: models.py contains implementations for MLPs, linear models, and attention mechanisms.
- Dataset Loader: dataset.py provides a PyTorch-compatible loader for embedding-based datasets.
Dataset | Description | Magnifications | MPP (µm/px) | Task |
---|---|---|---|---|
BACH | Subtyping into normal, benign, in situ carcinoma, and invasive carcinoma | 20x, 10x, 5x | 0.42 | Region/ROI-level classification |
BRACS | Subtyping of regions of interest (ROIs) extracted from WSIs | 40x, 20x, 10x, 5x | 0.25 | Whole-slide image (WSI) classification |
BreakHis | Histopathological dataset for breast cancer classification | 40x, 20x, 10x, 5x | 0.25 | Patch-level classification |
- Resolutions supported: 40x, 20x, 10x, and 5x magnifications.
- Patch size: 224x224 pixels by default.
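The relation between magnification, MPP, and patch footprint can be sketched as follows. This is a minimal illustration and not the logic of preprocess.py: the function names `target_mpp` and `tile_origins` are hypothetical, and it assumes the MPP listed for each dataset belongs to its highest listed magnification. It relies only on the fact that MPP scales inversely with magnification, so halving the magnification doubles the microns covered by each pixel.

```python
def target_mpp(base_mpp, base_mag, target_mag):
    """MPP at a lower magnification, given the MPP at the base magnification."""
    return base_mpp * base_mag / target_mag


def tile_origins(width, height, base_mag, target_mag, patch_size=224):
    """Top-left corners, in base-level pixels, of non-overlapping tiles.

    Each tile covers `patch_size * base_mag / target_mag` base-level pixels
    per side and would be resized to `patch_size` afterwards.
    Incomplete tiles at the right and bottom borders are dropped.
    """
    span = round(patch_size * base_mag / target_mag)
    return [
        (x, y)
        for y in range(0, height - span + 1, span)
        for x in range(0, width - span + 1, span)
    ]


# Assumed example: a slide scanned at 40x with 0.25 µm/px, tiled for 10x.
print(target_mpp(0.25, 40, 10))
print(len(tile_origins(10_000, 8_000, base_mag=40, target_mag=10)))
```

At lower magnification each 224x224 patch covers more tissue, so the same slide yields fewer patches.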
- Pipeline: Mean pooling of patch embeddings → Linear/MLP classifier
- Pipeline: Attention mechanism (weighted patch aggregation) → Linear/MLP classifier
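The two aggregation steps above can be illustrated with a dependency-free sketch. It is not the code in models.py: the attention variant here scores each patch with a single hypothetical weight vector `w`, which is simpler than a gated attention module, and in practice `w` would be learned jointly with the classifier. Both functions turn a bag of patch embeddings into one vector that a linear or MLP classifier could consume.

```python
import math


def mean_pool(patches):
    """Average the patch embeddings element-wise into one vector."""
    n = len(patches)
    return [sum(column) / n for column in zip(*patches)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(patches, w):
    """Weighted average of patches; weights are softmax(w . h) per patch h."""
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in patches]
    alphas = softmax(scores)
    pooled = [
        sum(a * h[j] for a, h in zip(alphas, patches))
        for j in range(len(w))
    ]
    return pooled, alphas


# Toy bag of three 2-dimensional patch embeddings (made-up values).
bag = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(mean_pool(bag))
print(attention_pool(bag, w=[2.0, -1.0]))
```

With an all-zero `w` every patch receives the same weight, so attention pooling reduces to mean pooling.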
- Neural Networks (NN):
  - Supports attention pooling and mean pooling.
  - Configurable number of layers and dropout rates.
- k-Nearest Neighbors (k-NN):
  - Cosine and Euclidean similarity metrics.
  - Configurable values of k.
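For readers unfamiliar with the k-NN option, the sketch below shows majority-vote classification over embeddings with either metric. It is a standard-library illustration under assumed names (`knn_predict`, `cosine_distance`, `euclidean_distance`), not the implementation in main.py, and the toy embeddings and labels are made up.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, train_x, train_y, k=3, metric=cosine_distance):
    """Label the query by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: metric(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 2-dimensional embeddings with made-up labels.
train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_y = ["benign", "benign", "invasive", "invasive"]
print(knn_predict([1.0, 0.05], train_x, train_y, k=3))
print(knn_predict([1.0, 0.05], train_x, train_y, k=3, metric=euclidean_distance))
```

Cosine distance ignores the magnitude of the embeddings, while Euclidean distance does not, so the two metrics can rank neighbours differently when embedding norms vary.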
- Clone the repository:
  git clone https://github.com/CS-433/ml-project-2-ml-ts-4science
  cd ml-project-2-ml-ts-4science
- Install dependencies:
  pip install -r requirements.txt
- Preprocessing:
  python preprocessing/preprocess.py
- Embedding Inference:
  python UNI/infere_uni.py --data_dir <path_to_data> --model_path <path_to_model>
- Analysis:
  python analysis/tiles_analysis.py --data_dir <path_to_tiles>
- Training:
  python downstream/main.py --method nn --dataset BACH --augmentation 20 --pooling GatedAttention --nlayers_classifier 2 --dropout_ratio 0.5 --epochs 100
  python downstream/main.py --method knn --dataset BACH --augmentation 5 --similarity cosine
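Since the project compares magnification levels, a small driver can generate one training command per level. The sketch below only builds and prints the commands. It assumes that `--augmentation` selects the magnification (the examples above pass 20 and 5, but this README does not confirm it) and that 10 is also an accepted value for BACH. The helper `build_command` is hypothetical, and only flags shown in the commands above are used.

```python
import shlex

# BACH magnifications as listed in the dataset table (assumed valid flag values).
MAGNIFICATIONS = [20, 10, 5]


def build_command(dataset, magnification, method="knn", similarity="cosine"):
    """Return the argument list for one downstream run (nothing is executed)."""
    return [
        "python", "downstream/main.py",
        "--method", method,
        "--dataset", dataset,
        "--augmentation", str(magnification),  # assumed to mean magnification
        "--similarity", similarity,
    ]


commands = [build_command("BACH", m) for m in MAGNIFICATIONS]
for cmd in commands:
    # To launch a run, pass cmd to subprocess.run(cmd, check=True) instead.
    print(shlex.join(cmd))
```

Building argument lists rather than shell strings avoids quoting problems if a dataset name or path ever contains spaces.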
- Ensure data directories are correctly specified in config.yaml.
- Use batch scripts for efficient processing on HPC environments.
- For custom datasets, update ML4Science.csv and config.yaml accordingly.
- Carlos Hurtado Comín
- Mario Rico Ibáñez
- Daniel López Gala
- Sevda Öğüt
- Cédric Vincent-Cuaz
- Vaishnavi Subramanian