GitHub - CS-433/ml-project-2-ml-ts-4science: ml-project-2-ml-ts-4science created by GitHub Classroom

ML4Science: The Impact of Whole-Slide Image Resolution on Foundation Model Performance in Computational Pathology

This project focuses on extracting, analysing and using embeddings from histopathology datasets to train and evaluate models for various classification tasks. The ultimate goal is to apply deep learning methods, including ViT-based feature extraction, to facilitate tissue subtyping, pathology classification and related tasks, and to evaluate the relevance of the magnification level of the dataset in the classification results.

Directory Structure

CS-433-ml-project-2-ml-ts-4science
├── preprocessing
│   ├── preprocess.py               # Patch generation and metadata creation
│   ├── generate_metadata.py        # Metadata generation for slide images
│   └── generate_bracs_labels.py    # Label extraction for BRACS dataset
├── analysis
│   ├── tiles_analysis.py           # Analysis and visualization of extracted tiles
│   └── output                      # Directory to store output analysis
├── UNI
│   ├── infere_uni.py               # Embedding inference using UNI model
│   ├── download_uni.py             # Script to download UNI model
│   ├── generate.sh                 # Batch script for embedding generation
│   ├── generate_uni_embeddings.sh  # Batch script for distributed embedding inference
│   └── infer_uni_regions.py        # Region-based inference for slides
├── downstream
│   ├── main.py                     # Training script for NN and k-NN methods
│   ├── models.py                   # Implementation of various neural network models
│   ├── analyzing_uni_embeddings.py # Analysis of extracted embeddings
│   └── dataset.py                  # Dataset loader for embeddings and labels
├── requirements.txt                # Python dependencies
├── config.yaml                     # Configuration file for datasets and paths
├── ML4Science.csv                  # Dataset details
└── README.md                       # Project documentation

Key Components

Preprocessing

Patch Extraction:
- The preprocess.py script generates image patches based on the specified resolution (MPP) and patch size. It creates metadata containing tile coordinates and details.
Metadata Generation:
- generate_metadata.py collects slide-level information and saves it as structured metadata.
Label Mapping:
- generate_bracs_labels.py extracts class labels from filenames for the BRACS dataset.

UNI-based Embedding Inference

Embedding Generation:
- Scripts like infere_uni.py and infer_uni_regions.py use the UNI model to compute embeddings for extracted tiles or regions.
Batch Processing:
- Batch scripts (generate.sh, generate_uni_embeddings.sh) are provided for distributed inference across GPUs.
Model Download:
- download_uni.py downloads the required pretrained UNI model from HuggingFace Hub.

Analysis

Tile Analysis:
- tiles_analysis.py visualizes tiles, checks overlaps, and validates tile extraction.
Embedding Analysis:
- analyzing_uni_embeddings.py performs statistical analysis and dimensionality reduction on embeddings.

Downstream Tasks

Classification Models:
- The main.py script supports training neural networks and k-NN classifiers using embeddings.
Model Architectures:
- models.py contains implementations for MLPs, linear models, and attention mechanisms.
Dataset Loader:
- dataset.py provides a PyTorch-compatible loader for embedding-based datasets.

Datasets

Supported Datasets

Dataset	Description	Resolution	MPP	Task
BACH	Subtyping into normal, benign, in situ carcinoma, and invasive carcinoma	20x, 10x, 5x	0.42	Region/RoI-level classification
BRACS	Subtyping of regions of interest (ROIs) extracted from WSIs	40x, 20x, 10x, 5x	0.25	Whole-slide image (WSI) classification
BreakHis	Histopathological dataset for breast cancer classification	40x, 20x, 10x, 5x	0.25	Patch-level classification

Preprocessing Details

Resolutions Supported:
- 40x, 20x, 10x, and 5x magnifications.
Patch Size:
- Default: 224x224 pixels.

Methods

Mean Pooling

Pipeline: | Mean pooling of patch embeddings | → | Linear/MLP classifier |

Attention Mechanism

Pipeline: | Attention mechanism (weighted patch aggregation) | → | Linear/MLP classifier |

Training

Neural Networks (NN):
- Supports attention pooling and mean pooling.
- Configurable number of layers and dropout rates.
k-Nearest Neighbors (k-NN):
- Cosine and Euclidean similarity metrics.
- Configurable values of k.

Usage

Installation

Clone the repository:

git clone https://github.com/CS-433/ml-project-2-ml-ts-4science
cd CS-433-ml-project-2-ml-ts-4science

Install dependencies:
```
pip install -r requirements.txt
```

Running Scripts

Preprocessing:
```
python preprocessing/preprocess.py
```

Embedding Inference:

python UNI/infere_uni.py --data_dir <path_to_data> --model_path <path_to_model>

Analysis:

python analysis/tiles_analysis.py --data_dir <path_to_tiles>

Training:

python downstream/main.py --method nn --dataset BACH --augmentation 20 --pooling GatedAttention --nlayers_classifier 2 --dropout_ratio 0.5 --epochs 100

python downstream/main.py --method knn --dataset BACH --augmentation 5 --similarity cosine

Notes

Ensure data directories are correctly specified in config.yaml.
Use batch scripts for efficient processing on HPC environments.
For custom datasets, update ML4Science.csv and config.yaml accordingly.

Contributors

Carlos Hurtado Comín
Mario Rico Ibáñez
Daniel López Gala

Supervised by LTS4:

Sevda Öğüt
Cédric Vincent-Cuaz
Vaishnavi Subramanian

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ML4Science: The Impact of Whole-Slide Image Resolution on Foundation Model Performance in Computational Pathology

Directory Structure

Key Components

Preprocessing

UNI-based Embedding Inference

Analysis

Downstream Tasks

Datasets

Supported Datasets

Preprocessing Details

Methods

Mean Pooling

Attention Mechanism

Training

Usage

Installation

Running Scripts

Notes

Contributors

Supervised by LTS4:

About

Releases

Packages

Contributors 4

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 80 Commits
UNI		UNI
analysis		analysis
downstream		downstream
preprocessing		preprocessing
.gitignore		.gitignore
ML4Science.csv		ML4Science.csv
README.md		README.md
config.yaml		config.yaml
requirements.txt		requirements.txt

CS-433/ml-project-2-ml-ts-4science

Folders and files

Latest commit

History

Repository files navigation

ML4Science: The Impact of Whole-Slide Image Resolution on Foundation Model Performance in Computational Pathology

Directory Structure

Key Components

Preprocessing

UNI-based Embedding Inference

Analysis

Downstream Tasks

Datasets

Supported Datasets

Preprocessing Details

Methods

Mean Pooling

Attention Mechanism

Training

Usage

Installation

Running Scripts

Notes

Contributors

Supervised by LTS4:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages