This repository contains scripts and configurations for running various experiments related to deep learning, model training, and GPU-based optimizations. Follow the instructions to set up the environment and execute each experiment.
Create a Virtual Environment
- Ensure you have Python installed on your system.
- Create a virtual environment:
python -m venv env
- Activate the virtual environment:
# On Linux/Mac
source env/bin/activate
# On Windows
.\env\Scripts\activate
- Disclaimer: Some packages may be missing from the requirements file, and the SLURM scripts may need to be adjusted for your cluster.
Install Dependencies
- Install required packages:
pip install -r requirements.txt
Setup Hugging Face
- Configure your Hugging Face account to use models:
huggingface-cli login
Setup WandB
- Log in to Weights & Biases:
wandb login
Experiment 1
- Objective: Fine-tune the LLAMA 3.2 3B model using DeepSpeed for optimized distributed training.
- Script:
code/cs598-AIE/train-llamaDS.slurm
- Output:
code/cs598-AIE/experiment1.out
- Run Command:
sbatch train-llamaDS.slurm
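DeepSpeed reads its settings from a JSON config passed to the launcher. The repository's actual config is not shown here; the fragment below is an illustrative ZeRO stage 2 setup with hypothetical batch-size values, just to show the kind of options the SLURM script's training run relies on:

```json
{
  "train_batch_size": 16,
  "gradient_accumulation_steps": 1,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```

ZeRO stage 2 shards optimizer states and gradients across GPUs, which is what makes fine-tuning a 3B-parameter model feasible on a modest node.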
Experiment 2
- Objective: Fine-tune the LLAMA 3.2 3B model for summarization using Fully Sharded Data Parallel (FSDP).
- Script:
code/cs598-AIE/accelerate_variants/summarization.slurm
- Output:
code/cs598-AIE/experiment2.out
- Run Command:
sbatch accelerate_variants/summarization.slurm
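When FSDP is driven through Hugging Face Accelerate (as the `accelerate_variants` directory suggests), the sharding behavior comes from an Accelerate config file. The fragment below is a sketch only; exact field names vary across Accelerate versions and this is not the repository's actual config:

```yaml
distributed_type: FSDP
mixed_precision: bf16
num_processes: 4
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
```

With `FULL_SHARD`, parameters, gradients, and optimizer states are all partitioned across the GPUs, trading extra communication for a much smaller per-GPU memory footprint.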
Experiment 3
- Objective: Test GPU-to-GPU data transfer to improve performance for asynchronous checkpointing.
- Script:
code/cs598-AIE/GPUtransfer-basic.slurm
- Output:
code/cs598-AIE/experiment3.out
- Run Command:
sbatch GPUtransfer-basic.slurm
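The core of a basic GPU-to-GPU transfer test can be sketched in a few lines of PyTorch. This is not the repository's script, only a minimal illustration of the measurement; it assumes a machine with at least two CUDA devices, and the tensor size is arbitrary:

```python
import torch

def time_gpu_to_gpu_copy(num_mb: int = 256) -> float:
    """Time a device-to-device copy of a num_mb-megabyte float32 tensor.

    Returns elapsed time in milliseconds. Requires >= 2 CUDA devices.
    """
    if torch.cuda.device_count() < 2:
        raise RuntimeError("This sketch needs at least two GPUs")

    # num_mb MB of float32 values on the source GPU.
    src = torch.empty(num_mb * 1024 * 1024 // 4,
                      dtype=torch.float32, device="cuda:0")

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    # Async copy; goes peer-to-peer when P2P access is enabled,
    # otherwise it is staged through host memory.
    dst = src.to("cuda:1", non_blocking=True)
    end.record()

    torch.cuda.synchronize()
    return start.elapsed_time(end)
```

Because the copy is issued with `non_blocking=True`, the training stream is free to continue while the transfer runs, which is the property that makes GPU-to-GPU transfer attractive for asynchronous checkpointing.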
Experiment 4
- Objective: Optimize the training process by caching checkpoints on GPUs within the same node.
- Script:
code/cs598-AIE/train_t5.slurm
- Output:
code/cs598-AIE/experiment4.out
- Run Command:
sbatch train_t5.slurm
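The idea behind checkpoint caching is to block training only for the fast snapshot step and push the slow persistence step into the background. The stdlib-only sketch below illustrates that pattern with a writer thread; in the actual experiment the "snapshot" would be a GPU-to-GPU copy to a cache device and the write would use `torch.save`, so treat this as a framework-agnostic illustration, not the repository's implementation:

```python
import pickle
import queue
import threading

class AsyncCheckpointer:
    """Persist checkpoints in the background: training blocks only for
    the cheap snapshot, not for the slow write to storage."""

    def __init__(self):
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def _drain(self):
        while True:
            item = self._queue.get()
            if item is None:  # shutdown sentinel
                break
            path, snapshot = item
            # Slow I/O happens off the training thread.
            with open(path, "wb") as f:
                pickle.dump(snapshot, f)

    def save(self, path, state):
        # Fast, blocking step: snapshot the state. In the experiment this
        # corresponds to copying the checkpoint to a cache GPU.
        snapshot = dict(state)
        self._queue.put((path, snapshot))  # slow step runs asynchronously

    def close(self):
        # Flush pending writes and stop the worker.
        self._queue.put(None)
        self._worker.join()
```

Snapshotting before enqueueing matters: it decouples the saved state from the live training state, so the optimizer can keep mutating weights while the previous checkpoint is still being written.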
Experiment 5
- Objective: Optimize the training process by caching checkpoints on GPUs across nodes.
- Script:
code/cs598-AIE/two_nodes/train_t5_two_nodes0.slurm and code/cs598-AIE/two_nodes/train_t5_two_nodes1.slurm
- Output:
code/cs598-AIE/two_nodes/t5_node0.5719534.out and code/cs598-AIE/two_nodes/t5_node0.5719535.out
- Run Command:
sbatch two_nodes/train_t5_two_nodes0.slurm