CS598-AIE Experiments

This repository contains scripts and configurations for experiments on deep learning model training and GPU-based optimizations. Follow the instructions below to set up the environment and run each experiment.

Setup

Create a Virtual Environment
- Ensure you have Python installed on your system.
- Create a virtual environment:
```
python -m venv env
```
- Activate the virtual environment:
```
# On Linux/Mac
source env/bin/activate

# On Windows
.\env\Scripts\activate
```
- Disclaimer: Some required packages may be missing, and the SLURM scripts may need to be adjusted for your cluster.
Install Dependencies
- Install required packages:
```
pip install -r requirements.txt
```
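Since some packages may be missing after installation, a quick import check can catch gaps before a job is submitted. The following is a minimal sketch: `check_modules` is a hypothetical helper, not part of this repository, and the module names in the usage comment are assumptions that should be matched against requirements.txt.

```shell
# Hypothetical helper: try to import each named Python module and report the result.
# PYTHON may be set to pick an interpreter; it defaults to the one on PATH,
# which is the virtual environment's interpreter once it is activated.
check_modules() {
    missing=0
    for mod in "$@"; do
        if "${PYTHON:-python}" -c "import ${mod}" >/dev/null 2>&1; then
            echo "ok: ${mod}"
        else
            echo "missing: ${mod}"
            missing=1
        fi
    done
    # Non-zero exit status if at least one import failed.
    return "$missing"
}

# Example call (module names are assumptions, adjust to requirements.txt):
# check_modules torch transformers deepspeed accelerate wandb
```

Any module reported as missing can then be installed by hand with pip.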
Setup Hugging Face
- Configure your Hugging Face account to use models:
```
huggingface-cli login
```
Setup WandB
- Log in to Weights & Biases:
```
wandb login
```
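On a cluster where interactive login is inconvenient, both tools can also pick up credentials from environment variables (`HF_TOKEN` for recent versions of the Hugging Face Hub client, `WANDB_API_KEY` for Weights & Biases). The sketch below assumes that setup; `require_env` is a hypothetical helper that fails early when a variable is unset, so a job is not queued only to stop at an authentication error.

```shell
# Hypothetical helper: fail if any named environment variable is empty or unset.
require_env() {
    status=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: $name" >&2
            status=1
        fi
    done
    return "$status"
}

# Example (assumes the tokens were exported in the current shell beforehand):
# require_env HF_TOKEN WANDB_API_KEY && sbatch train-llamaDS.slurm
```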

Experiments

Experiment 1: DeepSpeed

Objective: Fine-tune the LLAMA 3.2 3B model using DeepSpeed for optimized distributed training.
Script: code/cs598-AIE/train-llamaDS.slurm
Output: code/cs598-AIE/experiment1.out
Run Command:
```
sbatch train-llamaDS.slurm
```
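After submission it is often useful to check the queue and follow the job's output file. The sketch below shows one way to do that with standard SLURM commands. Both function names are hypothetical, and it assumes the job writes its log to a path visible from the submission directory (for example experiment1.out).

```shell
# Pull the numeric job ID out of sbatch's default "Submitted batch job <id>" message.
job_id_from_sbatch() {
    printf '%s\n' "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Hypothetical wrapper: submit a script, show its queue entry, then follow its log.
# Usage: submit_and_watch train-llamaDS.slurm experiment1.out
submit_and_watch() {
    submit_out=$(sbatch "$1") || return 1
    job_id=$(job_id_from_sbatch "$submit_out")
    echo "submitted job ${job_id}"
    squeue -j "$job_id"
    # -F keeps retrying until the log file exists, then streams new lines.
    tail -F "$2"
}
```

The same wrapper applies to the other experiments by swapping in their script and output paths. Press Ctrl-C to stop following the log; the job itself keeps running.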

Experiment 2: FSDP

Objective: Fine-tune the LLAMA 3.2 3B model on a summarization task using Fully Sharded Data Parallel (FSDP).
Script: code/cs598-AIE/accelerate_variants/summarization.slurm
Output: code/cs598-AIE/experiment2.out

Run Command:
```
sbatch accelerate_variants/summarization.slurm
```

Experiment 3: GPU to GPU transfer for LLAMA 3.2 3B

Objective: Test GPU-to-GPU data transfer to improve performance for async checkpointing.
Script: code/cs598-AIE/GPUtransfer-basic.slurm
Output: code/cs598-AIE/experiment3.out
Run Command:
```
sbatch GPUtransfer-basic.slurm
```

Experiment 4A: T5 training with GPU checkpointing

Objective: Optimize training by caching checkpoints on GPUs within the same node.
Script: code/cs598-AIE/train_t5.slurm
Output: code/cs598-AIE/experiment4.out
Run Command:
```
sbatch train_t5.slurm
```

Experiment 4B: T5 training with GPU checkpointing

Objective: Optimize training by caching checkpoints on GPUs across nodes.
Script: code/cs598-AIE/two_nodes/train_t5_two_nodes0.slurm and code/cs598-AIE/two_nodes/train_t5_two_nodes1.slurm
Output: code/cs598-AIE/two_nodes/t5_node0.5719534.out and code/cs598-AIE/two_nodes/t5_node0.5719535.out

Run Command:
```
sbatch two_nodes/train_t5_two_nodes0.slurm
```

Bhagyashreet20/cs598-AIE
