This repository contains scripts and configurations for running various experiments related to deep learning, model training, and GPU-based optimizations. Follow the instructions to set up the environment and execute each experiment.
Create a Virtual Environment
- Ensure you have Python installed on your system.
- Create a virtual environment:
python -m venv env
- Activate the virtual environment:
# On Linux/Mac
source env/bin/activate
# On Windows
.\env\Scripts\activate
- Disclaimer: Some packages may be missing from the requirements file, and the SLURM scripts may need to be adjusted for your cluster.
Install Dependencies
- Install required packages:
pip install -r requirements.txt
Setup Hugging Face
- Configure your Hugging Face account to use models:
huggingface-cli login
Setup WandB
- Log in to Weights & Biases:
wandb login
Experiment 1
- Objective: Fine-tune the LLAMA 3.2 3B model using DeepSpeed for optimized distributed training.
- Script:
code/cs598-AIE/train-llamaDS.slurm
- Output:
code/cs598-AIE/experiment1.out
- Run Command:
sbatch train-llamaDS.slurm
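DeepSpeed reads its settings from a JSON config passed to the launcher. The repository's actual config is not shown here; the fragment below is an illustrative ZeRO stage 2 setup with hypothetical batch-size values, just to show the kind of options the SLURM script's training run relies on:

```json
{
  "train_batch_size": 16,
  "gradient_accumulation_steps": 1,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```

ZeRO stage 2 shards optimizer states and gradients across GPUs, which is what makes fine-tuning a 3B-parameter model feasible on a modest node.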
Experiment 2
- Objective: Fine-tune the LLAMA 3.2 3B model for summarization using Fully Sharded Data Parallel (FSDP).
- Script:
code/cs598-AIE/accelerate_variants/summarization.slurm
- Output:
code/cs598-AIE/experiment2.out
- Run Command:
sbatch accelerate_variants/summarization.slurm
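When FSDP is driven through Hugging Face Accelerate (as the `accelerate_variants` directory suggests), the sharding behavior comes from an Accelerate config file. The fragment below is a sketch only; exact field names vary across Accelerate versions and this is not the repository's actual config:

```yaml
distributed_type: FSDP
mixed_precision: bf16
num_processes: 4
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
```

With `FULL_SHARD`, parameters, gradients, and optimizer states are all partitioned across the GPUs, trading extra communication for a much smaller per-GPU memory footprint.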
Experiment 3
- Objective: Test GPU-to-GPU data transfer to improve performance for asynchronous checkpointing.
- Script:
code/cs598-AIE/GPUtransfer-basic.slurm
- Output:
code/cs598-AIE/experiment3.out
- Run Command:
sbatch GPUtransfer-basic.slurm
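The core of a basic GPU-to-GPU transfer test can be sketched in a few lines of PyTorch. This is not the repository's script, only a minimal illustration of the measurement; it assumes a machine with at least two CUDA devices, and the tensor size is arbitrary:

```python
import torch

def time_gpu_to_gpu_copy(num_mb: int = 256) -> float:
    """Time a device-to-device copy of a num_mb-megabyte float32 tensor.

    Returns elapsed time in milliseconds. Requires >= 2 CUDA devices.
    """
    if torch.cuda.device_count() < 2:
        raise RuntimeError("This sketch needs at least two GPUs")

    # num_mb MB of float32 values on the source GPU.
    src = torch.empty(num_mb * 1024 * 1024 // 4,
                      dtype=torch.float32, device="cuda:0")

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    # Async copy; goes peer-to-peer when P2P access is enabled,
    # otherwise it is staged through host memory.
    dst = src.to("cuda:1", non_blocking=True)
    end.record()

    torch.cuda.synchronize()
    return start.elapsed_time(end)
```

Because the copy is issued with `non_blocking=True`, the training stream is free to continue while the transfer runs, which is the property that makes GPU-to-GPU transfer attractive for asynchronous checkpointing.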
Experiment 4
- Objective: Optimize the training process by caching checkpoints on GPUs within the same node.
- Script:
code/cs598-AIE/train_t5.slurm
- Output:
code/cs598-AIE/experiment4.out
- Run Command:
sbatch train_t5.slurm
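The idea behind checkpoint caching is to block training only for the fast snapshot step and push the slow persistence step into the background. The stdlib-only sketch below illustrates that pattern with a writer thread; in the actual experiment the "snapshot" would be a GPU-to-GPU copy to a cache device and the write would use `torch.save`, so treat this as a framework-agnostic illustration, not the repository's implementation:

```python
import pickle
import queue
import threading

class AsyncCheckpointer:
    """Persist checkpoints in the background: training blocks only for
    the cheap snapshot, not for the slow write to storage."""

    def __init__(self):
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def _drain(self):
        while True:
            item = self._queue.get()
            if item is None:  # shutdown sentinel
                break
            path, snapshot = item
            # Slow I/O happens off the training thread.
            with open(path, "wb") as f:
                pickle.dump(snapshot, f)

    def save(self, path, state):
        # Fast, blocking step: snapshot the state. In the experiment this
        # corresponds to copying the checkpoint to a cache GPU.
        snapshot = dict(state)
        self._queue.put((path, snapshot))  # slow step runs asynchronously

    def close(self):
        # Flush pending writes and stop the worker.
        self._queue.put(None)
        self._worker.join()
```

Snapshotting before enqueueing matters: it decouples the saved state from the live training state, so the optimizer can keep mutating weights while the previous checkpoint is still being written.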
Experiment 5
- Objective: Optimize the training process by caching checkpoints on GPUs across nodes.
- Script:
code/cs598-AIE/two_nodes/train_t5_two_nodes0.slurm and code/cs598-AIE/two_nodes/train_t5_two_nodes1.slurm
- Output:
code/cs598-AIE/two_nodes/t5_node0.5719534.out and code/cs598-AIE/two_nodes/t5_node0.5719535.out
- Run Command:
sbatch two_nodes/train_t5_two_nodes0.slurm