NB Levanter configs and utils
Check zones for allowed quota
- TPUv4-32 on-demand
export TPU_NAME=levanter-pod-32
gcloud alpha compute tpus queued-resources create $TPU_NAME --node-id $TPU_NAME --project mimir-411610 --zone us-central2-b --accelerator-type v4-32 --runtime-version tpu-vm-v4-base
- TPUv4-32 pre-emptible
export TPU_NAME=levanter-pod-32-pre
gcloud alpha compute tpus queued-resources create $TPU_NAME --node-id $TPU_NAME --project mimir-411610 --zone us-central2-b --accelerator-type v4-32 --runtime-version tpu-vm-v4-base --best-effort
- Check status
gcloud alpha compute tpus queued-resources list --zone us-central2-b
- Stop pod
gcloud alpha compute tpus queued-resources stop $TPU_NAME --zone us-central2-b
- Delete pod (only stopped pods can be deleted)
gcloud alpha compute tpus queued-resources delete $TPU_NAME --zone us-central2-b
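- Describe a specific queued resource (e.g. to check its state before deleting it); same zone as above, assuming your gcloud version exposes the describe verb:
gcloud alpha compute tpus queued-resources describe $TPU_NAME --zone us-central2-b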
Locally, download ttconnect and use it to connect to the pod as the ubuntu user:
export TPU_NAME=levanter-pod-32
./ttconnect $TPU_NAME ubuntu
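If ttconnect is not available, plain gcloud SSH also works; for example, the following runs a command on every worker of the pod (hostname is just a placeholder command):
gcloud compute tpus tpu-vm ssh $TPU_NAME --zone us-central2-b --worker=all --command="hostname"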
Once in the pod, run the following script to create a venv, install dependencies, and mount the NFS volume (the first line of the script avoids interactive dialogs):
curl -s "https://raw.githubusercontent.com/NbAiLab/nb-levanter/main/infra/helpers/setup-tpu-vm-nfs.sh" | bash
Or this one if NFS is not needed:
curl -s "https://raw.githubusercontent.com/NbAiLab/nb-levanter/main/infra/helpers/setup-tpu-vm.sh" | bash
Optionally, mount an NFS volume:
sudo apt-get -qq install -y nfs-common
export NFS_SERVER=10.63.96.66
export MOUNT_POINT="/share"
sudo mkdir -p ${MOUNT_POINT}
# Rewrite /etc/fstab so it contains a single entry for this NFS server
export CURRENT_NFS_ENTRY=$(grep ${NFS_SERVER} /etc/fstab)
export DESIRED_NFS_ENTRY="${NFS_SERVER}:/share ${MOUNT_POINT} nfs defaults 0 0"
grep -v "${NFS_SERVER}" /etc/fstab > /tmp/fstab.new
echo "${DESIRED_NFS_ENTRY}" >> /tmp/fstab.new
# Keep a backup of the original fstab, install the new one, and mount everything in it
sudo cp /etc/fstab /etc/fstab.orig
sudo mv /tmp/fstab.new /etc/fstab
sudo mount -a
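To verify the share is mounted, something like the following should be enough:
df -h ${MOUNT_POINT}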
Optionally, log in to Weights & Biases, Hugging Face, and GitHub:
gh auth login
wandb login
huggingface-cli login
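On a multi-host pod it can be easier to log in non-interactively; a sketch, assuming the tokens are already available as environment variables or files (the GitHub token path is just an example):
wandb login "$WANDB_API_KEY"
huggingface-cli login --token "$HF_TOKEN"
gh auth login --with-token < /path/to/github_token.txt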
Then it's a matter of creating a config in /share/nb-levanter/configs or in a GCS bucket and running it on all VMs:
WANDB_API_KEY=<YOUR KEY HERE> HF_TOKEN=$(cat ~/.cache/huggingface/token) levanter/infra/launch.sh python levanter/src/levanter/main/train_lm.py --config_path /share/nb-levanter/configs/mimir-mistral-7b-extended.yaml
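As a reference, a minimal sketch of what such a config could contain; the field names follow Levanter's example configs, while the bucket paths, model choice, and hyperparameters below are placeholders to adapt (check them against your Levanter version):
data:
  train_urls:
    - "gs://<YOUR_BUCKET>/train.{0000..0127}.jsonl.gz"
  validation_urls:
    - "gs://<YOUR_BUCKET>/validation.jsonl.gz"
  cache_dir: "gs://<YOUR_BUCKET>/tokenized/"
  tokenizer: "mistralai/Mistral-7B-v0.1"
model:
  type: mistral
trainer:
  wandb:
    project: "mimir"
  mp: p=f32,c=bfloat16
  train_batch_size: 256
  num_train_steps: 10000
  checkpointer:
    base_path: "gs://<YOUR_BUCKET>/checkpoints/"
optimizer:
  learning_rate: 1.2e-4
  weight_decay: 0.1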
For resuming, you can either create an extra config file or just invoke the same command with a couple of extra parameters: --trainer.wandb.resume true --trainer.id <WANDB_ID> (see the example below).
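For example, resuming the run above could look like this (the id is the WandB run id of the original run):
WANDB_API_KEY=<YOUR KEY HERE> HF_TOKEN=$(cat ~/.cache/huggingface/token) levanter/infra/launch.sh python levanter/src/levanter/main/train_lm.py --config_path /share/nb-levanter/configs/mimir-mistral-7b-extended.yaml --trainer.wandb.resume true --trainer.id <WANDB_ID>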
- If getting a BarrierTimeoutException: DEADLINE_EXCEEDED: Barrier timed out when writing checkpoints, try setting TENSORSTORE_CURL_LOW_SPEED_TIME_SECONDS=360 TENSORSTORE_CURL_LOW_SPEED_LIMIT_BYTES=256 to force retries (a combined launch example follows this list).
- When processing very long documents, Ray might run out of memory or fail to start or finish tokenization/caching of the dataset. In this case, it might help to reduce the number of CPUs so the global memory is not exhausted, e.g. with SLURM_CPUS_ON_NODE=16 TOKENIZERS_PARALLELISM=false.
- Some options for optimization (untested): LIBTPU_INIT_ARGS='--xla_jf_spmd_threshold_for_windowed_einsum_mib=0 --xla_tpu_spmd_threshold_for_allgather_cse=10000 --xla_enable_async_all_gather=true --xla_tpu_enable_latency_hiding_scheduler=true' TPU_MEGACORE=MEGACORE_DENSE
- Add --trainer.fsdp_axis=null for smaller models (below 1B).
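As a combined example, these variables can simply be prepended to the launch command shown earlier (values are the ones suggested in the notes above and are untested as stated):
TENSORSTORE_CURL_LOW_SPEED_TIME_SECONDS=360 TENSORSTORE_CURL_LOW_SPEED_LIMIT_BYTES=256 SLURM_CPUS_ON_NODE=16 TOKENIZERS_PARALLELISM=false WANDB_API_KEY=<YOUR KEY HERE> HF_TOKEN=$(cat ~/.cache/huggingface/token) levanter/infra/launch.sh python levanter/src/levanter/main/train_lm.py --config_path /share/nb-levanter/configs/mimir-mistral-7b-extended.yaml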