Adding multinode example job + bandwidth benchmark #5
base: main
Conversation
TJ-Solergibert commented Apr 24, 2024
- Little refactor of the docs, including the workflow to run batched jobs (like `sbatch` in Slurm) and properly store stdout & stderr.
- An example of a very simple distributed application with PyTorch, detailing the most important parameters for running distributed jobs and including all the scripts to replicate the experiment.
- A bandwidth benchmark to justify the use of RDMA. I could only run the benchmark with 2 nodes because there are some jobs in the cluster. I submitted a 4-node run for tonight; let's see if it runs (it's just a 1-minute benchmark).
Thanks a lot for adding the benchmark and updating the doc. I did a pass and left comments.
docs/multinode.md (outdated)
To execute jobs in RCP, we will use the RunAI CLI, more specifically the `submit-dist pytorch` function, which will be responsible for launching the specified command on each pod. There are two ways to execute distributed applications:
1. Interactive sessions. To force interactive sessions, we will have to launch the command `sleep infinity` on each pod. This way, we can connect to each pod, but we will have to manually execute the jobs on each one. This is useful for short sessions for debugging applications and checking that everything works correctly before launching a longer job.
> [!TIP]
> Keep in mind that as soon as you disconnect from the pod, you will lose the current job you are executing.
This sentence is suggesting RunAI kills the pod as soon as you disconnect. This is not correct and is confusing. You can launch long-running training jobs by simply using applications such as tmux or screen. Please remove.
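For example, a minimal tmux workflow (the session name is arbitrary) would be:

```bash
# Inside the pod: start a named tmux session and launch the training there
tmux new -s training

# Detach with Ctrl-b d; the job keeps running after you disconnect from the pod.

# Later, reattach to check on the job
tmux attach -t training
```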
If you've worked with Slurm, just from reading the titles of these two points you realize that I've focused on porting its functionality (interactive sessions and batched execution) to runai. The fundamental reason is that if you're going to launch a job on multiple nodes (which is the focus of this document), it's not practical to do what you mention, i.e. `sleep infinity` plus manually launching the jobs node by node, not to mention the potential errors you can make. The workflow I propose addresses this issue and also centralizes the storage of stdout and stderr (even using the `JOB_UUID`, like you would in a Slurm cluster). Moreover, in most clusters interactive sessions have a node limit much lower than that of batched executions, because they want you to run long and big jobs with the batched execution.
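To sketch what I mean by centralizing the output (assuming the pods expose a `JOB_UUID` environment variable, and with the log directory below as a placeholder for a real location on shared storage), the batched script can start with:

```bash
# Placeholder: any directory on shared storage works here
LOG_DIR=${LOG_DIR:-/mloscratch/logs}
mkdir -p "$LOG_DIR"

# Send everything this script prints to per-job files, Slurm-style
exec > "$LOG_DIR/$JOB_UUID.out" 2> "$LOG_DIR/$JOB_UUID.err"

echo "START TIME: $(date)"
# ... rest of the batched job ...
```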
1. Interactive sessions. To force interactive sessions, we will have to launch the command `sleep infinity` on each pod. This way, we can connect to each pod, but we will have to manually execute the jobs on each one. This is useful for short sessions for debugging applications and checking that everything works correctly before launching a longer job.
> [!TIP]
> Keep in mind that as soon as you disconnect from the pod, you will lose the current job you are executing.
2. Batched execution. In this mode, we will specify to the `submit-dist` function to execute a script, and it will defer execution until the requested resources are available. This is the recommended way to launch longer jobs such as model training.
I think using `sleep infinity` works for any training job on RunAI to keep it running. The only thing that is different is how to connect to the different nodes and execute a command on them. It seems runai itself doesn't allow this (or maybe I'm wrong?), but it's possible to use `kubectl get pods` to get the names of the pods and then use `kubectl exec -it` with those names.
I suggest moving this discussion to the end of the document instead of the beginning so we can focus directly on what is different first.
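For example, something along these lines should work (the pod name below is just a placeholder):

```bash
# List the pods created for the job
kubectl get pods

# Open an interactive shell in one of them, using the name printed above
kubectl exec -it example-job-worker-0 -- /bin/bash

# Or run a one-off command without an interactive shell
kubectl exec example-job-worker-0 -- nvidia-smi
```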
Note that it is not possible to control how these pods are scheduled, so these two pods can be either on the same node or on different nodes. For best performance, local GPUs should be maximized, which would mean asking for pods of 8 GPUs each (taking a full node).
To configure the number of nodes and GPUs, we will use the following flags of the `submit-dist` function:
1. `--workers`: The total number of nodes will be `n_workers` + 1, as RunAI adds a master node by default.
2. `--gpu`: The number of GPUs per node. Unless debugging applications, set this value to the number of GPUs per node. Otherwise, it would be possible to orchestrate 2 pods on the same node, which would not make sense.
You removed the example command. Can you add it back, please?
Done!
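Something along these lines is the shape of the command (the job name, image, and paths below are placeholders, and flag spellings can vary between RunAI CLI versions):

```bash
# 4 nodes in total (3 workers + the master RunAI adds), 8 GPUs each
runai submit-dist pytorch \
  --name multinode-example \
  --workers 3 \
  --gpu 8 \
  --image nvcr.io/nvidia/pytorch:24.03-py3 \
  -e PATH_TO_ROOT_REPO=/mloscratch/homes/<user>/getting-started \
  --command -- bash -c 'bash $PATH_TO_ROOT_REPO/utils/distributed_pytorch/benchmark/RCP_run_benchmark_no_RDMA.sh'
```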
torchrun automatically launches a separate process for each GPU and assigns the correct global rank. As such, for basic usage (e.g. FSDP), no changes to Python code are necessary.
I think we should keep this instead of just relying on the example. Please add it back.
I've added, in the paragraph where I talk about `torchrun`, that it sets the correct values for the `WORLD_SIZE` and `RANK` environment variables. `torchrun` doesn't launch a process per GPU, but `--nproc-per-node` processes. It is true that for GPU jobs we use 1 process per GPU, so I've added a comment about setting `--nproc-per-node` to the number of GPUs per node.
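For reference, the launch pattern being described looks roughly like this (the node counts and rendezvous variables are illustrative):

```bash
# 1 process per GPU: with 8 GPUs per node, set --nproc-per-node to 8.
# torchrun then exports RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns.
torchrun \
  --nproc-per-node 8 \
  --nnodes 4 \
  --node-rank "$NODE_RANK" \
  --master-addr "$MASTER_ADDR" \
  --master-port "$MASTER_PORT" \
  main.py
```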
@@ -0,0 +1,17 @@
START TIME: Wed Apr 24 17:12:50 UTC 2024
I don't think these should have been pushed. Please remove all the reports.
Done!
Note the following:
1. We aren't requesting any GPU, as the application doesn't need any.
I'm a bit confused about this. What is the example showing if we are not using GPUs?
Overall, going through the rest of the code, I suggest removing `my_first_distributed_app` and only including the benchmark. I think it already serves as a very nice example.
Correct, the example doesn't use GPUs at all. The point of this example is to demonstrate how to launch jobs with `torchrun`, measure times, use the `RANK` and `WORLD_SIZE` environment variables, and perform a collective call (`dist.reduce`). I know that most people use applications that run on the GPU, but then precisely everything I demonstrate with this example is hidden. The example I've chosen, computing the number PI, is a basic exercise from any MPI/parallelism book.
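As an illustration, the whole exercise runs on plain CPUs, e.g. (the process count is arbitrary):

```bash
# Single node, 4 CPU-only processes: torchrun sets RANK / WORLD_SIZE for each one,
# and the script can initialize torch.distributed (presumably with the gloo backend,
# the usual choice for CPU) and combine partial results via dist.reduce.
torchrun --nproc-per-node 4 --nnodes 1 main.py
```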
--tee 3 \
"

PYTHON_FILE=/mloscratch/homes/solergib/getting-started/utils/distributed_pytorch/benchmark/all_reduce_bench.py
This should be changed to target the `all_reduce_bench.py` in the repo (relative referencing with respect to the script's path should work).
Done! But the user has to set `-e PATH_TO_ROOT_REPO=/mloscratch/homes/solergib/getting-started` when launching the job with runai (that's why I usually prefer absolute paths).
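For completeness, the relative-referencing variant suggested above would look something like this inside the launcher script (a sketch, assuming it lives in the same benchmark directory as `all_reduce_bench.py`):

```bash
# Resolve the directory containing this script, independent of the caller's cwd
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Reference the benchmark relative to the script instead of an absolute path
PYTHON_FILE="$SCRIPT_DIR/all_reduce_bench.py"
```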
export NCCL_SOCKET_NTHREADS=4
export NCCL_NSOCKS_PERTHREAD=8

# MASTER_ADDR -> The IP of the master node
Why not call `utils/distributed_pytorch/benchmark/RCP_run_benchmark_no_RDMA.sh` here?
If you launch a job with `--annotation k8s.v1.cni.cncf.io/networks=kube-system/roce --extended-resource rdma/rdma=1`, you get an IP address that only works with RDMA. In short, running with the RDMA plugin but not using it fails the job.
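In other words, the two setups also differ in the RunAI call itself, roughly like this (all other flags omitted):

```bash
# RDMA run: the pods get RoCE-capable interfaces
runai submit-dist pytorch ... \
  --annotation k8s.v1.cni.cncf.io/networks=kube-system/roce \
  --extended-resource rdma/rdma=1

# Non-RDMA run: drop both flags entirely (and point to the no-RDMA script)
runai submit-dist pytorch ...
```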
#!/bin/bash

echo "START TIME: $(date)"
Maybe add a flag here, instead of two files, to specify whether to use RDMA or not?
I tried adding the RDMA flag, but it's not just that! You also have to remove `--annotation k8s.v1.cni.cncf.io/networks=kube-system/roce --extended-resource rdma/rdma=1` from the runai call. I've just lost 10 minutes trying to figure out why it wasn't working, so let's keep it like that (it's not just adding a simple flag but also changing the runai launch config, which is prone to errors).