Adding multinode example job + bandwidth benchmark #5

Open · wants to merge 12 commits into main
Conversation

TJ-Solergibert (Author):

  • Small refactor of the docs, including the workflow to run batched jobs (like `sbatch` in Slurm) and properly store stdout & stderr.
  • An example of a very simple distributed application with PyTorch, detailing the most important parameters for running distributed jobs and including all the scripts to replicate the experiment.
  • A bandwidth benchmark to justify the use of RDMA. I could only run the benchmark with 2 nodes because there are other jobs in the cluster. I've submitted a 4-node run for tonight; let's see if it goes through (it's just a 1-minute benchmark).

@martinjaggi martinjaggi requested a review from mkrima April 29, 2024 09:28
mkrima (Collaborator) left a comment:

Thanks a lot for adding the benchmark and updating the doc. I did a pass and left comments.

To execute jobs in RCP, we will use the RunAI CLI, more specifically the `submit-dist pytorch` function, which will be responsible for launching the specified command on each pod. There are two ways to execute distributed applications:
1. Interactive sessions. To force interactive sessions, we will have to launch the command `sleep infinity` on each pod. This way, we can connect to each pod, but we will have to manually execute the jobs on each one. This is useful for short sessions for debugging applications and checking that everything works correctly before launching a longer job.
> [!TIP]
> Keep in mind that as soon as you disconnect from the pod, you will lose the current job you are executing.
mkrima (Collaborator):

This sentence suggests that RunAI kills the pod as soon as you disconnect, which is not correct and is confusing. You can launch long-running training jobs by simply using applications such as tmux or screen. Please remove it.
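For example, a minimal `tmux` workflow inside a pod could look like this (the session name and training command are placeholders, not something from this repo):

```bash
# Start a named tmux session inside the pod and run the job in it.
tmux new -s training          # "training" is an arbitrary session name
python train.py               # placeholder for whatever you want to keep running
# Detach with Ctrl-b d; the process keeps running after you disconnect from the pod.
# Reattach later with:
tmux attach -t training
```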

TJ-Solergibert (Author):

If you've worked with Slurm, just by reading the titles of these two points you realize that I've focused on porting its functionality (interactive sessions and batched execution) to RunAI. The fundamental reason is that if you're going to launch a job on multiple nodes (which is the focus of this document), it's not practical to do what you mention, i.e. `sleep infinity` plus manually launching the job node by node, not to mention the potential for errors. The workflow I propose addresses this issue and also centralizes the storage of stdout and stderr (even using the JOB_UUID, as you would in a Slurm cluster). Moreover, in most clusters interactive sessions have a node limit much lower than that of batched executions, because they want you to run long and big jobs through batched execution.
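As a rough sketch of that centralization (assuming a writable scratch directory and that the `JOB_UUID` environment variable mentioned above is available inside the pod), a batched script could start with:

```bash
#!/bin/bash
# Collect stdout and stderr of the whole batched job in a shared directory, keyed by the job UUID.
LOG_DIR="/mloscratch/homes/<user>/logs"        # placeholder path on the shared scratch
mkdir -p "${LOG_DIR}"
exec > "${LOG_DIR}/${JOB_UUID}.out" 2> "${LOG_DIR}/${JOB_UUID}.err"

# ... the actual job (torchrun launch, etc.) follows here ...
```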

1. Interactive sessions. To force interactive sessions, we will have to launch the command `sleep infinity` on each pod. This way, we can connect to each pod, but we will have to manually execute the jobs on each one. This is useful for short sessions for debugging applications and checking that everything works correctly before launching a longer job.
> [!TIP]
> Keep in mind that as soon as you disconnect from the pod, you will lose the current job you are executing.
2. Batched execution. In this mode, we will specify to the `submit-dist` function to execute a script, and it will defer execution until the requested resources are available. This is the recommended way to launch longer jobs such as model training.
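For illustration, an interactive launch in the spirit of point 1 could look roughly like this; only `submit-dist pytorch`, `--workers` and `--gpu` appear in this document, the remaining flags and names are assumed placeholders:

```bash
# Interactive session: every pod just sleeps, so you can attach and run commands by hand.
# --name and -i (image) are placeholders; --workers and --gpu are the flags discussed below.
IMAGE="<your-image>"            # placeholder container image
runai submit-dist pytorch \
  --name debug-session \
  --workers 1 \
  --gpu 8 \
  -i "${IMAGE}" \
  --command -- sleep infinity
```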
mkrima (Collaborator):

I think using `sleep infinity` works for any training job on RunAI to keep it running. The only thing that is different is how to connect to the different nodes and execute a command on them. It seems runai itself doesn't allow this (or maybe I'm wrong?), but it's possible to use `kubectl get pods` to get the names of the pods and then use `kubectl exec -it` with those names.

I suggest moving this discussion to the end of the document instead of the beginning so we can directly focus on what is different first.
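A rough sketch of that `kubectl` route (the job and pod names are hypothetical; the actual pod names depend on how RunAI names them):

```bash
# List the pods belonging to the job, then open a shell in one of them.
JOB_NAME="<job-name>"                       # placeholder job name
kubectl get pods | grep "${JOB_NAME}"
kubectl exec -it "${JOB_NAME}-worker-0" -- /bin/bash   # hypothetical pod name
```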

Note that it is not possible to control how these pods are scheduled, so these two pods can end up either on the same node or on different nodes. For best performance, local GPUs should be maximized, which means asking for pods of 8 GPUs each (taking a full node).
To configure the number of nodes and GPUs, we will use the following flags of the `submit-dist` function:
1. `--workers`: The total number of nodes will be `n_workers` + 1, as RunAI adds a master node by default.
2. `--gpu`: The number of GPUs per node. Unless you are debugging applications, set this value to the number of GPUs per node; otherwise, it would be possible to orchestrate 2 pods on the same node, which would not make sense (see the sketch below).
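As a sketch of how these two flags combine for a batched 2-node, 16-GPU run (only `submit-dist pytorch`, `--workers` and `--gpu` come from the text above; everything else is a placeholder):

```bash
# 1 worker + the master added by RunAI = 2 nodes, 8 GPUs each, running a script rather than sleep infinity.
IMAGE="<your-image>"                 # placeholder container image
JOB_SCRIPT="<path-to-script>.sh"     # placeholder path to the batched job script
runai submit-dist pytorch \
  --name my-training-job \
  --workers 1 \
  --gpu 8 \
  -i "${IMAGE}" \
  --command -- bash "${JOB_SCRIPT}"
```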

mkrima (Collaborator):

you removed the example command. Can you add it back please?

TJ-Solergibert (Author):

Done!

main.py

torchrun automatically launches a separate process for each GPU and assigns the correct global rank. As such, for basic usage (e.g. FSDP), no changes to the Python code are necessary.
mkrima (Collaborator):

I think we should keep this instead of just relying on the example. Please add it back.

TJ-Solergibert (Author):

I've added, in the paragraph where I talk about torchrun, that it sets the correct values for the WORLD_SIZE and RANK environment variables. torchrun doesn't launch one process per GPU but rather `--nproc-per-node` processes. It is true that for GPU jobs we use 1 process per GPU, so I've added a comment about setting `--nproc-per-node` to the number of GPUs per node.
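A hedged sketch of such a launch on one 8-GPU node (`--nproc-per-node` and `MASTER_ADDR` are discussed in this PR; the rendezvous flags and the `NODE_RANK`/`MASTER_PORT` variables are assumptions for illustration):

```bash
# torchrun starts --nproc-per-node processes on this node and sets RANK, LOCAL_RANK and
# WORLD_SIZE for each of them; for GPU jobs, use one process per GPU.
# NODE_RANK would be 0 on the master pod and 1 on the worker pod (assumed variables).
torchrun --nproc-per-node 8 \
         --nnodes 2 \
         --node-rank "${NODE_RANK}" \
         --rdzv-backend c10d \
         --rdzv-endpoint "${MASTER_ADDR}:${MASTER_PORT}" \
         main.py
```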

@@ -0,0 +1,17 @@
START TIME: Wed Apr 24 17:12:50 UTC 2024
mkrima (Collaborator):

I don't think these should have been pushed. Please remove all the reports.

TJ-Solergibert (Author):

Done!


Note the following:
1. We aren't requesting any GPU, as the application doesn't need any.
mkrima (Collaborator):

I'm a bit confused about this. What is the example showing if we are not using GPUs?

Overall, going through the rest of the code, I suggest removing my_first_distributed_app and only including the benchmark. I think the benchmark already serves as a very nice example.

TJ-Solergibert (Author):

Correct, the example doesn't use GPUs at all. The point of this example is to demonstrate how to launch jobs with torchrun, measure times, use the RANK and WORLD_SIZE environment variables, and perform a collective call (dist.reduce). I know that most people use applications that run on the GPU, but in those applications everything I demonstrate here is hidden away. The example I've chosen, computing the number π, is a basic exercise from any MPI/parallelism book.

--tee 3 \
"

PYTHON_FILE=/mloscratch/homes/solergib/getting-started/utils/distributed_pytorch/benchmark/all_reduce_bench.py
mkrima (Collaborator):

This should be changed to target the `all_reduce_bench.py` in the repo (relative referencing with respect to the script's path should work).

TJ-Solergibert (Author) commented on Apr 29, 2024:

Done! But the user has to set `-e PATH_TO_ROOT_REPO=/mloscratch/homes/solergib/getting-started` when launching the job with runai (that's why I usually prefer absolute paths).
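For reference, the two variants discussed here would look roughly like this (a sketch; it assumes the repo layout implied by the paths quoted above):

```bash
# Variant 1: relative to the repo root passed in at submission time (-e PATH_TO_ROOT_REPO=...).
PYTHON_FILE="${PATH_TO_ROOT_REPO}/utils/distributed_pytorch/benchmark/all_reduce_bench.py"

# Variant 2: relative to this launcher script's own location, with no extra environment variable.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PYTHON_FILE="${SCRIPT_DIR}/all_reduce_bench.py"   # assumes the benchmark sits next to the launcher
```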

export NCCL_SOCKET_NTHREADS=4
export NCCL_NSOCKS_PERTHREAD=8

# MASTER_ADDR -> The IP of the master node
mkrima (Collaborator):

Why not call `utils/distributed_pytorch/benchmark/RCP_run_benchmark_no_RDMA.sh` here?

TJ-Solergibert (Author):

If you launch a job with `--annotation k8s.v1.cni.cncf.io/networks=kube-system/roce --extended-resource rdma/rdma=1`, you get an IP address that only works with RDMA. In short, launching with the RDMA plugin but then not using it makes the job fail.
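In other words, the two variants already differ at submission time; an RDMA launch carries the extra flags quoted above, roughly like this (the annotation and extended-resource flags are taken verbatim from this thread, the rest are placeholders):

```bash
# RDMA launch: the annotation plus extended resource give the pods a RoCE/RDMA interface.
# Submitting with them but then not using RDMA makes the job fail, hence the two separate scripts.
IMAGE="<your-image>"                                 # placeholder container image
BENCH_SCRIPT="<path-to-rdma-benchmark-script>.sh"    # placeholder path
runai submit-dist pytorch \
  --name all-reduce-bench \
  --workers 1 \
  --gpu 8 \
  --annotation k8s.v1.cni.cncf.io/networks=kube-system/roce \
  --extended-resource rdma/rdma=1 \
  -i "${IMAGE}" \
  --command -- bash "${BENCH_SCRIPT}"
```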

#!/bin/bash

echo "START TIME: $(date)"

mkrima (Collaborator):

Maybe add a flag here, instead of having two files, to specify whether to use RDMA or not?

TJ-Solergibert (Author):

I tried adding the RDMA flag, but it's not just that! You also have to remove `--annotation k8s.v1.cni.cncf.io/networks=kube-system/roce --extended-resource rdma/rdma=1` from the runai call. I've just lost 10 minutes trying to figure out why it wasn't working, so let's keep it like that (it's not just adding a simple flag but also changing the runai launch config, which is prone to errors).
