Adding multinode example job + bandwidth benchmark #5

Open · wants to merge 12 commits into main
Conversation

TJ-Solergibert (Author):

  • Small refactor of the docs, including the workflow to run batched jobs (like `sbatch` in Slurm) and properly store stdout & stderr.
  • An example of a very simple distributed application with PyTorch, detailing the most important parameters for running distributed jobs and including all the scripts to replicate the experiment.
  • A bandwidth benchmark to justify the use of RDMA. I could only run the benchmark with 2 nodes because there are other jobs in the cluster. I've submitted a 4-node run for tonight; let's see if it goes through (it's just a 1-minute benchmark).

@martinjaggi martinjaggi requested a review from mkrima April 29, 2024 09:28
mkrima (Collaborator) left a comment:

Thanks a lot for adding the benchmark and updating the doc. I did a pass and left comments.

To execute jobs in RCP, we will use the RunAI CLI, more specifically the `submit-dist pytorch` function, which will be responsible for launching the specified command on each pod. There are two ways to execute distributed applications:
1. Interactive sessions. To force interactive sessions, we will have to launch the command `sleep infinity` on each pod. This way, we can connect to each pod, but we will have to manually execute the jobs on each one. This is useful for short sessions for debugging applications and checking that everything works correctly before launching a longer job.
> [!TIP]
> Keep in mind that as soon as you disconnect from the pod, you will lose the current job you are executing.
mkrima (Collaborator):

This sentence suggests that RunAI kills the pod as soon as you disconnect, which is not correct and is confusing. You can launch long-running training jobs by simply using applications such as tmux or screen. Please remove it.
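For example, a minimal `tmux` workflow inside a pod could look like this (the session name and training command are placeholders, not something from this repo):

```bash
# Start a named tmux session inside the pod and run the job in it.
tmux new -s training          # "training" is an arbitrary session name
python train.py               # placeholder for whatever you want to keep running
# Detach with Ctrl-b d; the process keeps running after you disconnect from the pod.
# Reattach later with:
tmux attach -t training
```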

TJ-Solergibert (Author):

If you've worked with Slurm, just by reading the titles of these two points you realize that I've focused on porting its functionality (interactive sessions and batched execution) to RunAI. The fundamental reason is that if you're going to launch a job on multiple nodes (which is the focus of this document), it's not practical to do what you mention, i.e. `sleep infinity` plus manually launching the job node by node, not to mention the potential for errors. The workflow I propose addresses this issue and also centralizes the storage of stdout and stderr (even using the JOB_UUID, as you would in a Slurm cluster). Moreover, in most clusters interactive sessions have a node limit much lower than that of batched executions, because they want you to run long and big jobs through batched execution.
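As a rough sketch of that centralization (assuming a writable scratch directory and that the `JOB_UUID` environment variable mentioned above is available inside the pod), a batched script could start with:

```bash
#!/bin/bash
# Collect stdout and stderr of the whole batched job in a shared directory, keyed by the job UUID.
LOG_DIR="/mloscratch/homes/<user>/logs"        # placeholder path on the shared scratch
mkdir -p "${LOG_DIR}"
exec > "${LOG_DIR}/${JOB_UUID}.out" 2> "${LOG_DIR}/${JOB_UUID}.err"

# ... the actual job (torchrun launch, etc.) follows here ...
```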

1. Interactive sessions. To force interactive sessions, we will have to launch the command `sleep infinity` on each pod. This way, we can connect to each pod, but we will have to manually execute the jobs on each one. This is useful for short sessions for debugging applications and checking that everything works correctly before launching a longer job.
> [!TIP]
> Keep in mind that as soon as you disconnect from the pod, you will lose the current job you are executing.
2. Batched execution. In this mode, we will specify to the `submit-dist` function to execute a script, and it will defer execution until the requested resources are available. This is the recommended way to launch longer jobs such as model training.
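For illustration, an interactive launch in the spirit of point 1 could look roughly like this; only `submit-dist pytorch`, `--workers` and `--gpu` appear in this document, the remaining flags and names are assumed placeholders:

```bash
# Interactive session: every pod just sleeps, so you can attach and run commands by hand.
# --name and -i (image) are placeholders; --workers and --gpu are the flags discussed below.
IMAGE="<your-image>"            # placeholder container image
runai submit-dist pytorch \
  --name debug-session \
  --workers 1 \
  --gpu 8 \
  -i "${IMAGE}" \
  --command -- sleep infinity
```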
mkrima (Collaborator):

I think using `sleep infinity` works for any training job on RunAI to keep it running. The only thing that is different is how to connect to the different nodes and execute a command on them. It seems runai itself doesn't allow this (or maybe I'm wrong?), but it's possible to use `kubectl get pods` to get the names of the pods and then use `kubectl exec -it` with those names.

I suggest moving this discussion to the end of the document instead of the beginning so we can directly focus on what is different first.
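A rough sketch of that `kubectl` route (the job and pod names are hypothetical; the actual pod names depend on how RunAI names them):

```bash
# List the pods belonging to the job, then open a shell in one of them.
JOB_NAME="<job-name>"                       # placeholder job name
kubectl get pods | grep "${JOB_NAME}"
kubectl exec -it "${JOB_NAME}-worker-0" -- /bin/bash   # hypothetical pod name
```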

Note that it is not possible to control how these pods are scheduled, so these two pods can end up either on the same node or on different nodes. For best performance, local GPUs should be maximized, which means asking for pods of 8 GPUs each (taking a full node).
To configure the number of nodes and GPUs, we will use the following flags of the `submit-dist` function:
1. `--workers`: The total number of nodes will be `n_workers` + 1, as RunAI adds a master node by default.
2. `--gpu`: The number of GPUs per node. Unless you are debugging applications, set this value to the number of GPUs per node; otherwise, it would be possible to orchestrate 2 pods on the same node, which would not make sense (see the sketch below).
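As a sketch of how these two flags combine for a batched 2-node, 16-GPU run (only `submit-dist pytorch`, `--workers` and `--gpu` come from the text above; everything else is a placeholder):

```bash
# 1 worker + the master added by RunAI = 2 nodes, 8 GPUs each, running a script rather than sleep infinity.
IMAGE="<your-image>"                 # placeholder container image
JOB_SCRIPT="<path-to-script>.sh"     # placeholder path to the batched job script
runai submit-dist pytorch \
  --name my-training-job \
  --workers 1 \
  --gpu 8 \
  -i "${IMAGE}" \
  --command -- bash "${JOB_SCRIPT}"
```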

mkrima (Collaborator):

you removed the example command. Can you add it back please?

TJ-Solergibert (Author):

Done!

main.py

torchrun automatically launches a separate process for each GPU and assigns the correct global rank. As such, for basic usage (e.g. FSDP), no changes to the Python code are necessary.
mkrima (Collaborator):

I think we should keep this instead of just relying on the example. Please add it back.

TJ-Solergibert (Author):

I've added, in the paragraph where I talk about torchrun, that it sets the correct values for the WORLD_SIZE and RANK environment variables. torchrun doesn't launch one process per GPU but rather `--nproc-per-node` processes. It is true that for GPU jobs we use 1 process per GPU, so I've added a comment about setting `--nproc-per-node` to the number of GPUs per node.
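A hedged sketch of such a launch on one 8-GPU node (`--nproc-per-node` and `MASTER_ADDR` are discussed in this PR; the rendezvous flags and the `NODE_RANK`/`MASTER_PORT` variables are assumptions for illustration):

```bash
# torchrun starts --nproc-per-node processes on this node and sets RANK, LOCAL_RANK and
# WORLD_SIZE for each of them; for GPU jobs, use one process per GPU.
# NODE_RANK would be 0 on the master pod and 1 on the worker pod (assumed variables).
torchrun --nproc-per-node 8 \
         --nnodes 2 \
         --node-rank "${NODE_RANK}" \
         --rdzv-backend c10d \
         --rdzv-endpoint "${MASTER_ADDR}:${MASTER_PORT}" \
         main.py
```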

@@ -0,0 +1,17 @@
START TIME: Wed Apr 24 17:12:50 UTC 2024
mkrima (Collaborator):

I don't think these should have been pushed. Please remove all the reports.

TJ-Solergibert (Author):

Done!


Note the following:
1. We aren't requesting any GPU, as the application doesn't need any.
mkrima (Collaborator):

I'm a bit confused about this. What is the example showing if we are not using GPUs?

Overall, going through the rest of the code, I suggest removing my_first_distributed_app and only including the benchmark. I think the benchmark already serves as a very nice example.

TJ-Solergibert (Author):

Correct, the example doesn't use GPUs at all. The point of this example is to demonstrate how to launch jobs with torchrun, measure times, use the RANK and WORLD_SIZE environment variables, and perform a collective call (dist.reduce). I know that most people use applications that run on the GPU, but in those applications everything I demonstrate here is hidden away. The example I've chosen, computing the number π, is a basic exercise from any MPI/parallelism book.

--tee 3 \
"

PYTHON_FILE=/mloscratch/homes/solergib/getting-started/utils/distributed_pytorch/benchmark/all_reduce_bench.py
mkrima (Collaborator):

This should be changed to target the `all_reduce_bench.py` in the repo (relative referencing with respect to the script's path should work).

TJ-Solergibert (Author) commented on Apr 29, 2024:

Done! But the user has to set `-e PATH_TO_ROOT_REPO=/mloscratch/homes/solergib/getting-started` when launching the job with runai (that's why I usually prefer absolute paths).
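For reference, the two variants discussed here would look roughly like this (a sketch; it assumes the repo layout implied by the paths quoted above):

```bash
# Variant 1: relative to the repo root passed in at submission time (-e PATH_TO_ROOT_REPO=...).
PYTHON_FILE="${PATH_TO_ROOT_REPO}/utils/distributed_pytorch/benchmark/all_reduce_bench.py"

# Variant 2: relative to this launcher script's own location, with no extra environment variable.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PYTHON_FILE="${SCRIPT_DIR}/all_reduce_bench.py"   # assumes the benchmark sits next to the launcher
```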

export NCCL_SOCKET_NTHREADS=4
export NCCL_NSOCKS_PERTHREAD=8

# MASTER_ADDR -> The IP of the master node
mkrima (Collaborator):

Why not call `utils/distributed_pytorch/benchmark/RCP_run_benchmark_no_RDMA.sh` here?

TJ-Solergibert (Author):

If you launch a job with `--annotation k8s.v1.cni.cncf.io/networks=kube-system/roce --extended-resource rdma/rdma=1`, you get an IP address that only works with RDMA. In short, launching with the RDMA plugin but then not using it makes the job fail.
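In other words, the two variants already differ at submission time; an RDMA launch carries the extra flags quoted above, roughly like this (the annotation and extended-resource flags are taken verbatim from this thread, the rest are placeholders):

```bash
# RDMA launch: the annotation plus extended resource give the pods a RoCE/RDMA interface.
# Submitting with them but then not using RDMA makes the job fail, hence the two separate scripts.
IMAGE="<your-image>"                                 # placeholder container image
BENCH_SCRIPT="<path-to-rdma-benchmark-script>.sh"    # placeholder path
runai submit-dist pytorch \
  --name all-reduce-bench \
  --workers 1 \
  --gpu 8 \
  --annotation k8s.v1.cni.cncf.io/networks=kube-system/roce \
  --extended-resource rdma/rdma=1 \
  -i "${IMAGE}" \
  --command -- bash "${BENCH_SCRIPT}"
```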

#!/bin/bash

echo "START TIME: $(date)"

mkrima (Collaborator):

Maybe add a flag here, instead of having two files, to specify whether to use RDMA or not?

TJ-Solergibert (Author):

I tried adding the RDMA flag, but it's not just that! You also have to remove `--annotation k8s.v1.cni.cncf.io/networks=kube-system/roce --extended-resource rdma/rdma=1` from the runai call. I've just lost 10 minutes trying to figure out why it wasn't working, so let's keep it like that (it's not just adding a simple flag but also changing the runai launch config, which is prone to errors).
