Update Scalability Tutorial #262

Merged: 77 commits, Jan 9, 2025
Changes from 75 commits
fa3dc1f
add empty requirements file for cuda
jarlsondre Nov 11, 2024
e9babf9
add requirements files and update pyproject toml
jarlsondre Nov 11, 2024
e994bf4
update pyproject
jarlsondre Nov 11, 2024
4b32a05
update installation in pyproject.toml
jarlsondre Nov 12, 2024
39e5801
update readme and horovod installation script
jarlsondre Nov 12, 2024
c9d786b
update readme with horovod explanation
jarlsondre Nov 12, 2024
8932f36
update horovod installation script
jarlsondre Nov 13, 2024
0906e33
update readme with -e flag
jarlsondre Nov 13, 2024
0d588ad
fix linter readme errors
jarlsondre Nov 13, 2024
750618f
add more info to readme
jarlsondre Nov 13, 2024
00f4454
trailing whitespace 🙃
jarlsondre Nov 13, 2024
ae89e0c
trailing whitespace 🙃 (again)
jarlsondre Nov 13, 2024
149a536
add draft of table of contents to readme
jarlsondre Nov 13, 2024
337ebd9
update readme toc
jarlsondre Nov 13, 2024
7b1cff9
update readme toc again
jarlsondre Nov 13, 2024
2457826
add section about uv lock to readme
jarlsondre Nov 13, 2024
4940963
update toc of readme
jarlsondre Nov 13, 2024
ddc7d13
fix errors in readme
jarlsondre Nov 14, 2024
abff6c1
add version numbers to packages in pyproject.toml
jarlsondre Nov 14, 2024
4eb5352
remove uv.lock (for now)
jarlsondre Nov 14, 2024
c9cbcef
remove link from readme
jarlsondre Nov 14, 2024
eb163ef
put toc in html comment
jarlsondre Nov 14, 2024
a99a674
remove toc, remove ds and horovod from reqs, add docs comment to pyproj
jarlsondre Nov 14, 2024
61e8574
Itwinai jlab Docker image (#236)
matbun Nov 14, 2024
d38385e
Virgo HDF5 file format (#240)
jarlsondre Nov 15, 2024
c51a1c4
add requirements files and update pyproject toml
jarlsondre Nov 11, 2024
76c7863
update installation in pyproject.toml
jarlsondre Nov 12, 2024
468ef94
add pytorch extra to horovod and remove redundant script
jarlsondre Nov 15, 2024
b0cd8ac
update readme tutorial with pip installation
jarlsondre Nov 15, 2024
0bd9a0a
add uv tutorial in separate file
jarlsondre Nov 15, 2024
4b1876b
fix linting errors
jarlsondre Nov 15, 2024
737f70b
update horovod install script
jarlsondre Nov 15, 2024
b8863bd
Merge branch 'uv-package-manager' of github.com:interTwin-eu/itwinai …
jarlsondre Nov 15, 2024
eb8cb08
fix dead link
jarlsondre Nov 15, 2024
7a784f5
update readme
jarlsondre Nov 19, 2024
3ac9313
add uv installation command to readme
jarlsondre Nov 19, 2024
f751912
add requirements files and update pyproject toml
jarlsondre Nov 11, 2024
6f9c5c1
update pyproject
jarlsondre Nov 11, 2024
6e65624
update installation in pyproject.toml
jarlsondre Nov 12, 2024
0a731ed
add version numbers to packages in pyproject.toml
jarlsondre Nov 14, 2024
def18fd
update horovod install script and add pip as dependency
jarlsondre Nov 19, 2024
7379659
fix merge conflicts
jarlsondre Nov 19, 2024
6c8f4db
formatting
jarlsondre Nov 19, 2024
690bed3
fix linting
jarlsondre Nov 19, 2024
9412e48
trailing whitespace
jarlsondre Nov 19, 2024
a23583a
remove comment from readme
jarlsondre Nov 19, 2024
60cbc6f
remove comments and small formatting difference
jarlsondre Nov 19, 2024
dac8d1e
fix profiler bug where profiler is never set to trainer
jarlsondre Nov 21, 2024
c73fad0
begin refactoring the scaling tests
jarlsondre Nov 21, 2024
92c56b6
merge main into branch
jarlsondre Dec 2, 2024
02d15e6
add contributors
jarlsondre Dec 2, 2024
3fd0036
fix linting errors
jarlsondre Dec 2, 2024
6971d42
update scaling test trainers
jarlsondre Dec 3, 2024
e2fa5ea
update plotting code and small bugfix in profiler
jarlsondre Dec 3, 2024
7629c93
tiny update to requirements
jarlsondre Dec 3, 2024
8bcca46
reformat wrt indentations and newlines
jarlsondre Dec 3, 2024
cb971a1
fix layout of plot and use update comm regexes
jarlsondre Dec 3, 2024
65a6334
merge
jarlsondre Dec 16, 2024
a140dd0
more clean up [WIP]
jarlsondre Dec 16, 2024
9f7b1ab
update deepspeed trainer
jarlsondre Dec 17, 2024
4301251
some cleanup
jarlsondre Dec 17, 2024
72a0f37
small cleanup
jarlsondre Dec 18, 2024
a9ec768
fix deepspeed in scalability tutorial
jarlsondre Jan 6, 2025
11e3788
add subset to horovod so it finishes in time
jarlsondre Jan 6, 2025
77a2b91
small cleanup in itwinai trainer
jarlsondre Jan 6, 2025
10e8e2c
update default slurm log dir name
jarlsondre Jan 6, 2025
dba583f
update slurm log directory in config files
jarlsondre Jan 6, 2025
3604aa5
allow user to specify number of nodes for scalability analysis
jarlsondre Jan 7, 2025
003a683
allow user to specify imagenet subset size
jarlsondre Jan 7, 2025
e50bd19
enable epoch time logging for tutorial
jarlsondre Jan 7, 2025
36ff234
update readme
jarlsondre Jan 7, 2025
7ffe353
add folder for scalability metrics
jarlsondre Jan 7, 2025
5b9a469
fix linting errors
jarlsondre Jan 7, 2025
4b6705f
remove import comments in itwinai trainer file
jarlsondre Jan 7, 2025
6897039
sort imports
jarlsondre Jan 7, 2025
5f874f0
small cleanup: comments from PR
jarlsondre Jan 9, 2025
1501646
fix virgo config
jarlsondre Jan 9, 2025
10 changes: 10 additions & 0 deletions env-files/tensorflow/generic_tf.sh
@@ -1,5 +1,15 @@
#!/bin/bash

# --------------------------------------------------------------------------------------
# Part of the interTwin Project: https://www.intertwin.eu/
#
# Created by: Matteo Bunino
#
# Credit:
# - Jarl Sondre Sæther <[email protected]> - CERN
# - Matteo Bunino <[email protected]> - CERN
# --------------------------------------------------------------------------------------

if [ -z "$ENV_NAME" ]; then
ENV_NAME=".venv-tf"
fi
11 changes: 11 additions & 0 deletions env-files/torch/generic_torch.sh
@@ -1,4 +1,15 @@
#!/bin/bash

# --------------------------------------------------------------------------------------
# Part of the interTwin Project: https://www.intertwin.eu/
#
# Created by: Matteo Bunino
#
# Credit:
# - Jarl Sondre Sæther <[email protected]> - CERN
# - Matteo Bunino <[email protected]> - CERN
# --------------------------------------------------------------------------------------

if [ -z "$ENV_NAME" ]; then
ENV_NAME=".venv-pytorch"
fi
10 changes: 10 additions & 0 deletions env-files/torch/install-horovod-deepspeed-cuda.sh
@@ -1,5 +1,15 @@
#!/bin/bash

# --------------------------------------------------------------------------------------
# Part of the interTwin Project: https://www.intertwin.eu/
#
# Created by: Jarl Sondre Sæther
#
# Credit:
# - Jarl Sondre Sæther <[email protected]> - CERN
# - Matteo Bunino <[email protected]> - CERN
# --------------------------------------------------------------------------------------

# DeepSpeed variables
export DS_BUILD_CCL_COMM=1
export DS_BUILD_UTILS=1
1 change: 1 addition & 0 deletions src/itwinai/cli.py
@@ -372,6 +372,7 @@ def exec_pipeline(
print(json.dumps(parser.config, indent=2))
print("#=" * 50)
print()

pipeline = parser.parse_pipeline(pipeline_nested_key=pipe_key)
if steps:
if not re.match(r"\d+(:\d+)?(:\d+)?", steps):
2 changes: 1 addition & 1 deletion src/itwinai/loggers.py
@@ -1177,7 +1177,7 @@ class EpochTimeTracker:
     """Tracker for epoch execution time during training."""

     def __init__(
-        self, strategy_name: str, save_path: Union[Path, str], num_nodes: int
+        self, strategy_name: str, save_path: Path | str, num_nodes: int
     ) -> None:
         if isinstance(save_path, str):
             save_path = Path(save_path)
2 changes: 2 additions & 0 deletions src/itwinai/scalability.py
@@ -108,6 +108,8 @@ def create_absolute_plot(avg_epoch_time_df: pd.DataFrame) -> None:
ax.grid(True)

output_path = Path("plots/absolute_scalability_plot.png")
output_path.parent.mkdir(parents=True, exist_ok=True)
plt.tight_layout()
plt.savefig(output_path)
print(f"Saving absolute plot to '{output_path.resolve()}'.")
sns.reset_orig()
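
The `mkdir(parents=True, exist_ok=True)` call in this hunk guards against the `plots/` directory not existing when the figure is saved. A minimal sketch of that pattern, assuming a temporary directory as the root and a plain file write standing in for `plt.savefig` (both are illustrative choices, not part of the PR):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for "plots/absolute_scalability_plot.png", rooted in a temp dir
    output_path = Path(tmp_dir) / "plots" / "absolute_scalability_plot.png"
    existed_before = output_path.parent.exists()

    # parents=True creates missing ancestors; exist_ok=True makes reruns harmless
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.parent.mkdir(parents=True, exist_ok=True)  # second call is a no-op

    # Stand-in for plt.savefig(output_path)
    output_path.write_bytes(b"not a real PNG")
    saved = output_path.exists()

print(existed_before, saved)
```

Without the `mkdir` call, writing into a missing directory raises `FileNotFoundError`, which is presumably the failure this change avoids on a fresh checkout.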
4 changes: 2 additions & 2 deletions src/itwinai/slurm/slurm_config.yaml
@@ -3,8 +3,8 @@ account: intertwin
 dist_strat: horovod
 time: 00:11:11

-std_out: slurm_jobs/job.out
-err_out: slurm_jobs/err.out
+std_out: slurm_job_Logs/job.out
+err_out: slurm_job_Logs/err.out

 num_nodes: 1
 num_tasks_per_node: 1
14 changes: 8 additions & 6 deletions src/itwinai/slurm/slurm_script_builder.py
@@ -172,6 +172,8 @@ def get_debug_command(self) -> str:
echo ""
echo "### Other Variables ###"
echo "Distributed Strategy: {self.distributed_strategy}"
echo "Current working directory: $(pwd)"
echo "Which python: $(which python)"
"""
debug_print_command = debug_print_command.strip()
return remove_indentation_from_multiline_string(debug_print_command)
@@ -201,10 +203,10 @@ def process_slurm_script(
             self.slurm_script_configuration.job_name = self.generate_identifier()

         if self.slurm_script_configuration.std_out is None:
-            std_out_path = Path("slurm_jobs") / (self.generate_identifier() + ".out")
+            std_out_path = Path("slurm_job_logs") / (self.generate_identifier() + ".out")
             self.slurm_script_configuration.std_out = std_out_path
         if self.slurm_script_configuration.err_out is None:
-            err_out_path = Path("slurm_jobs") / (self.generate_identifier() + ".err")
+            err_out_path = Path("slurm_job_logs") / (self.generate_identifier() + ".err")
             self.slurm_script_configuration.err_out = err_out_path

         # Making sure the std out and err out folders exist
@@ -218,9 +220,9 @@ def process_slurm_script(
         # Generate the script using the given configuration
         script = self.slurm_script_configuration.format_script()
         if not submit_slurm_job and not retain_file:
-            print("#" * 30)
+            print("#" * 20, "SLURM Script Preview", "#"*20)
             print(script)
-            print("#" * 30)
+            print("#" * 62)
             return

         if file_path is None:
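
The closing banner uses 62 characters because `print` joins its arguments with single spaces: 20 hashes, a space, the 20-character title, a space, and 20 more hashes. A small sketch that derives the footer width from the header instead of hard-coding it (the variable names are illustrative, not from the PR):

```python
title = "SLURM Script Preview"
bar = "#" * 20

# print() joins its arguments with a single space, so build the same string explicitly
header = " ".join([bar, title, bar])
footer = "#" * len(header)

print(header)
print(footer)
print(len(header))
```

Deriving the footer from `len(header)` would keep the two lines aligned if the title text ever changes.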
@@ -258,8 +260,8 @@ def run_slurm_script_all_strategies(

             # Overriding job_name, std_out and err_out
             self.slurm_script_configuration.job_name = self.generate_identifier()
-            std_out_path = Path("slurm_jobs") / (self.generate_identifier() + ".out")
-            err_out_path = Path("slurm_jobs") / (self.generate_identifier() + ".err")
+            std_out_path = Path("slurm_job_logs") / (self.generate_identifier() + ".out")
+            err_out_path = Path("slurm_job_logs") / (self.generate_identifier() + ".err")
             self.slurm_script_configuration.std_out = std_out_path
             self.slurm_script_configuration.err_out = err_out_path

40 changes: 34 additions & 6 deletions src/itwinai/slurm/utils.py
@@ -7,6 +7,8 @@
# - Jarl Sondre Sæther <[email protected]> - CERN
# --------------------------------------------------------------------------------------

from typing import List

from itwinai.parser import ArgumentParser


@@ -18,6 +20,31 @@ def remove_indentation_from_multiline_string(multiline_string: str) -> str:
     return "\n".join([line.lstrip() for line in multiline_string.split("\n")])


+def scalability_nodes_list(value: str | List[int]) -> List[int]:
+    """Checks that the value it receives conforms to the comma-separated integer
+    constraint and returns the parsed list if successful.
+
+    Returns:
+        The list of integers that was parsed.
+
+    Raises:
+        ValueError: If unable to parse the integers e.g. due to formatting errors.
+    """
+
+    if isinstance(value, list):
+        if not all([isinstance(x, int) for x in value]):
+            raise ValueError(f"Provided list, '{value}', contains non-integer values.")
+        else:
+            return value
+
+    try:
+        return [int(n) for n in value.split(",")]
+    except ValueError:
+        raise ValueError(
+            f"Invalid input: '{value}', must be formatted as comma-separated integers."
+        )
+
+
 def get_slurm_job_parser() -> ArgumentParser:
     # Default arguments for the SLURM script configuration
     default_account = "intertwin"
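
To see which inputs `scalability_nodes_list` accepts and rejects, here is a self-contained re-statement of the function with a few sample calls. It uses `Union` instead of `str | List[int]` only so the sketch also runs on Python versions older than 3.10, and the sample inputs are illustrative:

```python
from typing import List, Union


def scalability_nodes_list(value: Union[str, List[int]]) -> List[int]:
    """Re-statement of the function from the diff, for illustration only."""
    if isinstance(value, list):
        if not all(isinstance(x, int) for x in value):
            raise ValueError(f"Provided list, '{value}', contains non-integer values.")
        return value
    try:
        return [int(n) for n in value.split(",")]
    except ValueError:
        raise ValueError(
            f"Invalid input: '{value}', must be formatted as comma-separated integers."
        )


parsed_default = scalability_nodes_list("1,2,4,8")
parsed_spaced = scalability_nodes_list("1, 2")  # int() ignores surrounding whitespace
passthrough = scalability_nodes_list([1, 2])

# Empty fields and non-numeric tokens make int() fail, so these are rejected
rejected = []
for bad in ["1,,2", "1,2,", "one,two"]:
    try:
        scalability_nodes_list(bad)
    except ValueError:
        rejected.append(bad)

print(parsed_default, parsed_spaced, passthrough, rejected)
```

Note that a trailing comma is rejected, since the empty string after it cannot be converted to an integer.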
@@ -38,16 +65,11 @@ def get_slurm_job_parser() -> ArgumentParser:
     default_pipe_key = "rnn_training_pipeline"
     default_training_command = None
     default_python_venv = ".venv"
+    default_scalability_nodes = "1,2,4,8"

     parser = ArgumentParser(parser_mode="omegaconf")

     # Arguments specific to the SLURM script configuration
-    parser.add_argument(
-        "--job_name",
-        type=str,
-        default=default_job_name,
-        help="The name of the SLURM job",
-    )
     parser.add_argument(
         "--job-name",
         type=str,
@@ -142,6 +164,12 @@ def get_slurm_job_parser() -> ArgumentParser:
         default=default_python_venv,
         help="Which python venv to use for running the command.",
     )
+    parser.add_argument(
+        "--scalability-nodes",
+        type=scalability_nodes_list,
+        default=default_scalability_nodes,
+        help="A comma-separated list of node numbers to use for the scalability test.",
+    )

     # Boolean arguments where you only need to include the flag and not an actual value
     parser.add_argument(
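
The new argument passes a callable as `type` and a string as `default`. The sketch below shows how that combination behaves with the standard library's `argparse`, which is used here only as a stand-in: the PR uses `itwinai.parser.ArgumentParser`, whose handling of defaults may differ. The helper `comma_separated_ints` is hypothetical and covers only the string case.

```python
import argparse


def comma_separated_ints(value: str) -> list:
    """Hypothetical stand-in for scalability_nodes_list, handling strings only."""
    return [int(n) for n in value.split(",")]


parser = argparse.ArgumentParser()
parser.add_argument("--scalability-nodes", type=comma_separated_ints, default="1,2,4,8")

# argparse runs string defaults through `type`, so both results are lists of ints
default_args = parser.parse_args([])
custom_args = parser.parse_args(["--scalability-nodes", "1,2"])

print(default_args.scalability_nodes)
print(custom_args.scalability_nodes)
```

The list branch in `scalability_nodes_list` suggests the value may also arrive already parsed, for example from a configuration file. That is an inference from the code, not something the diff states.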
5 changes: 3 additions & 2 deletions src/itwinai/torch/monitoring/plotting.py
@@ -107,6 +107,7 @@ def gpu_bar_plot(
raise ValueError(
f"DataFrame is missing the following columns: {missing_columns}"
)

sns.set_theme()

strategies = data_df["strategy"].unique()
@@ -138,9 +139,9 @@ def gpu_bar_plot(
     ax.set_xticklabels(unique_gpu_counts)
     ax.legend(title="Strategy")

-    figure_width = int(1.5 * len(unique_gpu_counts))
-    fig.set_figheight(6)
+    figure_width = max(int(2 * len(unique_gpu_counts)), 8)
     fig.set_figwidth(figure_width)
+    fig.set_figheight(figure_width * 0.8)

     sns.reset_orig()
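
The sizing change replaces a width that could shrink to one inch with a width of at least 8 and a height fixed at 0.8 times the width. A short sketch comparing the two rules for a few group counts, using the hypothetical helpers `old_width` and `figure_size` to mirror the removed and added lines:

```python
def old_width(num_groups: int) -> int:
    """Old rule from the removed line, for comparison."""
    return int(1.5 * num_groups)


def figure_size(num_groups: int) -> tuple:
    """New rule: width is at least 8, height is 0.8 times the width."""
    width = max(int(2 * num_groups), 8)
    return width, width * 0.8


for n in (1, 2, 4, 8):
    print(n, old_width(n), figure_size(n))
```

With a single GPU count the old rule gave a 1 x 6 figure, while the new rule never goes below 8 x 6.4.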

19 changes: 12 additions & 7 deletions src/itwinai/torch/profiling/communication_plot.py
@@ -16,8 +16,6 @@
 import seaborn as sns
 from matplotlib.patches import Patch

-# from itwinai.scalability import convert_matching_files_to_dataframe
-
 # Doing this because otherwise I get an error about X11 Forwarding which I believe
 # is due to the server trying to pass the image to the client computer
 matplotlib.use("Agg")
@@ -40,9 +38,15 @@ def calculate_comp_and_comm_time(df: pd.DataFrame) -> Tuple[float, float]:
             f"\nMissing columns: {missing_columns}"
         )

-    nccl_comm_pattern = (
-        r"ncclKernel_(?:AllReduce|Broadcast|Reduce|AllGather|ReduceScatter|SendRecv)"
-    )
+    comm_types = [
+        "AllReduce",
+        "Broadcast",
+        "Reduce",
+        "AllGather",
+        "Gather",
+        "ReduceScatter",
+    ]
+    nccl_comm_pattern = rf"(?:{'|'.join(comm_types)})"
     cuda_stream_pattern = r"cudaStream(?:WaitEvent|Synchronize)"

     # Any operation that is a part of PyTorch's ATen library is considered a computation
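
The new pattern drops the `ncclKernel_` prefix and the `SendRecv` alternative, so it matches any name that contains one of the listed collective names. The sketch below compares the old and new patterns with `re.search` on a few kernel names. The names are illustrative rather than real profiler output, and the diff does not show how the pattern is applied to the DataFrame, so `re.search` is an assumption.

```python
import re

comm_types = ["AllReduce", "Broadcast", "Reduce", "AllGather", "Gather", "ReduceScatter"]
new_pattern = rf"(?:{'|'.join(comm_types)})"
old_pattern = r"ncclKernel_(?:AllReduce|Broadcast|Reduce|AllGather|ReduceScatter|SendRecv)"

# Illustrative kernel names, not taken from real profiler output
names = [
    "ncclKernel_AllReduce_RING_LL_Sum_float",
    "ncclDevKernel_AllGather_RING_LL",
    "ncclKernel_SendRecv",
    "aten::add",
]

old_matches = [bool(re.search(old_pattern, name)) for name in names]
new_matches = [bool(re.search(new_pattern, name)) for name in names]
print(old_matches)
print(new_matches)
```

Because the new pattern is unanchored, it also matches names with a different prefix, but any non-communication operation whose name happens to contain one of these words would be counted as communication too.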
@@ -133,10 +137,11 @@ def communication_overhead_stacked_bar_plot(
     ax.legend(handles=ax.get_legend_handles_labels()[0] + [hatch_patch])

     # Dynamically adjusting the width of the figure
-    figure_width = int(1.5 * len(gpu_numbers))
-    fig.set_figheight(5)
+    figure_width = max(int(2 * len(gpu_numbers)), 8)
     fig.set_figwidth(figure_width)
+    fig.set_figheight(figure_width * 0.8)

+    # Resetting so that seaborn's theme doesn't affect other plots
     sns.reset_orig()

     return fig, ax
4 changes: 3 additions & 1 deletion src/itwinai/torch/profiling/profiler.py
@@ -89,13 +89,15 @@ def profiled_method(self: TorchTrainer, *args, **kwargs) -> Any:
             warmup_epochs=self.profiling_warmup_epochs,
         )
         with profile(
-            activities=[ProfilerActivity.CUDA],
+            activities=[ProfilerActivity.CUDA, ProfilerActivity.CPU],
             schedule=schedule(
                 wait=wait_epochs,
                 warmup=warmup_epochs,
                 active=active_epochs,
             ),
+            with_modules=True
         ) as profiler:
+            self.profiler = profiler
             result = method(self, *args, **kwargs)

         strategy = self.strategy
3 changes: 2 additions & 1 deletion src/itwinai/torch/trainer.py
@@ -422,7 +422,8 @@ def set_epoch(self, epoch: int) -> None:
         Args:
             epoch (int): epoch number, from 0 to ``epochs-1``.
         """
-        if self.profiler is not None:
+        if self.profiler is not None and epoch > 0:
+            # We don't want to start stepping until after the first epoch
             self.profiler.step()
         self._set_epoch_dataloaders(epoch)
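
The `epoch > 0` guard means the profiler is stepped once per completed epoch rather than once per call to `set_epoch`. A minimal sketch of the effect, assuming `set_epoch` is called at the start of every epoch and using a hypothetical `CountingProfiler` in place of the real `torch.profiler` object:

```python
class CountingProfiler:
    """Hypothetical stand-in that only counts step() calls."""

    def __init__(self) -> None:
        self.steps = 0

    def step(self) -> None:
        self.steps += 1


def run_epochs(num_epochs: int, skip_first: bool) -> int:
    profiler = CountingProfiler()
    for epoch in range(num_epochs):
        # Mirrors the guard in set_epoch: optionally skip the step at epoch 0
        if not skip_first or epoch > 0:
            profiler.step()
    return profiler.steps


steps_without_guard = run_epochs(5, skip_first=False)
steps_with_guard = run_epochs(5, skip_first=True)
print(steps_without_guard, steps_with_guard)
```

Under that assumption, stepping at epoch 0 would advance the wait/warmup/active schedule before any epoch had run, which would shift every profiled epoch by one.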
