Outline arch ideas #184

94 changes: 94 additions & 0 deletions doc/DEV.md
@@ -43,3 +43,97 @@ TBD

## Systems
TBD


# Desired (?) class diagram

```mermaid
---
title: Cloud AI
---
classDiagram
TestRun *-- Test
Job *-- TestRun
Job *-- System

class System {
+run_cmd()
+get_dir()
+get_file()
+...()
}
class Test {
cmd_args
+cmd_as_dict()
+cmd_as_str()
}
class TestRun {
Test test
TestRun[] dependencies
Path _base_output_dir
int num_nodes
str[] nodes
str time_limit

+output_dir()
}
class Job {
int job_id
TestRun tr
System system

+run()
+kill()

+job_id()
+is_running()
+is_done()
+status()
}
```
1. `Test` is the `TestDefinition` from the Pydantic intro PR: a test with all of its arguments. It mirrors a Test.toml, where every parameter is either defined explicitly or takes its default value.
1. `TestRun` is a `Test` instance with `System`-specific parameters, like `num_nodes` for a Slurm system.
1. `Job` is a single runnable unit. `Job` knows how to interact with the system to get required information like job status. It can consist of a single `TestRun` or multiple `TestRun`s. For a Slurm system, this means that a single sbatch script can contain one or more tests.
1. Q: `Test` can have System-dependent parameters.
A: True, but we define `Test` as a set of args, and even now all such parameters are defined in the Test TOML config rather than at the System level. If we still need them, `TestRun` should be able to add or override such parameters.
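
The `Test`/`TestRun` split and the add/override idea from the last item could look roughly like the sketch below. This is only an illustration under assumptions: the `name` field, the `cmd_args_override` field, and the `effective_cmd_args()` method are hypothetical names that do not appear in the diagram.

```python
from dataclasses import dataclass, field


@dataclass
class Test:
    """All arguments of one test, mirroring a Test TOML with defaults applied."""

    name: str  # hypothetical field, used only for illustration
    cmd_args: dict[str, str] = field(default_factory=dict)


@dataclass
class TestRun:
    """A Test plus run-specific parameters such as num_nodes."""

    test: Test
    num_nodes: int = 1
    time_limit: str = ""
    # Hypothetical field: per-run additions/overrides of Test arguments.
    cmd_args_override: dict[str, str] = field(default_factory=dict)

    def effective_cmd_args(self) -> dict[str, str]:
        """Return Test arguments with the run-level overrides applied on top."""
        return {**self.test.cmd_args, **self.cmd_args_override}


test = Test(name="example", cmd_args={"iters": "10", "size": "8M"})
run = TestRun(test=test, num_nodes=2, cmd_args_override={"size": "16M"})
print(run.effective_cmd_args())  # {'iters': '10', 'size': '16M'}
```

Keeping the override on `TestRun` leaves `Test` a plain reflection of its TOML file; the original `test.cmd_args` is not mutated.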

Notes and thoughts:
1. `BaseRunner` and its derivatives are to be merged into the `Job` class.
1. Command line construction: `Test.cmd_as_*()` methods generate the test's actual command line. `Job` and `TestRun` are responsible for generating the launcher command line and inserting the test's command line into it. For Slurm, this means that `srun` args are managed by `Job` and `TestRun`; `Test` should not know about `srun`.
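
A minimal sketch of that separation of concerns, assuming a Slurm system, is shown below. The `executable` field, the `--key value` flag style, and the `SlurmJob`, `launcher_args()`, and `launch_cmd()` names are hypothetical; only `cmd_as_dict()`, `cmd_as_str()`, `num_nodes`, and `time_limit` come from the diagram. The `srun` options used (`--nodes`, `--time`) are standard Slurm options.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class Test:
    """Knows only its own command line, nothing about the launcher."""

    executable: str  # hypothetical field
    cmd_args: dict[str, str] = field(default_factory=dict)

    def cmd_as_dict(self) -> dict[str, str]:
        return dict(self.cmd_args)

    def cmd_as_str(self) -> str:
        # Assumption: arguments are rendered as "--key value" pairs.
        parts = [self.executable]
        for key, value in self.cmd_args.items():
            parts += [f"--{key}", str(value)]
        return shlex.join(parts)


@dataclass
class TestRun:
    """Contributes the run-specific launcher arguments."""

    test: Test
    num_nodes: int = 1
    time_limit: str = ""

    def launcher_args(self) -> list[str]:
        args = [f"--nodes={self.num_nodes}"]
        if self.time_limit:
            args.append(f"--time={self.time_limit}")
        return args


@dataclass
class SlurmJob:
    """Builds the srun command line and inserts the test's command into it."""

    tr: TestRun

    def launch_cmd(self) -> str:
        launcher = shlex.join(["srun", *self.tr.launcher_args()])
        return f"{launcher} {self.tr.test.cmd_as_str()}"


test = Test(executable="my_benchmark", cmd_args={"iters": "10"})
job = SlurmJob(tr=TestRun(test=test, num_nodes=2, time_limit="00:10:00"))
print(job.launch_cmd())
# srun --nodes=2 --time=00:10:00 my_benchmark --iters 10
```

With this layout, supporting another launcher would only require a different `Job` variant; `Test.cmd_as_str()` stays unchanged.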

# Execution flow through the system (current logic)
```mermaid
flowchart TB
ur(A user runs cloudai specifying CLI arguments)
ur --> parsing
sp --> ttp
ttp --> tp
tp --> tsp

subgraph parsing
sp(First, System TOML is parsed into System object. It contains all system-specific parameters, plus output and installation directories.)
ttp(Then, all Test Template TOMLs are parsed into TestTemplate objects. These objects contain env and cmd arguments. TestTemplate requires a System object to construct Strategies: Install, CommandGen, etc. *)
tp(Then, all Test TOMLs are parsed into Test objects. These objects contain TestTemplate as a property. And again, use env and cmd arguments, but also have extra_env and extra_cmd arguments.)
tsp(Finally, Test Scenario TOML is parsed into TestScenario object. It constructs TestRun objects, which contain Test objects and some run-specific parameters like num_nodes.)
end

jr --> st
st --> jm
parsing --> execution
subgraph execution
jr(BaseRunner derivative is created to manage job execution. It takes System and TestScenario objects as arguments.)
st(BaseRunner._submit_test is called to produce Job object.)
jm(Job is monitored by BaseRunner.)
end

execution --> report_generation
subgraph report_generation
end
```
\* Some Strategies are inherited from TestTemplateStrategy and require System, env, and cmd arguments for their construction. Other Strategies do not require any arguments.

## Output directory
The output directory is set once per Cloud AI invocation. It is constructed as follows:
1. `BaseOutputDir = System.output_directory + TestScenario.name + CurrentTime`
1. Each test then adds its own subdirectory to the output directory, e.g. `BaseOutputDir/TestName`
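
The two steps above can be sketched as follows. The formula does not specify how the scenario name and the time are joined or how the time is formatted, so the underscore separator and the `%Y-%m-%d_%H-%M-%S` format here are assumptions, as are the function names.

```python
from datetime import datetime
from pathlib import Path


def base_output_dir(system_output_directory: Path, scenario_name: str, now: datetime) -> Path:
    """Build the per-invocation directory from the system dir, scenario name, and time."""
    # Assumption: name and timestamp are joined with "_" in a single path component.
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return system_output_directory / f"{scenario_name}_{stamp}"


def output_dir_for_test(base: Path, test_name: str) -> Path:
    """Each test gets its own subdirectory under the base output directory."""
    return base / test_name


base = base_output_dir(Path("results"), "demo_scenario", datetime(2024, 1, 2, 3, 4, 5))
print(base.name)  # demo_scenario_2024-01-02_03-04-05
print(output_dir_for_test(base, "TestName").name)  # TestName
```

Passing the current time in as a parameter, rather than calling `datetime.now()` inside the function, keeps the base directory identical for every test in one invocation.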