
Snakemake SLURM Functionality #77

Merged

Conversation


@patricktnast commented Mar 18, 2024

Snakemake SLURM Functionality

Description

Changes and notes

- Added SLURM back to the possible execution methods
- Adjusted Snakemake logging a bit

Verification and Testing

Added unit tests; ran with the pandas, PySpark, and R implementations.

@patricktnast changed the title from "Feature/pnast/mic 4905 cluster" to "Snakemake SLURM Functionality" Mar 19, 2024
@patricktnast

output:

2024-03-19 09:42:43.574 | 0:00:01.063229 | run:90 - Results directory: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43
2024-03-19 09:42:43.783 | 0:00:01.271658 | _get_required_attribute:164 - Assigning default value for container_engine: 'undefined'
2024-03-19 09:42:43.783 | 0:00:01.272345 | _get_spark_requests:197 - Assigning default values for spark: '{'workers': {'num_workers': 2, 'cpus_per_node': 1, 'mem_per_node': 1, 'time_limit': 1}, 'keep_alive': False}'
2024-03-19 09:42:43.804 | 0:00:01.293292 | run:98 - Results directory: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43
2024-03-19 09:42:43.879 | 0:00:01.368531 | main:49 - Running Snakemake

[Tue Mar 19 09:42:45 2024]
Job 3: Validating step_1_python_pyspark_distributed input
Reason: Missing output files: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/input_validations/step_1_python_pyspark_distributed_validator


[Tue Mar 19 09:42:45 2024]
Job 2: Running Implementation step_1_python_pyspark_distributed
Reason: Missing output files: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/intermediate/1_step_1/result.parquet; Input files updated by another job: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/input_validations/step_1_python_pyspark_distributed_validator


[Tue Mar 19 09:44:15 2024]
Job 4: Validating step_2_r input
Reason: Missing output files: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/input_validations/step_2_r_validator; Input files updated by another job: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/intermediate/1_step_1/result.parquet


[Tue Mar 19 09:44:15 2024]
Job 1: Running Implementation step_2_r
Reason: Missing output files: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/result.parquet; Input files updated by another job: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/intermediate/1_step_1/result.parquet, /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/input_validations/step_2_r_validator


[Tue Mar 19 09:44:45 2024]
Job 5: Validating results input
Reason: Missing output files: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/input_validations/final_validator; Input files updated by another job: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/result.parquet


[Tue Mar 19 09:44:45 2024]
Job 0: Grabbing final output
Reason: Input files updated by another job: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/input_validations/final_validator, /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/result.parquet

@patricktnast

results dir tree:

├── diagnostics
│   ├── 1_step_1
│   │   ├── diagnostics.yaml
│   │   ├── step_1_python_pyspark_distributed-output.log
│   │   └── step_1_python_pyspark_distributed-slurm-11068690.log
│   └── 2_step_2
│       ├── diagnostics.yaml
│       ├── step_2_r-output.log
│       └── step_2_r-slurm-11068739.log
├── environment.yaml
├── input_data.yaml
├── input_validations
│   ├── final_validator
│   ├── step_1_python_pyspark_distributed_validator
│   └── step_2_r_validator
├── intermediate
│   └── 1_step_1
│       └── result.parquet
├── pipeline.yaml
├── result.parquet
└── Snakefile

@patricktnast marked this pull request as ready for review March 19, 2024 17:07

@zmbc left a comment


This looks great to me! I want to confirm whether the DRMAA dependency is gone on this branch; I'd like to test on HYAK.

Resolved review threads:
- src/linker/rule.py
- src/linker/runner.py
- src/linker/utilities/paths.py

zmbc commented Mar 19, 2024

Just tested this on HYAK: it doesn't have SLURM_ROOT 😞 Once we change it to just check for an `sbatch` command or similar, it should at least get past that roadblock.


zmbc commented Mar 19, 2024

Manually commenting that bit out, I see `linker: error: argument --executor/-e: invalid choice: 'slurm' (choose from 'local', 'dryrun', 'touch')`. I had to manually run `pip install snakemake-executor-plugin-slurm` to get past this; it should probably be included in the linker dependencies.

@patricktnast

> Manually commenting that bit out, I see `linker: error: argument --executor/-e: invalid choice: 'slurm' (choose from 'local', 'dryrun', 'touch')`. I had to manually run `pip install snakemake-executor-plugin-slurm` to get past this; it should probably be included in the linker dependencies.

Good catch; fixed.
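For reference, the fix presumably amounts to declaring the executor plugin as an install dependency. A hypothetical setup.py-style fragment (the real dependency list and any version pins in the repo may differ):

```python
# Hypothetical setup.py fragment; the actual dependency list and any
# version pins in linker's repo may differ.
install_requires = [
    "snakemake",
    # Snakemake 8 moved the SLURM backend into an executor plugin;
    # without it, `--executor slurm` is rejected as an invalid choice.
    "snakemake-executor-plugin-slurm",
]
```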

 rule:
-    name: "{self.name}"
+    name: "{self.implementation_name}"
+    message: "Running {self.step_name} implementation: {self.implementation_name}"
Collaborator:

`message` is Snakemake's way of logging to stdout?

Contributor (author):

Yeah, it adds to the Snakemake logging; see the example in the stdout posted above.

@@ -121,17 +121,24 @@ def write_implementation_rules(
     validation_file = str(
         results_dir / "input_validations" / implementation.validation_filename
     )
     resources = (
Collaborator:

nit: we may want to specify this as "slurm_resources" (as opposed to spark resources or in the future who knows what else)

Unless your thought is the logic will go here eventually regardless of resource type?

Contributor (author):

Yeah, I think the more general "resources" makes sense, at least on the rule side, given that we may support non-SLURM execution resources, and this would be the place those would go (though not necessarily for something like Spark).
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-standard-resources
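As a sketch of how general (not SLURM-specific) resources could be rendered into a rule's `resources:` block; the helper and key names here are illustrative, not the project's actual schema:

```python
def format_resources(resources: dict) -> str:
    """Render a Snakemake `resources:` block from a settings dict.

    The key names below are illustrative only; the real schema
    (and which keys SLURM vs. other executors consume) may differ.
    """
    lines = [f"        {key}={value!r}," for key, value in resources.items()]
    return "    resources:\n" + "\n".join(lines)

rule_resources = format_resources(
    {"slurm_partition": "all.q", "mem_mb": 4096, "runtime": 60, "cpus_per_task": 1}
)
```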

@@ -38,6 +39,9 @@ def main(
         ## See above
         "--envvars",
         "foo",
+        ## Suppress some of the snakemake output
+        "--quiet",
+        "progress",
Collaborator:

Is this saying to suppress some output (--quiet) but to show progress...bars?

Contributor (author):

"progress" is the argument to "--quiet", it's suppressing some logging about the progress of execution

@@ -11,6 +11,11 @@
 from linker.configuration import Config


+def is_on_slurm() -> bool:
+    """Returns True if the current environment is a SLURM cluster."""
+    return "SLURM_ROOT" in os.environ
Collaborator:

Oh, right: in my unmerged branch for Jenkins, a better check is `return shutil.which("sbatch") is not None`.
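The suggested one-liner, written out as a drop-in replacement (standard library only; the docstring wording is mine):

```python
import shutil

def is_on_slurm() -> bool:
    """Return True if a SLURM scheduler is available on this machine.

    Checking for the sbatch command is more portable than inspecting
    the SLURM_ROOT environment variable, which is not set on every
    cluster (e.g. HYAK).
    """
    return shutil.which("sbatch") is not None
```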

assert snake_str_lines[i].strip() == expected_line.strip()


def test_build_snakefile_slurm(default_config_params, mocker, test_dir):
Collaborator:

I didn't look super closely, but these two tests seem to have lots of duplicated code; can you parameterize?

Contributor (author):

done
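The parameterization could look roughly like this; the builder and the expected fragments below are placeholders, not the project's real test code:

```python
import pytest

def _build_snakefile(environment: str) -> str:
    # Hypothetical stand-in for the real Snakefile builder under test.
    snakefile = "rule all:\n    input: 'result.parquet'\n"
    if environment == "slurm":
        snakefile += "# executor: slurm\n"
    return snakefile

# One parameterized test replaces the two near-duplicate local/slurm tests.
@pytest.mark.parametrize("environment", ["local", "slurm"])
def test_build_snakefile(environment):
    snakefile = _build_snakefile(environment)
    assert snakefile.startswith("rule all:")
    if environment == "slurm":
        assert "slurm" in snakefile
```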

@@ -26,6 +27,15 @@
IN_GITHUB_ACTIONS = os.getenv("GITHUB_ACTIONS") == "true"


def test_is_on_slurm():
Collaborator:

Oh I see, this is just merge-conflict stuff; it's fine as is, and I'll get the better is_on_slurm in my PR.
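The "ignore slurm check on GHA" guard from the commit log could be sketched like this (everything except the skipif pattern is a placeholder for the real test):

```python
import os
import shutil
import pytest

IN_GITHUB_ACTIONS = os.getenv("GITHUB_ACTIONS") == "true"

def is_on_slurm() -> bool:
    # The improved check: look for sbatch rather than SLURM_ROOT.
    return shutil.which("sbatch") is not None

# GitHub Actions runners have no scheduler, so skip the check there;
# the real test body may differ from this placeholder.
@pytest.mark.skipif(IN_GITHUB_ACTIONS, reason="no SLURM on GitHub Actions runners")
def test_is_on_slurm():
    assert isinstance(is_on_slurm(), bool)
```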

@stevebachmeier self-requested a review March 20, 2024 19:39
@stevebachmeier

No e2e tests? (I know they don't work via GitHub Actions yet, but they should be able to work locally.)

@stevebachmeier left a comment

Looks good! Mostly nits and questions, though I think we need the (locally supported only) e2e tests.

@patricktnast

> No e2e tests? (I know they don't work via GitHub Actions yet, but they should be able to work locally.)

I was going to do that as part of https://jira.ihme.washington.edu/secure/RapidBoard.jspa?rapidView=367&projectKey=MIC&view=detail&selectedIssue=MIC-4938# because I anticipated it would conflict with the testing regime you created, and we should resolve those conflicts first.

@patricktnast merged commit 701421b into epic/snakemake-integration Mar 20, 2024
4 checks passed
@patricktnast deleted the feature/pnast/MIC-4905-cluster branch March 20, 2024 21:13
patricktnast added a commit that referenced this pull request Apr 24, 2024
* Basic Snakemake functionality (#73)

* wrap snakefile into linker run

* basic implementation

* some yaml stuff

* remove snakefile stuff

* start writing

* write the snakefile better this time

* newln

* add cache and remote

* remove drmaa log dir

* remove common.smk

* add Rule dataclass

* remove unused utils

* refactor where validations live

* remove snakemake utils

* cleanup

* lint

* add validation rules

* reverse input and output

* lint

* fix some tests

* add back implementation config

* lint

* remove temp

* add other script commands

* remove slurm supprt for now

* remove the false script

* add diagnostics back

* fix test

* add comment

* fix typing

* fix script cmd to property

* revert step change

* adjust some tests

* revert script change again

* add dir for input validations

* lint

* add some type hints

* fix metadata errors

* fix script command

* add new tests

* fix errors

* fix test again

* use pipeline file

* container_path -> image_path

* add missing type hints

* lint

* change to image path

* rename ValidationRule and add comment

* add comment

* fix validations test

* lint

* rename step validation

* ensure rules have right number of lines

* added docstrings for rules

* add todos, clean up

* fix broken test

* Add Snakemake Containers (#76)

* initial solution

* small refactors

* fix pyspark and R

* lint

* fix existing texts

* add unit tests

* simplify implementation metadata

* add snakemake dep

* reformat dep

* check 3.11 and 3.12

* create linker subdir and rename bind_dir

* make paths absolute

* fix the test too

* Snakemake SLURM Functionality (#77)

* add resources

* work on logging

* configure tmp dir

* cleanup

* fix broken tests

* lint

* format output logging

* adjust some comments

* add local / "slurm" arguments test

* add local / slurm specific string tests

* lint

* better check for if on slurm

* use steve's check on slurm

* ingnore slurm check on GHA

* add executor dep

* add step name to rules

* parameterize tests

---------

Co-authored-by: Steve Bachmeier <[email protected]>

* Refactor results_dir into Config (#78)

* absorb results dir into config

* fix linker tmp

* fix numerous tests

* delete unintended files

* change test output dir

* refactor to snake_filepath, also overwrite if exists

* refactor some config items

* lint

* mess around with strings and Paths

* change some updates to set value

* move copy to results

* lint

* reset the old umask

* wrap umask intermediate in try/finally

* wrap umask intermediate in try/finally (#80)

* Resolve Merge Conflicts, e2e and integration testing (#79)

* Feature/sbachmei/mic 4846 local e2e tests (#74)

* better check for if on slurm (#75)

* move specs

* change mem arg

* add new tests

* remove todo

* add debug arg

* move spec dir

* fix broken tests

* add docstrings

* lint

* add comments to todos

* standardize resource names

* remove quoting

* fix integration test

* fix tests

* change to cpus per task

* make keys agnostic to slurm or implementation resources

---------

Co-authored-by: Steve Bachmeier <[email protected]>

* Pin Executor Plugin Package (#81)

* add pin

* lint

* Snakemake Spark (#84)

* make attribute public

* add requires_spark and additional snakefile declarations

* add to rule defs

* add snakefle path

* add spark snakefile

* whoops, we were in minutes

* make it a module

* adjust spark smk

* lint

* change number of workers depending on spark

* use spark resources

* add timeouts

* add slurm logs and output file flexibility

* lint

* adjust existing tests

* remove unused code

* lint

* remove more util tests

* fix tests

* revert metadata

* lint

* revert metadata again

* remove duplicate escapes

* change spark resources

* adjust configuration and defaults

* allow non-int resources

* change a word

* revert change from str to int

* i did a bad job setting things back to the way they were

* remove get jobs helper

* remove import

* adjust spark alloc

* write resources to params

* fix tests

* lint

* adjust params

* add comment

* Jenkins Builds with Snakemake (#85)

* merge main

* reduce shared fs usage

* add debug flag

* lint

* actually do debug

* increase latency wait

* fewer shared fs

* run snakemake from results directory

* fix tests

* try to add different source cache location

* make the variable a string instead

* lint

* remove source cache from shared fs

* try changing cache location

* make source cache results dir

* rearrange

* make source cache in .snakemake

* revert some debugging changes

* add integration tests to jenkins (#87)

* add integration tests to jenkins

* adjust test parameters

* lint

* adjust specification organization

* remove string wrap

---------

Co-authored-by: Steve Bachmeier <[email protected]>
Co-authored-by: Steve Bachmeier <[email protected]>
patricktnast added a commit that referenced this pull request Apr 24, 2024