
Snakemake SLURM Functionality #77

Merged

Conversation


@patricktnast commented Mar 18, 2024

Snakemake SLURM Functionality

Description

Changes and notes

- Added SLURM back to the possible execution methods
- Adjusted Snakemake logging a bit

Verification and Testing

Added unit tests; ran with the pandas, PySpark, and R implementations.

@patricktnast changed the title from "Feature/pnast/mic 4905 cluster" to "Snakemake SLURM Functionality" Mar 19, 2024
@patricktnast

output:

2024-03-19 09:42:43.574 | 0:00:01.063229 | run:90 - Results directory: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43
2024-03-19 09:42:43.783 | 0:00:01.271658 | _get_required_attribute:164 - Assigning default value for container_engine: 'undefined'
2024-03-19 09:42:43.783 | 0:00:01.272345 | _get_spark_requests:197 - Assigning default values for spark: '{'workers': {'num_workers': 2, 'cpus_per_node': 1, 'mem_per_node': 1, 'time_limit': 1}, 'keep_alive': False}'
2024-03-19 09:42:43.804 | 0:00:01.293292 | run:98 - Results directory: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43
2024-03-19 09:42:43.879 | 0:00:01.368531 | main:49 - Running Snakemake

[Tue Mar 19 09:42:45 2024]
Job 3: Validating step_1_python_pyspark_distributed input
Reason: Missing output files: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/input_validations/step_1_python_pyspark_distributed_validator


[Tue Mar 19 09:42:45 2024]
Job 2: Running Implementation step_1_python_pyspark_distributed
Reason: Missing output files: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/intermediate/1_step_1/result.parquet; Input files updated by another job: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/input_validations/step_1_python_pyspark_distributed_validator


[Tue Mar 19 09:44:15 2024]
Job 4: Validating step_2_r input
Reason: Missing output files: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/input_validations/step_2_r_validator; Input files updated by another job: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/intermediate/1_step_1/result.parquet


[Tue Mar 19 09:44:15 2024]
Job 1: Running Implementation step_2_r
Reason: Missing output files: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/result.parquet; Input files updated by another job: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/intermediate/1_step_1/result.parquet, /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/input_validations/step_2_r_validator


[Tue Mar 19 09:44:45 2024]
Job 5: Validating results input
Reason: Missing output files: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/input_validations/final_validator; Input files updated by another job: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/result.parquet


[Tue Mar 19 09:44:45 2024]
Job 0: Grabbing final output
Reason: Input files updated by another job: /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/input_validations/final_validator, /mnt/share/homes/pnast/scratch/linker/2024_03_19_09_42_43/result.parquet

@patricktnast

results dir tree:

├── diagnostics
│   ├── 1_step_1
│   │   ├── diagnostics.yaml
│   │   ├── step_1_python_pyspark_distributed-output.log
│   │   └── step_1_python_pyspark_distributed-slurm-11068690.log
│   └── 2_step_2
│       ├── diagnostics.yaml
│       ├── step_2_r-output.log
│       └── step_2_r-slurm-11068739.log
├── environment.yaml
├── input_data.yaml
├── input_validations
│   ├── final_validator
│   ├── step_1_python_pyspark_distributed_validator
│   └── step_2_r_validator
├── intermediate
│   └── 1_step_1
│       └── result.parquet
├── pipeline.yaml
├── result.parquet
└── Snakefile

@patricktnast marked this pull request as ready for review March 19, 2024 17:07

@zmbc left a comment


This looks great to me! I want to confirm whether the DRMAA dependency is gone on this branch; I'd like to test on HYAK.

Resolved review threads:
- src/linker/rule.py
- src/linker/runner.py
- src/linker/utilities/paths.py

zmbc commented Mar 19, 2024

Just tested this on HYAK: it doesn't have SLURM_ROOT 😞 Once we change it to just check for an `sbatch` command or similar, it should at least get past that roadblock.


zmbc commented Mar 19, 2024

Manually commenting that bit out, I see `linker: error: argument --executor/-e: invalid choice: 'slurm' (choose from 'local', 'dryrun', 'touch')`. I had to manually run `pip install snakemake-executor-plugin-slurm` to get past this; it should probably be included in the linker dependencies.

@patricktnast

> Manually commenting that bit out, I see `linker: error: argument --executor/-e: invalid choice: 'slurm' (choose from 'local', 'dryrun', 'touch')`. I had to manually run `pip install snakemake-executor-plugin-slurm` to get past this; it should probably be included in the linker dependencies.

Good catch; fixed.
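For reference, the fix presumably amounts to declaring the executor plugin as an install dependency. A hypothetical setup.py-style fragment (the real dependency list and any version pins in the repo may differ):

```python
# Hypothetical setup.py fragment; the actual dependency list and any
# version pins in linker's repo may differ.
install_requires = [
    "snakemake",
    # Snakemake 8 moved the SLURM backend into an executor plugin;
    # without it, `--executor slurm` is rejected as an invalid choice.
    "snakemake-executor-plugin-slurm",
]
```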

 rule:
-    name: "{self.name}"
+    name: "{self.implementation_name}"
+    message: "Running {self.step_name} implementation: {self.implementation_name}"
Collaborator:

`message` is Snakemake's way of logging to stdout?

Contributor (author):

Yeah, it adds to the Snakemake logging; see the example in the stdout posted above.

@@ -121,17 +121,24 @@ def write_implementation_rules(
     validation_file = str(
         results_dir / "input_validations" / implementation.validation_filename
     )
     resources = (
Collaborator:

nit: we may want to specify this as "slurm_resources" (as opposed to spark resources or in the future who knows what else)

Unless your thought is the logic will go here eventually regardless of resource type?

Contributor (author):

Yeah, I think the more general "resources" makes sense, at least on the rule side, given that we may support non-SLURM execution resources, and this would be the place those would go (though not necessarily for something like Spark).
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-standard-resources
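As a sketch of how general (not SLURM-specific) resources could be rendered into a rule's `resources:` block; the helper and key names here are illustrative, not the project's actual schema:

```python
def format_resources(resources: dict) -> str:
    """Render a Snakemake `resources:` block from a settings dict.

    The key names below are illustrative only; the real schema
    (and which keys SLURM vs. other executors consume) may differ.
    """
    lines = [f"        {key}={value!r}," for key, value in resources.items()]
    return "    resources:\n" + "\n".join(lines)

rule_resources = format_resources(
    {"slurm_partition": "all.q", "mem_mb": 4096, "runtime": 60, "cpus_per_task": 1}
)
```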

@@ -38,6 +39,9 @@ def main(
         ## See above
         "--envvars",
         "foo",
+        ## Suppress some of the snakemake output
+        "--quiet",
+        "progress",
Collaborator:

Is this saying to suppress some output (--quiet) but to show progress...bars?

Contributor (author):

"progress" is the argument to "--quiet", it's suppressing some logging about the progress of execution

@@ -11,6 +11,11 @@
 from linker.configuration import Config


+def is_on_slurm() -> bool:
+    """Returns True if the current environment is a SLURM cluster."""
+    return "SLURM_ROOT" in os.environ
Collaborator:

Oh, right: in my unmerged branch for Jenkins, a better check is `return shutil.which("sbatch") is not None`.
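The suggested one-liner, written out as a drop-in replacement (standard library only; the docstring wording is mine):

```python
import shutil

def is_on_slurm() -> bool:
    """Return True if a SLURM scheduler is available on this machine.

    Checking for the sbatch command is more portable than inspecting
    the SLURM_ROOT environment variable, which is not set on every
    cluster (e.g. HYAK).
    """
    return shutil.which("sbatch") is not None
```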

assert snake_str_lines[i].strip() == expected_line.strip()


def test_build_snakefile_slurm(default_config_params, mocker, test_dir):
Collaborator:

I didn't look super closely, but these two tests seem to have lots of duplicated code; can you parameterize?

Contributor (author):

done
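The parameterization could look roughly like this; the builder and the expected fragments below are placeholders, not the project's real test code:

```python
import pytest

def _build_snakefile(environment: str) -> str:
    # Hypothetical stand-in for the real Snakefile builder under test.
    snakefile = "rule all:\n    input: 'result.parquet'\n"
    if environment == "slurm":
        snakefile += "# executor: slurm\n"
    return snakefile

# One parameterized test replaces the two near-duplicate local/slurm tests.
@pytest.mark.parametrize("environment", ["local", "slurm"])
def test_build_snakefile(environment):
    snakefile = _build_snakefile(environment)
    assert snakefile.startswith("rule all:")
    if environment == "slurm":
        assert "slurm" in snakefile
```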

@@ -26,6 +27,15 @@
IN_GITHUB_ACTIONS = os.getenv("GITHUB_ACTIONS") == "true"


def test_is_on_slurm():
Collaborator:

Oh I see, this is just merge-conflict stuff; it's fine as is, and I'll get the better is_on_slurm in my PR.
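The "ignore slurm check on GHA" guard from the commit log could be sketched like this (everything except the skipif pattern is a placeholder for the real test):

```python
import os
import shutil
import pytest

IN_GITHUB_ACTIONS = os.getenv("GITHUB_ACTIONS") == "true"

def is_on_slurm() -> bool:
    # The improved check: look for sbatch rather than SLURM_ROOT.
    return shutil.which("sbatch") is not None

# GitHub Actions runners have no scheduler, so skip the check there;
# the real test body may differ from this placeholder.
@pytest.mark.skipif(IN_GITHUB_ACTIONS, reason="no SLURM on GitHub Actions runners")
def test_is_on_slurm():
    assert isinstance(is_on_slurm(), bool)
```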

@stevebachmeier self-requested a review March 20, 2024 19:39
@stevebachmeier

No e2e tests? (I know they don't work via GitHub Actions yet, but they should be able to work locally.)

@stevebachmeier left a comment

Looks good! Mostly nits and questions, though I think we need the (locally supported only) e2e tests.

@patricktnast

> No e2e tests? (I know they don't work via GitHub Actions yet, but they should be able to work locally.)

I was going to do that as part of https://jira.ihme.washington.edu/secure/RapidBoard.jspa?rapidView=367&projectKey=MIC&view=detail&selectedIssue=MIC-4938# because I anticipated it would conflict with the testing regime you created, and we should resolve those conflicts first.

@patricktnast merged commit 701421b into epic/snakemake-integration Mar 20, 2024
4 checks passed
@patricktnast deleted the feature/pnast/MIC-4905-cluster branch March 20, 2024 21:13
patricktnast added a commit that referenced this pull request Apr 24, 2024
* Basic Snakemake functionality (#73)

* wrap snakefile into linker run

* basic implementation

* some yaml stuff

* remove snakefile stuff

* start writing

* write the snakefile better this time

* newln

* add cache and remote

* remove drmaa log dir

* remove common.smk

* add Rule dataclass

* remove unused utils

* refactor where validations live

* remove snakemake utils

* cleanup

* lint

* add validation rules

* reverse input and output

* lint

* fix some tests

* add back implementation config

* lint

* remove temp

* add other script commands

* remove slurm supprt for now

* remove the false script

* add diagnostics back

* fix test

* add comment

* fix typing

* fix script cmd to property

* revert step change

* adjust some tests

* revert script change again

* add dir for input validations

* lint

* add some type hints

* fix metadata errors

* fix script command

* add new tests

* fix errors

* fix test again

* use pipeline file

* container_path -> image_path

* add missing type hints

* lint

* change to image path

* rename ValidationRule and add comment

* add comment

* fix validations test

* lint

* rename step validation

* ensure rules have right number of lines

* added docstrings for rules

* add todos, clean up

* fix broken test

* Add Snakemake Containers (#76)

* initial solution

* small refactors

* fix pyspark and R

* lint

* fix existing texts

* add unit tests

* simplify implementation metadata

* add snakemake dep

* reformat dep

* check 3.11 and 3.12

* create linker subdir and rename bind_dir

* make paths absolute

* fix the test too

* Snakemake SLURM Functionality (#77)

* add resources

* work on logging

* configure tmp dir

* cleanup

* fix broken tests

* lint

* format output logging

* adjust some comments

* add local / "slurm" arguments test

* add local / slurm specific string tests

* lint

* better check for if on slurm

* use steve's check on slurm

* ingnore slurm check on GHA

* add executor dep

* add step name to rules

* parameterize tests

---------

Co-authored-by: Steve Bachmeier <[email protected]>

* Refactor results_dir into Config (#78)

* absorb results dir into config

* fix linker tmp

* fix numerous tests

* delete unintended files

* change test output dir

* refactor to snake_filepath, also overwrite if exists

* refactor some config items

* lint

* mess around with strings and Paths

* change some updates to set value

* move copy to results

* lint

* reset the old umask

* wrap umask intermediate in try/finally

* wrap umask intermediate in try/finally (#80)

* Resolve Merge Conflicts, e2e and integration testing (#79)

* Feature/sbachmei/mic 4846 local e2e tests (#74)

* better check for if on slurm (#75)

* move specs

* change mem arg

* add new tests

* remove todo

* add debug arg

* move spec dir

* fix broken tests

* add docstrings

* lint

* add comments to todos

* standardize resource names

* remove quoting

* fix integration test

* fix tests

* change to cpus per task

* make keys agnostic to slurm or implementation resources

---------

Co-authored-by: Steve Bachmeier <[email protected]>

* Pin Executor Plugin Package (#81)

* add pin

* lint

* Snakemake Spark (#84)

* make attribute public

* add requires_spark and additional snakefile declarations

* add to rule defs

* add snakefle path

* add spark snakefile

* whoops, we were in minutes

* make it a module

* adjust spark smk

* lint

* change number of workers depending on spark

* use spark resources

* add timeouts

* add slurm logs and output file flexibility

* lint

* adjust existing tests

* remove unused code

* lint

* remove more util tests

* fix tests

* revert metadata

* lint

* revert metadata again

* remove duplicate escapes

* change spark resources

* adjust configuration and defaults

* allow non-int resources

* change a word

* revert change from str to int

* i did a bad job setting things back to the way they were

* remove get jobs helper

* remove import

* adjust spark alloc

* write resources to params

* fix tests

* lint

* adjust params

* add comment

* Jenkins Builds with Snakemake (#85)

* merge main

* reduce shared fs usage

* add debug flag

* lint

* actually do debug

* increase latency wait

* fewer shared fs

* run snakemake from results directory

* fix tests

* try to add different source cache location

* make the variable a string instead

* lint

* remove source cache from shared fs

* try changing cache location

* make source cache results dir

* rearrange

* make source cache in .snakemake

* revert some debugging changes

* add integration tests to jenkins (#87)

* add integration tests to jenkins

* adjust test parameters

* lint

* adjust specification organization

* remove string wrap

---------

Co-authored-by: Steve Bachmeier <[email protected]>
Co-authored-by: Steve Bachmeier <[email protected]>
patricktnast added a commit that referenced this pull request Apr 24, 2024