Slurm doesn't exit on completion #33

Closed
Ulthran opened this issue Feb 6, 2024 · 13 comments

Ulthran commented Feb 6, 2024

With past versions of snakemake, there have been issues with the root Slurm job not exiting once everything was completed. This was solved by including a --cluster-status script (e.g. https://github.com/Snakemake-Profiles/slurm/blob/master/%7B%7Bcookiecutter.profile_name%7D%7D/slurm-status.py), but that argument is no longer in snakemake's CLI. Am I right in assuming that means the burden of stopping Slurm properly has moved over to this module?
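
For context, a minimal sketch of the kind of script that flag used to accept (an illustration, not the exact script linked above): snakemake called it with the Slurm job id as its only argument and expected it to print "running", "success", or "failed":

#!/usr/bin/env bash
# hypothetical minimal --cluster-status script for the pre-8.x snakemake CLI
jobid="$1"
state=$(sacct -j "$jobid" -X --noheader --format=State | head -n 1 | awk '{print $1}')
case "$state" in
  COMPLETED) echo success ;;
  FAILED|CANCELLED*|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL) echo failed ;;
  *) echo running ;;
esac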

This is a snippet of the log from a Slurm test I just ran and had to cancel (along with many others that I let run longer), exhibiting the bad behavior:

.
.
.
Finished job 2.
4 of 5 steps (80%) done
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 10}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 9}
Execute 1 jobs...
[Tue Feb  6 17:18:18 2024]
rule test:
    input: /home/runner/work/sunbeam/sunbeam/projects/test/sunbeam_output/qc/00_samples/SHORT_1.fastq.gz, /home/runner/work/sunbeam/sunbeam/projects/test/sunbeam_output/qc/00_samples/SHORT_2.fastq.gz, /home/runner/work/sunbeam/sunbeam/projects/test/sunbeam_output/qc/00_samples/LONG_1.fastq.gz, /home/runner/work/sunbeam/sunbeam/projects/test/sunbeam_output/qc/00_samples/LONG_2.fastq.gz
    jobid: 0
    reason: Rules with a run or shell declaration but no output are always executed.
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, slurm_account=runner, slurm_extra=--job-name=test --output=slurm_test_%j
No wall time information given. This might or might not work on your cluster. If not, specify the resource runtime in your rule or as a reasonable default via --default-resources.
sbatch call: sbatch --job-name 3fae23ac-6a71-41c8-af10-5d691e874141 --output /home/runner/work/sunbeam/sunbeam/.snakemake/slurm_logs/rule_test/%j.log --export=ALL --comment test -A runner -p debug --mem 1000 --cpus-per-task=1 --job-name=test --output=slurm_test_%j -D /home/runner/work/sunbeam/sunbeam --wrap="/usr/share/miniconda/envs/sunbeam/bin/python3.12 -m snakemake --snakefile '/home/runner/work/sunbeam/sunbeam/workflow/Snakefile' --target-jobs 'test:' --allowed-rules 'test' --cores 'all' --attempt 1 --force-use-threads  --resources 'mem_mb=1000' 'mem_mib=954' 'disk_mb=1000' 'disk_mib=954' --wait-for-files '/home/runner/work/sunbeam/sunbeam/.snakemake/tmp.8abv4hta' '/home/runner/work/sunbeam/sunbeam/projects/test/sunbeam_output/qc/00_samples/SHORT_1.fastq.gz' '/home/runner/work/sunbeam/sunbeam/projects/test/sunbeam_output/qc/00_samples/SHORT_2.fastq.gz' '/home/runner/work/sunbeam/sunbeam/projects/test/sunbeam_output/qc/00_samples/LONG_1.fastq.gz' '/home/runner/work/sunbeam/sunbeam/projects/test/sunbeam_ou
Job 0 has been submitted with SLURM jobid 3 (log: /home/runner/work/sunbeam/sunbeam/.snakemake/slurm_logs/rule_test/3.log).
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime now-2days --endtime now --name 3fae23ac-6a71-41c8-af10-5d691e874141
It took: 0.10224318504333496 seconds
The output is:
''
status_of_jobs after sacct is: {}
active_jobs_ids_with_current_sacct_status are: set()
active_jobs_seen_by_sacct are: set()
missing_sacct_status are: set()
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime now-2days --endtime now --name 3fae23ac-6a71-41c8-af10-5d691e874141
It took: 0.10016489028930664 seconds
The output is:
''
status_of_jobs after sacct is: {}
active_jobs_ids_with_current_sacct_status are: set()
active_jobs_seen_by_sacct are: set()
missing_sacct_status are: set()
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime now-2days --endtime now --name 3fae23ac-6a71-41c8-af10-5d691e874141
It took: 0.10007834434509277 seconds
The output is:
''
status_of_jobs after sacct is: {}
active_jobs_ids_with_current_sacct_status are: set()
active_jobs_seen_by_sacct are: set()
missing_sacct_status are: set()
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime now-2days --endtime now --name 3fae23ac-6a71-41c8-af10-5d691e874141
It took: 0.10289597511291504 seconds
The output is:
''
status_of_jobs after sacct is: {}
active_jobs_ids_with_current_sacct_status are: set()
active_jobs_seen_by_sacct are: set()
missing_sacct_status are: set()
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime now-2days --endtime now --name 3fae23ac-6a71-41c8-af10-5d691e874141
It took: 0.10060906410217285 seconds
The output is:
''
status_of_jobs after sacct is: {}
active_jobs_ids_with_current_sacct_status are: set()
active_jobs_seen_by_sacct are: set()
missing_sacct_status are: set()
Error: The operation was canceled.
@cmeesters

Can you please indicate the snakemake version and the plugin version, as well as the command line you have been using? That would be tremendously helpful. Thank you!


Ulthran commented Feb 7, 2024

Snakemake version: 8.4.4
Slurm executor version: 0.3.0
Running snakemake with this profile:

# Default options for running sunbeam on slurm
rerun-incomplete: True
rerun-triggers: "mtime"
latency-wait: 90
keep-going: True
notemp: True
printshellcmds: True
nolock: True
verbose: True

# Environment
software-deployment-method: "conda"

# Cluster configuration
executor: "slurm"
jobs: 10
cores: 24

# Default resource configuration
default-resources:
  - slurm_account="hpcusers" # EDIT THIS TO MATCH YOUR CLUSTER'S ACCOUNT NAME
  - slurm_partition="defq" # EDIT THIS TO MATCH YOUR CLUSTER'S PARTITION NAME
  - mem_mb=8000
  - runtime=15
  - disk_mb=1000
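
Assuming the profile above is saved as, say, profiles/slurm/config.yaml (a hypothetical path, not stated here), the run is then started with:

snakemake --profile profiles/slurm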

@cmeesters

Wait a second: your workflow runs within a GitHub runner?


Ulthran commented Feb 7, 2024

Yes, using https://github.com/koesterlab/setup-slurm-action. I've been seeing the same behavior on a real cluster as well, though.

@cmeesters

Well, the Ansible configuration was the solution after we tinkered with Vagrant ...

I am, however, not sure whether the runner configures the slurmdbd correctly and can provide feedback. As for your real cluster: do you observe the same error message?


Ulthran commented Feb 7, 2024

Yeah, the real cluster shows the same never-ending loop of sacct coming up empty and the query continuing.


cmeesters commented Feb 7, 2024

What is your Slurm version (sinfo --version)? What if you run sacct --noheader -X -o jobname%40, then take one of the job names and run sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime now-2days --endtime now --name <picked name> - what do you see?
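
For reference, that diagnostic sequence as shell commands ("<picked name>" stands for one of the job names returned by the first query):

# list the job names Slurm accounting currently knows about
sacct --noheader -X -o jobname%40

# query one of them exactly the way the executor does
sacct -X --parsable2 --noheader --format=JobIdRaw,State \
      --starttime now-2days --endtime now --name "<picked name>"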

PS: for me, it's bedtime soon. ;-)


Ulthran commented Feb 9, 2024

Sorry for the delay.

sinfo --version: slurm 21.08.8-2

Running sacct on any of the spawned jobs just shows _JOBID_|RUNNING. Running it on the main process shows a list of previously CANCELLED jobs (ones that wouldn't end and that I had to kill) and the current RUNNING one.

Just in case it's helpful, here's the bash script I submit using sbatch run_script.sh:

#!/usr/bin/env bash

#SBATCH --mem=8G
#SBATCH -n 1
#SBATCH --export=ALL
#SBATCH --no-requeue
#SBATCH -t 72:00:00

set -x
set -e

snakemake cmd


Ulthran commented Feb 9, 2024

Oh wait, I think I'm seeing the problem now. Snakemake is running sacct with the random hex-string job name that this executor assigns, but I've been passing the slurm_extra resource to each rule to change the name to something more relevant. So it seems like specifying the job name is what's causing the disconnect.
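
A hedged reproduction of that disconnect (assuming, as the log above suggests, that the last --job-name on the sbatch line wins):

# submit a throwaway job the way the executor effectively did: the generated
# name first, then the user-supplied name appended via slurm_extra
sbatch --job-name my-generated-uuid --wrap="sleep 60" --job-name=test

# polling by the generated name (what the executor does) comes back empty ...
sacct -X --parsable2 --noheader --format=JobIdRaw,JobName,State --name my-generated-uuid

# ... while the user-supplied name finds the job
sacct -X --parsable2 --noheader --format=JobIdRaw,JobName,State --name test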

@cmeesters

Ah, yes. The job names are unique strings for a group of jobs (not a snakemake group job). This way, we can limit the number of queries and put less strain on the slurmdbd.

However, we have the rule name as a comment. So you can query the comment (manually) with sacct, too. For snakemake we leave it the way it is. Just by chance, I am seeing your comments only now, because I have been teaching the whole day. I realized that attaching the sample name to the comment (and extending the docs, as I write in every second issue thread) would make a lot of sense. Next week, I might find the time.
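
For the manual route mentioned above: while a job is pending or running, the comment is visible via squeue's %k field; after the fact it shows up in sacct's Comment column only if the cluster's accounting is configured to store job comments (AccountingStoreFlags=job_comment). Both of these are general Slurm behaviors, not specifics of this plugin:

# comment column for your own queued/running jobs
squeue --me -o "%.10i %.30j %.20k %.10T"

# historical jobs, if slurmdbd stores comments; filter by the rule name you expect
sacct -X --parsable2 --noheader --format=JobIdRaw,JobName,Comment,State | grep "<rule_name>"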


Ulthran commented Feb 9, 2024

Thanks a lot for your help! That would be great; the easier it is to identify each job, the better, IMO. I'm not seeing any comments on the jobs I have running, though. I am still setting each rule's slurm_extra resource to specify the output file name; is it possible that's overwriting the comment somehow?

@cmeesters

No, it is not possible to overwrite the comment. As of now, the comment carries the job name and wildcards. While it is not sensible to accommodate every possible string combination, which one are you missing?


Ulthran commented Feb 26, 2024

Sorry, I think everything I need is covered by #35, once that's merged.

Ulthran closed this as completed Feb 29, 2024