
DRAFT/WIP 20: allow custom submissions to slurm #21

Open
rahmans1 wants to merge 8 commits into main
Conversation

rahmans1
Contributor

Briefly, what does this PR introduce?

Adapt for custom submissions to narval

What kind of change does this PR introduce?

Please check if this PR fulfills the following:

  • Tests for the changes have been added
  • Documentation has been added / updated
  • Changes have been communicated to collaborators

Does this PR introduce breaking changes? What changes might users need to make to their code?

Does this PR change default behavior?

@rahmans1 rahmans1 linked an issue Nov 18, 2023 that may be closed by this pull request
@rahmans1
Contributor Author

Running into this error on slurm due to https://github.com/eic/simulation_campaign_single/blob/main/scripts/run.sh#L87

-rw-r-----. 1 rahmans nogroup    353 Nov 14 01:46 README.md
drwxr-s---. 3 rahmans nogroup  25600 Nov 18 02:42 scripts
drwxr-s---. 2 rahmans nogroup  25600 Nov 18 02:41 templates
-rwxr-x---. 1 rahmans nogroup    112 Nov 14 01:46 test.sh
-rw-r-----. 1 rahmans nogroup    184 Nov 14 01:46 test.submit
SLURM_TMPDIR=/localscratch/rahmans.23022814.0
SLURM_JOB_ID=23022814
SLURM_ARRAY_JOB_ID=23022772
SLURM_ARRAY_TASK_ID=1
_CONDOR_SCRATCH_DIR=
OSG_WN_TMP=
TMPDIR=/localscratch/rahmans.23022814.0
/opt/campaigns/single/scripts/run.sh: Error on line 87: mkdir -p ${TMPDIR}
/home/rahmans/projects/rrg-wdconinc/rahmans/hpc_production_workflow/job_submission_condor/scripts/run.sh: Error on line 44: ${CAMPAIGNS:-/opt/campaigns/${TYPE}}/scripts/run.sh ${INPUT_FILE} ${EVENTS_PER_TASK} ${TASK}
(log file: LOG/SLURM/slurm_23022772/slurm_23022772_1.out)

@wdconinc
Contributor

Is this merging https://github.com/eic/job_submission_slurm/ into this?

@wdconinc
Contributor

It'd be great if that means we can give up on having two repos.

condor_submit -verbose -file ${SUBMIT_FILE}
Contributor

Not needed anymore when SYSTEM==slurm?

@@ -65,9 +89,44 @@ sed "
s|%INPUT_FILES%|${INPUT_FILES}|g;
s|%REQUIREMENTS%|${REQUIREMENTS}|g;
s|%CSV_FILE%|${CSV_FILE}|g;
s|%ACCOUNT%|${ACCOUNT:-rrg-wdconinc}|g;
Contributor

Prefer to have this crash if not specified, rather than have everyone submit under my allocation ;-)

Contributor

Example: ${ACCOUNT:?Define ACCOUNT with the slurm account}, ref: https://www.gnu.org/software/bash/manual/bash.html#Shell-Parameter-Expansion
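For illustration, a minimal standalone sketch of that guard (a hypothetical script, not this PR's actual template expansion):

#!/bin/bash
# ${ACCOUNT:?message} prints the message to stderr and aborts a
# non-interactive shell when ACCOUNT is unset or empty, so a missing
# account crashes the submission instead of silently using a default.
ACCOUNT=${ACCOUNT:?Define ACCOUNT with the slurm account}
echo "Submitting under account ${ACCOUNT}"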

@rahmans1
Contributor Author

It'd be great if that means we can give up on having two repos.

Yup. That's the idea.

This also gives submitters the same familiar interface as the campaign coordinators. So users can, for example, run custom workflows like this on narval or ifarm without having write privileges to xrootd.

CAMPAIGN_OUTPUT=$SHARED/ePIC/ePIC-Campaign-Organizer/Campaign_Output SYSTEM=slurm EBEAM=5 PBEAM=41 DETECTOR_VERSION=main DETECTOR_CONFIG=epic_craterlake JUG_XL_TAG=nightly CSV_FILE=gamma_1GeV_part2.csv ./scripts/submit_csv.sh narval_csv single SINGLE/etaScan/gamma.csv 1

@@ -28,12 +28,32 @@ shift
TARGET=${1:-2}
shift

# environment variable to indicate whether the job is running on condor or slurm
SYSTEM=${SYSTEM:-condor}
Contributor

SYSTEM may be a bit generic. How about SCHEDULER or something like that?


# create command line
EXECUTABLE="./scripts/run.sh"
ARGUMENTS="${TYPE} EVGEN/\$(file).\$(ext) \$(nevents) \$(ichunk)"
EXECUTABLE="$PWD/scripts/run.sh"
Contributor

Probably should use something that determines the directory of the current script, e.g. SCRIPTDIR=$(dirname $0) or so. Then this script can be run from outside the directory, and the files it generates can be better organized.
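A minimal sketch of that idea (the cd/pwd normalization to an absolute path is an assumption beyond the dirname suggestion):

# Resolve the directory containing this script, so it can be invoked from
# any working directory and generated files can be organized relative to it.
SCRIPTDIR=$(cd "$(dirname "$0")" && pwd)
EXECUTABLE="${SCRIPTDIR}/run.sh"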

Comment on lines +44 to +46
if [ ${SYSTEM} = "condor" ]; then
ARGUMENTS="${TYPE} EVGEN/\$(file).\$(ext) \$(nevents) \$(ichunk)"
elif [ ${SYSTEM} = "slurm" ]; then
Contributor

The if/elif/else/fi block can be turned into a case statement.
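For example, a sketch of the equivalent case statement (the branch bodies are taken from the diff context here; the fallback branch is an assumed addition):

case ${SYSTEM} in
  condor)
    ARGUMENTS="${TYPE} EVGEN/\$(file).\$(ext) \$(nevents) \$(ichunk)"
    ;;
  slurm)
    ARGUMENTS="${TYPE}"
    ;;
  *)
    # Assumed guard: fail loudly on an unrecognized scheduler.
    echo "Unknown SYSTEM: ${SYSTEM}" >&2
    exit 1
    ;;
esac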

elif [ ${SYSTEM} = "slurm" ]; then
ARGUMENTS="${TYPE}"
# FIXME: This is not ideal. It prevents from submitting multiple jobs with different JUG_XL_TAG simultaneously.
cd scripts
Contributor

See above for SCRIPTDIR, which would be useful here.

Also, pushd/popd for directory stacks then.
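A sketch of that combination (SCRIPTDIR as suggested above; the body is illustrative):

# Enter the scripts directory via the directory stack instead of a bare cd,
# so the caller's working directory is restored afterwards.
pushd "${SCRIPTDIR}" > /dev/null
# ... generate the per-job submit files here ...
popd > /dev/null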

done
echo "Submitting ${NJOBS} to a ${SYSTEM} system"

if [ ${SYSTEM} = "condor" ]; then
Contributor

Use a case statement here as well.

@rahmans1
Contributor Author

Running into this error on slurm due to https://github.com/eic/simulation_campaign_single/blob/main/scripts/run.sh#L87

@wdconinc Here is a minimal script that tries to just run the same command on the remote node, and surprisingly it doesn't face permission issues when writing that file. Granted, I downloaded eic-shell prior to submitting the job. Really confusing why the main job submission script is failing.

#!/bin/bash
#SBATCH --account=def-wdconinc
#SBATCH --array=1-10
#SBATCH --ntasks=1                   # number of MPI processes
#SBATCH --mem-per-cpu=3G             # memory; default unit is megabytes
#SBATCH --time=0-02:00         # time (DD-HH:MM)
#SBATCH --output=slurm_%A_%a.out # standard output log
#SBATCH --error=slurm_%A_%a.err  # standard error log

echo "SLURM_TMPDIR=${SLURM_TMPDIR:-}"
echo "SLURM_JOB_ID=${SLURM_JOB_ID:-}"
echo "SLURM_ARRAY_JOB_ID=${SLURM_ARRAY_JOB_ID:-}"
echo "SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-}"
echo "_CONDOR_SCRATCH_DIR=${_CONDOR_SCRATCH_DIR:-}"
echo "OSG_WN_TMP=${OSG_WN_TMP:-}"
if [ -n "${SLURM_TMPDIR:-}" ] ; then
  TMPDIR=${SLURM_TMPDIR}
elif [ -n "${_CONDOR_SCRATCH_DIR:-}" ] ; then
  TMPDIR=${_CONDOR_SCRATCH_DIR}
else
  if [ -d "/scratch/slurm/${SLURM_JOB_ID:-}" ] ; then
    TMPDIR="/scratch/slurm/${SLURM_JOB_ID:-}"
  else
    TMPDIR=${TMPDIR:-/tmp}/${$}
  fi
fi

cat << EOF | $PWD/eic-shell

echo "TMPDIR=${TMPDIR}"
mkdir -p ${TMPDIR}
echo "JOB COMPLETED"

EOF

Here is the slurm log file; the job finishes without issues:

SLURM_TMPDIR=/localscratch/rahmans.28422879.0
SLURM_JOB_ID=28422879
SLURM_ARRAY_JOB_ID=28422870
SLURM_ARRAY_TASK_ID=9
_CONDOR_SCRATCH_DIR=
OSG_WN_TMP=
TMPDIR=/localscratch/rahmans.28422879.0
JOB COMPLETED

@rahmans1
Contributor Author

EBEAM=5 PBEAM=41 DETECTOR_VERSION=24.04.0 DETECTOR_CONFIG=epic_craterlake CSV_FILE=gamma_1GeV_small.csv ACCOUNT=def-wdconinc SYSTEM=slurm JUG_XL_TAG=24.04.0-stable ./scripts/submit_csv.sh narval_csv hepmc3 gamma_1GeV_small.csv 2

where gamma_1GeV_small.csv contains

SINGLE/gamma/1GeV/etaScan/gamma_1GeV_eta0.0,steer,4115,0000
SINGLE/gamma/1GeV/etaScan/gamma_1GeV_eta0.0,steer,4115,0001

And that leads to this log:


[rahmans@narval1 job_submission_condor]$ cat LOG/SLURM/slurm_28422945/slurm_28422945_1.out
SINGLE/gamma/1GeV/etaScan/gamma_1GeV_eta0.0
steer
4115
0000
Mon Apr 29 07:54:29 PM EDT 2024
environment-2024-04-29T19:18-04:00.sh
environment-2024-04-29T19:20-04:00.sh
environment-2024-04-29T19:27-04:00.sh
environment-2024-04-29T19:54-04:00.sh
environment-2024-04-29T19:18-04:00.sh:export S3_ACCESS_KEY=
environment-2024-04-29T19:18-04:00.sh:export S3RW_ACCESS_KEY=
environment-2024-04-29T19:18-04:00.sh:export DETECTOR_VERSION=24.04.0
environment-2024-04-29T19:18-04:00.sh:export DETECTOR_CONFIG=epic_craterlake
environment-2024-04-29T19:18-04:00.sh:export EBEAM=5
environment-2024-04-29T19:18-04:00.sh:export PBEAM=41
environment-2024-04-29T19:20-04:00.sh:export S3_ACCESS_KEY=
environment-2024-04-29T19:20-04:00.sh:export S3RW_ACCESS_KEY=
environment-2024-04-29T19:20-04:00.sh:export DETECTOR_VERSION=24.04.0
environment-2024-04-29T19:20-04:00.sh:export DETECTOR_CONFIG=epic_craterlake
environment-2024-04-29T19:20-04:00.sh:export EBEAM=5
environment-2024-04-29T19:20-04:00.sh:export PBEAM=41
environment-2024-04-29T19:27-04:00.sh:export S3_ACCESS_KEY=
environment-2024-04-29T19:27-04:00.sh:export S3RW_ACCESS_KEY=
environment-2024-04-29T19:27-04:00.sh:export DETECTOR_VERSION=24.04.0
environment-2024-04-29T19:27-04:00.sh:export DETECTOR_CONFIG=epic_craterlake
environment-2024-04-29T19:27-04:00.sh:export EBEAM=5
environment-2024-04-29T19:27-04:00.sh:export PBEAM=41
environment-2024-04-29T19:54-04:00.sh:export S3_ACCESS_KEY=
environment-2024-04-29T19:54-04:00.sh:export S3RW_ACCESS_KEY=
environment-2024-04-29T19:54-04:00.sh:export DETECTOR_VERSION=24.04.0
environment-2024-04-29T19:54-04:00.sh:export DETECTOR_CONFIG=epic_craterlake
environment-2024-04-29T19:54-04:00.sh:export EBEAM=5
environment-2024-04-29T19:54-04:00.sh:export PBEAM=41
date sys: Mon Apr 29 07:54:29 PM EDT 2024
date web: /opt/campaigns/hepmc3/scripts/run.sh: Error on line 25: date -d "$(curl --insecure --head --silent --max-redirs 0 google.com 2>&1 | grep Date: | cut -d' ' -f2-7)"
hostname: nc11103.narval.calcul.quebec
uname:    Linux nc11103.narval.calcul.quebec 4.18.0-513.24.1.el8_9.x86_64 #1 SMP Thu Apr 4 18:13:02 UTC 2024 x86_64 GNU/Linux
whoami:   rahmans
pwd:      /lustre06/project/6061913/rahmans/job_submission_condor
site:
resource:
http_proxy:
Filesystem                                   Size  Used Avail Use% Mounted on
fuse-overlayfs                                64M   12K   64M   1% /
devtmpfs                                     126G     0  126G   0% /dev
10.82.48.1@o2ib1:10.82.48.2@o2ib1:/lustre05   48G  5.9G   42G  13% /home/rahmans
pool/localscratch                            805G   43M  805G   1% /var/tmp
10.82.48.7@o2ib1:10.82.48.8@o2ib1:/lustre07  5.7P  2.5P  3.2P  45% /scratch
10.82.48.3@o2ib1:10.82.48.4@o2ib1:/lustre06   32P   15P   18P  46% /lustre06
total 573
drwxr-s---.  6 rahmans nogroup 25600 Apr 29 19:54 .
drwx--S---. 14 rahmans nogroup 33792 Apr 29 19:52 ..
-rw-r-----.  1 rahmans nogroup   194 Apr 29 19:18 environment-2024-04-29T19:18-04:00.sh
-rw-r-----.  1 rahmans nogroup   194 Apr 29 19:20 environment-2024-04-29T19:20-04:00.sh
-rw-r-----.  1 rahmans nogroup   194 Apr 29 19:27 environment-2024-04-29T19:27-04:00.sh
-rw-r-----.  1 rahmans nogroup   194 Apr 29 19:54 environment-2024-04-29T19:54-04:00.sh
-rw-r-----.  1 rahmans nogroup 60264 Apr 29 19:09 gamma_1GeV_part1.csv
-rw-r-----.  1 rahmans nogroup  1053 Apr 29 19:18 gamma_1GeV_part1.submit
-rw-r-----.  1 rahmans nogroup   120 Apr 29 19:20 gamma_1GeV_small.csv
-rw-r-----.  1 rahmans nogroup  1050 Apr 29 19:54 gamma_1GeV_small.submit
drwxr-s---.  7 rahmans nogroup 25600 Apr 29 19:05 .git
-rw-r-----.  1 rahmans nogroup   231 Apr 29 19:05 .gitignore
drwxr-s---.  4 rahmans nogroup 25600 Apr 29 19:20 LOG
-rw-r-----.  1 rahmans nogroup   353 Apr 29 19:05 README.md
drwxr-s---.  3 rahmans nogroup 25600 Apr 29 19:54 scripts
drwxr-s---.  2 rahmans nogroup 25600 Apr 29 19:10 templates
-rwxr-x---.  1 rahmans nogroup   112 Apr 29 19:05 test.sh
-rw-r-----.  1 rahmans nogroup   184 Apr 29 19:05 test.submit
SLURM_TMPDIR=/localscratch/rahmans.28422946.0
SLURM_JOB_ID=28422946
SLURM_ARRAY_JOB_ID=28422945
SLURM_ARRAY_TASK_ID=1
_CONDOR_SCRATCH_DIR=
OSG_WN_TMP=
TMPDIR=/localscratch/rahmans.28422946.0
/opt/campaigns/hepmc3/scripts/run.sh: Error on line 87: mkdir -p ${TMPDIR}
/home/rahmans/projects/rrg-wdconinc/rahmans/job_submission_condor/scripts/run.sh: Error on line 44: ${CAMPAIGNS:-/opt/campaigns/${TYPE}}/scripts/run.sh ${INPUT_FILE} ${EVENTS_PER_TASK} ${TASK}

@wdconinc
Contributor

wdconinc commented May 1, 2024

This issue can be resolved here with:

diff --git a/templates/narval_csv.submit.in b/templates/narval_csv.submit.in
index 7352b6a..23a5b49 100644
--- a/templates/narval_csv.submit.in
+++ b/templates/narval_csv.submit.in
@@ -15,6 +15,7 @@ echo $ext
 echo $nevents
 echo $ichunk
 
+export SINGULARITY_OPTIONS="--bind $SLURM_TMPDIR"
 cat << EOF | $(dirname %EXECUTABLE%)/eic-shell
 
 %EXECUTABLE% %ARGUMENTS% EVGEN/${file}.${ext} ${nevents} ${ichunk}

but that doesn't actually get us any further on the osg condor side.
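For context, my reading of the failure (an assumption about the container setup, not verified here): apptainer/singularity only exposes host paths that are explicitly bind-mounted, so the node-local ${SLURM_TMPDIR} does not exist inside eic-shell and the mkdir -p ${TMPDIR} on line 87 aborts. A minimal way to exercise the same bind by hand, assuming eic-shell sits in the current directory:

export SINGULARITY_OPTIONS="--bind $SLURM_TMPDIR"
cat << EOF | ./eic-shell
# With the bind in place this succeeds; without it the path is missing
# inside the container and mkdir fails as in the logs above.
mkdir -p ${SLURM_TMPDIR}/bindtest && echo "bind works"
EOF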

@wdconinc
Contributor

wdconinc commented May 1, 2024

(with that diff it just fails a few lines later, but ok)

Successfully merging this pull request may close these issues.

Adapt the workflow for custom submissions to HPC/slurm systems