Why do some records have error set? #772
While running calculations, I find that some records have their error field set, even though the status field does not indicate an error. Some are running and others are waiting. In all cases the value of the error field is
What does this mean?

Comments
It appears that of all my managers running on different nodes, only one is successfully completing tasks. All the others fail every task. The log gives no indication of why:
Querying the records, the
For the first question: If the
For the second question: Task
The task vs. record ID trips up a lot of people (and rightly so). I will make a PR soon that makes the difference explicit in the manager logs (for example, by printing both the task ID and the record ID).
Does that mean the root problem is that it ran out of memory? Is there anything I can do about that? These are large systems, but not huge: 90 atoms. And it's running on a node with 256 GB of memory. I've noticed from the log that it acquires three tasks at a time and processes them (I assume?) in parallel. Is there a way to tell it to attempt fewer tasks at a time?
It depends on how many jobs you are running in parallel. Could you post the executors section from the configuration? The manager will pull down more tasks than it can compute so that there is a buffer (since there is a delay between tasks finishing and the manager fetching more tasks). Looking at that error a little bit, I'm not sure I can make sense of it (it says both that it cannot allocate memory and that it cannot write to disk). Let me ask the psi4 developers. I will take a look at the others in this dataset and see if there's any pattern, or if psi4 just has an issue with that one.
executors:
  local_executor:
    type: local
    max_workers: $MAX_WORKERS # max number of workers to spawn
    cores_per_worker: 32 # cores per worker
    memory_per_worker: 216 # memory per worker, in GiB
    scratch_directory: "$L_SCRATCH/$SLURM_JOBID"
    queue_tags:
      - spice-psi4-181
    environments:
      use_manager_environment: False
      conda:
        - qcfractal-worker-psi4-18.1 # name of conda env used by worker; see below for example
    worker_init:
      - source /home/users/peastman/worker_init.sh

The one node where it works has 128 cores and 1 TB of memory. On that node,
Here is the submission script:

#! /usr/bin/bash
#SBATCH --job-name=qcfractal
#SBATCH --partition=normal
#SBATCH -t 2-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=7gb
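# (total request: 32 tasks x 1 CPU x 7 GB per CPU = roughly 224 GB on the node)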
## USAGE
# Make sure to run bashrc
source $HOME/.bashrc
# Don't limit stack size
ulimit -s unlimited
# Activate QCFractal conda env
conda activate qcfractalcompute
# Create a YAML file with specific substitutions
export MAX_WORKERS=1
envsubst < qcfractal-manager-config.yml > configs/config.${SLURM_JOBID}.yml
# Run qcfractal-compute-manager
qcfractal-compute-manager --config configs/config.${SLURM_JOBID}.yml

Some things I tried that didn't help:
I'm not positive (I'm not a SLURM expert), but I believe this might limit things to 1 core / 7 GB of memory (because ntasks is largely for MPI). I am trying to verify that, but SLURM is one of those things I have to re-learn every time I use it. Could you try the following (it just swaps ntasks and cpus-per-task)?
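A minimal sketch of that swap, assuming the rest of the submission script stays the same (these exact directives are inferred from the comment above, not quoted from it):

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=7gb   # unchanged; SLURM still allocates 32 CPUs x 7 GB = ~224 GB for the job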
I submitted a job with those settings. I'll let you know what happens.
It still failed. Check task 18909693.
@bennybp: Any chance you are able to look at this?
I will try running the
It presumably completed on a node with 1 TB of memory. I can run these calculations on those nodes, but not on ones with 256 GB.
We might have to debug this live over Zoom. There are a few possibilities. One possibility is that a ramdisk is being used for storage (not sure how, since I just ran it interactively on one of our nodes and there wasn't any problem). I ran several at the same time in order to fill up the local scratch. This caused an error, but it was not the PSIO error we've seen. It could have caused the "unknown errors", though.

I submitted the errored task to a private instance to make sure the managers behave correctly. I submitted a manager job pretty much identical to yours, and it all behaves as expected, with only one psi4 job running using 32 cores. One difference is that I submitted mine with

If you're interested in a live debugging session, send me an email. I will have time next week.
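One quick way to check whether the scratch path is actually backed by a ramdisk/tmpfs (the path here is just the scratch_directory from the config earlier in the thread; run it inside the job's environment):

df -hT "$L_SCRATCH/$SLURM_JOBID"   # a filesystem Type of tmpfs means the "disk" is really RAM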
I'll try using
That dataset has now finished, computed only on the 1 TB, 128-core nodes. The dataset I'm working on right now has smaller molecules, 50 atoms maximum, that can run successfully on the smaller nodes. But one of the other datasets I'll be computing later has even bigger ones, up to 110 atoms.
This still sounds like a local cluster batch submission script configuration issue, right? @dotsdl @mikemhenry: Is there any way for us to try running workers on our local cluster (lilac) as well, where we are more confident we have correct configuration settings, so they can monitor whether those fail too? Eventually we will want to pass our own QCFractal responsibilities to @chrisiacovella too, so this might be a good option for training.
It's still failing when using
How did you find the error log shown above? As before, all three records have