Nipype support with fuse filesystems #3288
@sulantha2006 - there is an execution config option (see https://miykael.github.io/nipype_tutorial/notebooks/basic_execution_configuration.html#Execution).
Sure, I will try, but does this param work when SGE/PBS or any other execution plugin is not used? I run the pipeline as workflow.run().
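For context, a minimal sketch of setting an execution config option on a workflow before calling `workflow.run()`. The option name `job_finished_timeout` and the paths are assumptions here, since the thread does not quote the exact option; check the tutorial page linked above for the intended one.

```python
# Hedged sketch: tweaking Nipype's execution config before workflow.run().
# `job_finished_timeout` is assumed to be the relevant option; the paths
# are hypothetical placeholders for a FUSE-mounted workdir.
from nipype import Workflow, Node
from nipype.interfaces.utility import IdentityInterface

wf = Workflow(name="example_wf", base_dir="/mnt/gcsfuse/workdir")

# A trivial placeholder node so the workflow can actually run.
passthrough = Node(IdentityInterface(fields=["x"]), name="passthrough")
passthrough.inputs.x = 1
wf.add_nodes([passthrough])

wf.config["execution"] = {
    "job_finished_timeout": 60,  # seconds to keep looking for expected result files
    "poll_sleep_duration": 30,   # seconds between status checks for submitted jobs
}

wf.run()  # plain serial run; whether the option has any effect here is exactly the question above
```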
Any distributed plugin, if I remember correctly, but it may have been created specifically for slurm/pbs/etc. However, I would not necessarily expect an issue with FUSE on a single system (since the local FUSE mapping should always be OK), only across systems. But perhaps you are noticing this on a single system and are not using a batch system like slurm. Are you running into this error when running the container on a given machine, simply pointing at a working directory that is over FUSE? One option would be to make the work dir local and then copy it over if necessary.
So my setup is a slurm cluster, but I don't install anything on the controller or login node; I use it only to submit jobs. I submit one slurm job per image I have to process, which calls a python script inside a container. This script is a nipype workflow, and the container has all the tools and nipype installed.
The workdir is a mounted fuse dir bound to the container. I also have a data input and a datasink mounted the same way; I didn't see any issues with those.
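As a concrete illustration of that layout, here is a hedged sketch of how the FUSE-backed workdir and datasink might appear in the script running inside the container. All paths and names below are hypothetical, not taken from the reporter's setup.

```python
# Hedged sketch of the described setup: workdir and DataSink both on
# FUSE-backed bind mounts inside the container. Paths are hypothetical.
from nipype import Workflow, Node
from nipype.interfaces.io import DataSink

wf = Workflow(name="per_image_wf",
              base_dir="/mnt/bucket/workdir")           # gcsfuse bind mount used as workdir

sink = Node(DataSink(), name="datasink")
sink.inputs.base_directory = "/mnt/bucket/derivatives"  # another FUSE-mounted location
wf.add_nodes([sink])

# ... the actual processing nodes would be connected here, ending in the DataSink ...
```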
Each compute node is dynamic with only a small amount of local storage, so a local workdir is not possible. Also, I want to keep the workflow files unmodified so that time-consuming tasks are skipped on reprocessing.
Also, if I were to transfer all the workdir content before and after a job, I assume it would significantly increase cloud costs.
The failures are random. Out of 950 jobs, about 30-50 failed with some intermediate file missing from the workdir (the output of a previous node). The others completed successfully.
My assumption is that the previous node's output is read before the file has been written and become visible through the fuse system.
The ideal solution would probably be a delayed retry for node inputs. Is that easy to implement?
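To make the idea concrete, a minimal sketch of the kind of delayed retry being proposed, written as a standalone helper rather than anything that exists in Nipype today:

```python
# Minimal sketch of a delayed retry with exponential backoff for input files
# that may not yet be visible over a FUSE mount. Illustration only, not
# existing Nipype behaviour.
import os
import time


def wait_for_file(path, retries=5, initial_delay=1.0, backoff=2.0):
    """Poll until `path` exists, increasing the wait after each miss."""
    delay = initial_delay
    for _ in range(retries):
        if os.path.exists(path):
            return True
        time.sleep(delay)
        delay *= backoff
    return os.path.exists(path)


# Example: wait up to roughly half a minute for a previous node's output
# before reading it (hypothetical path).
# wait_for_file("/mnt/bucket/workdir/per_image_wf/node_a/output.nii.gz")
```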
@sulantha2006 - thank you for the details. In that case it is interesting that the failure happens at all; I would not have expected it given how things work. Also, at https://github.com/nipy/nipype/blob/master/nipype/pipeline/engine/nodes.py#L552, every node retrieves its inputs to evaluate the hash before computing or moving on, so that would be the place to check.
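Purely to illustrate where such a retry could be layered in, a hypothetical wrapper around the input-retrieval step. `Node._get_inputs` is assumed to be the method behind the linked line; the name, signature, and raised exceptions may differ across Nipype versions, so treat this as a sketch only.

```python
# Hypothetical illustration of layering a retry onto input retrieval.
# `Node._get_inputs` is assumed to correspond to the linked nodes.py code
# and may differ between Nipype versions.
import time

from nipype.pipeline.engine import Node

_original_get_inputs = Node._get_inputs


def _get_inputs_with_retry(self, retries=3, delay=5.0):
    """Retry input retrieval so slow FUSE mounts have time to expose files."""
    for attempt in range(1, retries + 1):
        try:
            return _original_get_inputs(self)
        except OSError:
            if attempt == retries:
                raise
            time.sleep(delay * attempt)  # simple linear backoff between attempts


Node._get_inputs = _get_inputs_with_retry
```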
…he node. Simplified implementation for all inputs (not just File type). Need to be improved for File type later on.
I am not sure the above PR solution would work. It will surely solve the issue of input files not being ready, but I am now getting other failures with the same root cause. See the above exception. Note that, as before, this is completely random.
Summary
Support for FUSE filesystems as the workdir
Actual behavior
The workdir reads and writes many intermediate files with higher than normal latency. This leads to workflows failing because files are not yet present. (FUSE filesystems keep a copy of the file in RAM and write it out once the handle is closed, but the time until the file becomes visible to a new process can vary.)
Expected behavior
Nipype should have a delay/retry mechanism when file reads fail, possibly with exponential backoff.
Platform details:
I tested this on GCP with gcsfuse. There were random failures due to workflow files not being present.
Execution environment
GCP machine with a gcsfuse-mounted workdir. Singularity container with the required tools and nipype installed.