[Feature] Progress tracking and scheduling #1
Open
anibalsolon wants to merge 103 commits into FCP-INDI:develop from radiome-lab:feature/progress-tracking
Conversation
Fixes
Related to FCP-INDI/C-PAC#1363 by @sgiavasis
Description
This PR creates a C-PAC (or virtually anything) scheduler, with an API interface, that allows running containerized images and checking their progress. It has a CLI, so the user can start up and configure the scheduler, and an API interface to communicate mainly with the C-PAC GUI project.
Technical details
The implementation relies heavily on the asyncio API to simplify concurrency. However, it is not a parallel API: everything executes in a single thread (so there are no race conditions), and the tasks running concurrently must not block the asyncio event loop (e.g. a task can await an asyncio.sleep or a non-blocking IO function). All feature implementations must keep this in mind, which is why it is hard to leverage asyncio's full potential in projects that were not designed to work this way (e.g. nipype → pydra). The good thing is that, being single-threaded, it is much easier to handle the different moving parts, while in parallel setups one would have to use different communication mechanisms to avoid race conditions.

That said, we have six main parts in this implementation: Scheduler, Backend, Schedule (and its children), Message, Result, and the API.
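As a minimal sketch of this constraint (the names below are illustrative, not from the PR): two tasks make progress on a single thread as long as each one awaits instead of blocking.

```python
import asyncio

async def poll(name: str) -> None:
    for i in range(3):
        # asyncio.sleep yields control to the event loop; a blocking
        # time.sleep here would freeze every other task.
        await asyncio.sleep(1)
        print(f"{name}: ping {i}")

async def main() -> None:
    # Both polls interleave cooperatively on one thread.
    await asyncio.gather(poll("docker"), poll("slurm"))

asyncio.run(main())
```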
Beginning with the Schedule: a Schedule is an abstraction of the task to be executed. For C-PAC, we have three such tasks (among them the ParticipantPipeline discussed below).
It handles the logical aspects of the abstract task being performed. More technical aspects, such as running containers, are handled by a specialization of the Schedule class: BackendSchedule. BackendSchedules are specific to a Backend, an interface between Python and the software of a specific backend (e.g. the Singularity binaries). The Backend must contain the parameters the BackendSchedules require to properly communicate with the underlying software, such as the Docker image to be used or the SSH connection to access a SLURM cluster.
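A minimal sketch of this layering; the class and attribute names below are illustrative, not the PR's actual API:

```python
from dataclasses import dataclass

class Schedule:
    """Abstract task: what should run, with no technical details."""

class BackendSchedule(Schedule):
    """Specialization that knows how a specific Backend runs things."""

@dataclass
class DockerBackend:
    # Parameters its BackendSchedules need to reach the underlying software.
    image: str = "fcpindi/c-pac:latest"

@dataclass
class SLURMBackend:
    # SSH connection used to reach the cluster.
    ssh_host: str = "me@hpc.example.edu"
```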
The Scheduler is the central part of this implementation, and perhaps the simplest. It stores the Schedules in a tree-like structure, since Schedules can spawn new Schedules, and manages the Messages received from each Schedule, together with the callbacks associated with each Schedule Message type. When a Schedule is scheduled, the Scheduler sends it to its Backend, and the Backend specializes this "naive" Schedule into a BackendSchedule for that Backend.
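In sketch form, the specialization step could look like this (hypothetical names; the real Scheduler/Backend interface may differ):

```python
class Schedule:
    pass

class BackendSchedule(Schedule):
    def __init__(self, base: Schedule):
        self.base = base                    # wraps the "naive" Schedule

class Backend:
    schedule_class = BackendSchedule        # each Backend knows its subclass

    def specialize(self, schedule: Schedule) -> BackendSchedule:
        return self.schedule_class(schedule)

class Scheduler:
    def __init__(self):
        self.children = {}                  # tree: parent -> spawned children

    def schedule(self, schedule, backend, parent=None):
        backend_schedule = backend.specialize(schedule)
        self.children.setdefault(parent, []).append(backend_schedule)
        return backend_schedule
```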
This "backend-aware" Schedule (from the superclass BackendSchedule) will then be executed by the Scheduler. The BackendSchedule behave as a Python generator, so the Scheduler simply iterate this object, and the items of this iteration are defined as Messages. The Messages are data classes (i.e. only store data, not methods), to give information for the Scheduler about the execution. The Messages are relayed to Scheduler watchers, which are external agents that provide a callback function for the Scheduler to call when it receives a specific type of Message. For the Spawn Message, the Scheduler schedules a new Schedule, with the parameters contained in the Spawn message.
The specifics of the Docker and Singularity containers are actually the same: they share the same base code for container execution, differing only in how the container is created.
When the container is created, three tasks run concurrently for the Schedule: container status, log listener, and file listener. The first yields Messages of type Status, as a ping, so we know the container is running fine. The second connects to the websocket server running in the container, to capture which nodes it has run so far, and yields Messages of type Log. The last one watches the output directory for logs and crashes, storing the files as Results in the Schedule and yielding Messages of type Result. Only the ParticipantPipeline has the second and the third; the others have just the container status Messages.
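For illustration, the three listeners could be multiplexed onto one message queue roughly like this (none of these function names are the PR's; the bodies are stand-ins):

```python
import asyncio

async def container_status(queue: asyncio.Queue) -> None:
    for _ in range(3):                       # stand-in for "while running"
        await queue.put(("Status", "container alive"))
        await asyncio.sleep(1)

async def log_listener(queue: asyncio.Queue) -> None:
    # Would connect to the in-container websocket server for node logs.
    await queue.put(("Log", "node anat_preproc finished"))

async def file_listener(queue: asyncio.Queue) -> None:
    # Would scan the output directory for logs/crashes (stored as Results).
    await queue.put(("Result", "crash/crash-xyz.txt"))

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    tasks = asyncio.gather(
        container_status(queue), log_listener(queue), file_listener(queue)
    )
    # Drain messages while the three listeners run concurrently.
    for _ in range(5):
        kind, payload = await queue.get()
        print(kind, payload)
    await tasks

asyncio.run(main())
```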
For SLURM, the Backend starts by connecting to the cluster via SSH. It uses SSH's connection multiplexing feature, so the authentication process happens only once, which is desirable for connections that have a multi-factor authentication layer. After connecting to the cluster, the Backend allocates nodes to execute the Schedules and installs Miniconda & CPACpy. Using the API provided by CPACpy, the local CPACpy communicates with the node's CPACpy (yes, via HTTP & WS) to run the Schedules. It uses the same API to gather the results and keep the local Schedule state updated. By default, the node CPACpy uses the Singularity Backend to run the Schedules.
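For illustration, SSH multiplexing uses OpenSSH's standard ControlMaster options, so only the first connection goes through (possibly multi-factor) authentication; the Python wrapper below is a hypothetical sketch, not the PR's code:

```python
import subprocess

MUX_OPTS = [
    "-o", "ControlMaster=auto",             # share one authenticated connection
    "-o", "ControlPath=~/.ssh/cm-%r@%h:%p", # socket identifying the session
    "-o", "ControlPersist=10m",             # keep the master socket alive
]

def ssh(host: str, command: str) -> str:
    # After the first login, further calls reuse the multiplexed
    # connection without re-authenticating.
    out = subprocess.run(
        ["ssh", *MUX_OPTS, host, command],
        capture_output=True, text=True, check=True,
    )
    return out.stdout
```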
The Results are basically files that would be too large to transfer via WS. The API for gathering Results allows slicing the content using HTTP range headers (Content-Range). This is essential for results that grow during execution (i.e. logs): with a slice, one does not need to request the whole file again, only the part it does not yet have.
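As a sketch of that slicing, a client can request only the missing bytes with a standard HTTP Range request and read the Content-Range response header (the URL and function here are hypothetical):

```python
import requests

def fetch_tail(url: str, have: int) -> bytes:
    # Ask only for the bytes we do not have yet.
    resp = requests.get(url, headers={"Range": f"bytes={have}-"})
    if resp.status_code == 206:             # Partial Content
        # e.g. "Content-Range: bytes 1024-2047/2048"
        print(resp.headers.get("Content-Range"))
    return resp.content

# log = b""
# log += fetch_tail("http://localhost:3333/result/log", len(log))
```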
Tests
Screenshots
I mean, I can show some code, I guess...
Checklist
Developer Certificate of Origin