[IT-2361] Custom Batch compute environment #180

Open. 6 commits to merge into base: dev

zaro0508 (Contributor) commented Apr 10, 2023

Set up a custom Batch compute environment to be used with Nextflow
Tower in a Tower Launch configuration so that we can avoid some of the
limitations of the Tower Forge configuration.

depends on Sage-Bionetworks/aws-infra#392

zaro0508 requested a review from a team as a code owner April 10, 2023 17:44
    MaxValue: 16
    Default: 1
Resources:
  UserManagedPolicy:
Contributor Author

This is set up from the "configure AWS Batch manually" docs; however, I'm not sure how it's used. Is this needed at all? Would the user actually be using nextflow-launch-iam-policy.yaml instead?

Contributor

I'm not sure if they meant "IAM user" or "Tower user" in this context. Either way, Tower users wouldn't be launching workflows directly on AWS Batch themselves. They would be launching them via Tower. I also don't see how this policy gets used later in the documentation, so I'm not sure we need it.

Here's the general flow:

  1. A user submits a workflow using the Tower web UI or API
  2. The Tower ECS service uses its IAM role to assume a project-specific role (the goal of [IT-2360] Setup IAM roles for tower #181), which has permissions to that project's Batch resources and S3 buckets.
  3. The project-specific role submits the Nextflow head job to an on-demand queue (we configure Tower so it knows which Batch queue to use).
  4. The Nextflow head job uses its own role to submit the worker jobs to either the on-demand or spot queue.
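
Steps 2 and 3 of the flow above depend on a trust relationship between the Tower ECS service role and the project-specific role. A minimal CloudFormation sketch of that relationship could look like the fragment below. The logical ID `ProjectLaunchRole`, the parameter `TowerServiceRoleArn`, the policy name, and the action list are assumptions made for illustration; they are not taken from this PR or from #181.

```yaml
# Hypothetical sketch: all names and the action list are assumptions.
Parameters:
  TowerServiceRoleArn:
    Type: String
    Description: ARN of the IAM role used by the Tower ECS service (assumed parameter)
Resources:
  ProjectLaunchRole:
    Type: AWS::IAM::Role
    Properties:
      # Step 2: only the Tower service role may assume the project role.
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Ref TowerServiceRoleArn
            Action: sts:AssumeRole
      Policies:
        # Step 3: the project role submits the Nextflow head job to Batch.
        - PolicyName: submit-head-job
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - batch:SubmitJob
                  - batch:DescribeJobs
                  - batch:TerminateJob
                # Left broad for brevity; a real policy would likely scope
                # SubmitJob to the project's queues and job definitions.
                Resource: '*'
```

Under this layout no Tower user needs direct Batch permissions, which is consistent with the doubt raised above about whether `UserManagedPolicy` is needed.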

            'Fn::Sub': '${AWS::Region}-nextflow-ecs-cluster-EcsLaunchTemplate'
          Version: !ImportValue
            'Fn::Sub': '${AWS::Region}-nextflow-ecs-cluster-EcsLaunchTemplateLatestVersionNumber'
  JobQueueOnDemand:
Contributor Author

These on-demand and spot queues are separate and require users to specify which queue to put their jobs on. I assume NF Tower has a way for users to specify this, using either the NF CLI or the console?

Collaborator

Correct, there is a way to do this within the console and it looks like this:

image
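
For readers without the diff at hand, two separate queues like the ones discussed in this thread might be declared roughly as follows. Only the logical ID `JobQueueOnDemand` appears in the diff context above; `JobQueueSpot`, `ComputeEnvOnDemand`, `ComputeEnvSpot`, and the priority values are hypothetical, and the compute environments themselves are left out of this fragment.

```yaml
# Hypothetical sketch: only JobQueueOnDemand is a name taken from the diff.
Resources:
  JobQueueOnDemand:
    Type: AWS::Batch::JobQueue
    Properties:
      Priority: 1
      State: ENABLED
      ComputeEnvironmentOrder:
        - Order: 1
          ComputeEnvironment: !Ref ComputeEnvOnDemand  # assumed logical ID
  JobQueueSpot:
    Type: AWS::Batch::JobQueue
    Properties:
      Priority: 1
      State: ENABLED
      ComputeEnvironmentOrder:
        - Order: 1
          ComputeEnvironment: !Ref ComputeEnvSpot      # assumed logical ID
```

Because each queue is backed by a single compute environment, jobs never spill from one to the other, so the choice has to be made at submission time. On the Nextflow side that is typically the `process.queue` setting used with the `awsbatch` executor.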

thomasyu888 (Collaborator) commented Apr 10, 2023

Thanks @zaro0508 for this work. Bruno is on vacation for the next 2 weeks, so I'd like to wait for him to be back to fully test this.

In the interim, I think we need a system for triggering the production deployment (for example, GitHub tags or releases) rather than relying on a successful deployment to dev. The reason is that some of these changes have a big impact on the way users interact with Tower, so I'd want to go in and make sure everything works as expected by running test workflows on the dev deployment.
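
One way a release-triggered production deployment could be wired up is a GitHub Actions workflow along these lines. This is only a sketch: the workflow name, the `prod` Sceptre path, and the bare `pip install` step are assumptions, and AWS credential setup is omitted entirely.

```yaml
# Hypothetical sketch of a release-gated deployment; not part of this PR.
name: deploy-prod
on:
  release:
    types: [published]   # runs only when a GitHub release is published
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy production stacks
        run: |
          pip install sceptre
          # 'prod' is an assumed Sceptre stack group path
          sceptre launch --yes prod
```

With a trigger like this, merging to dev would deploy only the dev stacks, leaving time to run test workflows there before someone publishes a release.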

zaro0508 (Contributor Author)

By all means @thomasyu888, do what you think is best. If you need IT help, please enter a Jira issue and let us know.

thomasyu888 (Collaborator)

Thanks @zaro0508! I created a ticket to track: https://sagebionetworks.jira.com/browse/IT-2790

BrunoGrandePhD (Contributor) left a comment

I haven't taken a close look at the resources since this needs to be restructured to be created for each Tower project. Once we have that set up, we can test the deployment.

Contributor

We will need to deploy these resources once per Tower project, similar to what I described in my review for #181. I'll let you determine the best way to lay this out using Sceptre because I know tower-project.j2 is already big enough as it is.

Contributor Author

> We will need to deploy these resources once per Tower project

I have moved the deployment of nextflow-launch.yaml into tower-project.j2.

Set up a Tower launch stack for every project.
We move the Tower launch template to the aws-infra repo [1] and
deploy it as a nested stack.

[1] https://github.com/Sage-Bionetworks/aws-infra
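
As a rough illustration of the nested-stack approach described in this commit message, a per-project template could reference the launch template like this. The logical ID `NextflowLaunchStack`, the parameters `TemplateBucket` and `ProjectName`, and the S3 location are all assumptions; only the file name nextflow-launch.yaml comes from the discussion above.

```yaml
# Hypothetical sketch: every name here except nextflow-launch.yaml is assumed.
Parameters:
  TemplateBucket:
    Type: String
    Description: S3 bucket holding the published aws-infra templates (assumed)
  ProjectName:
    Type: String
    Description: Name of the Tower project this stack belongs to (assumed)
Resources:
  NextflowLaunchStack:
    Type: AWS::CloudFormation::Stack
    Properties:
      # Nested stacks are loaded from an S3 URL rather than from the repo.
      TemplateURL: !Sub 'https://${TemplateBucket}.s3.amazonaws.com/nextflow-launch.yaml'
      Parameters:
        ProjectName: !Ref ProjectName
```

Since tower-project.j2 is rendered once per project, each project would get its own copy of the nested stack and therefore its own Batch resources.
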
BrunoGrandePhD removed their request for review May 25, 2023 20:32
dpulls bot commented Jul 10, 2023

🎉 All dependencies have been resolved!

sonarcloud bot commented Dec 15, 2023

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

zaro0508 requested a review from BWMac October 22, 2024 14:38