Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/simplify autoscaling #213

Open
wants to merge 1 commit into
base: feat/sagemaker-llms
Choose a base branch
from

Conversation

aidanrussell
Copy link

@aidanrussell aidanrussell commented Jan 15, 2025

This started as a quick bugfix but evolved into a larger piece of work to simplify the way the alarms and autoscaling is working, as it was felt to be confusing and hard to debug in its prior implementation.

The original bug is that scale-down based on backlog was not working at all in the prior state, rather scale-down by CPU was being applied (this is why it hadn't been noticed). This broke when Llama was added because that was using ~7% CPU at rest compared to a 5% CPU threshold, and so even with no backlog it was failing to scale down.

In the new implementation, scaling is split between a logic for 0-1 instance and a logic for anything above 1 instances. This means rules do not overlap - rather scale up/down for backlog is for 0-1 only and scale up/down for CPU/GPU is for 1+ only. Furthermore where possible alarms are created that trigger actions both in alarm state and also an opposing action in non-alarm (ok) state - this helps ensure consistency as there is only one rule to edit instead of two (actually in the end this could only be done for the backlog alarm and not the others).

@aidanrussell aidanrussell requested a review from a team as a code owner January 15, 2025 14:53
@aidanrussell aidanrussell changed the base branch from main to feat/sagemaker-llms January 15, 2025 14:54
Copy link

@joehearnshaw-6point6 joehearnshaw-6point6 Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No metrics that are normalized exist for the async endpoint, thus these alarms will not record data and need reverting to prior logic. Specifically, i.e. CPUUtilizationNormalized - this does NOT exist. For any metrics you can or cannot use, please refer to Cloudwatch metrics logs for any deployed sagemaker endpoints. You will note too from the docs that normalized metrics only exist for real time instances. We will need to therefore ensure that we normalize the metrics ourselves based upon instance type configuration.

Note that while the ok_action is completely fine to exist, we have removed the SNS topic subscription for the alarms, thus losing slack webhooks and the ability to add other SNS alerts. Reimplement this (in side-step with sagemaker_deployment/main.tf changes for the alarm and associated SNS topics), i.e. reintroduce sns_topic_name or something similar here and add:

# Local for mapping SNS topic ARNs to Slack webhook URLs
locals {
  sns_to_webhook_mapping = {
    for idx, alarm in local.alarms_with_sns :
    aws_sns_topic.sns_topic[idx].arn => alarm.slack_webhook_url
  }
}

resource "aws_cloudwatch_metric_alarm" "cloudwatch_alarm" {
  count = length(var.alarms)

  alarm_name          = var.alarms[count.index].alarm_name
  alarm_description   = var.alarms[count.index].alarm_description
  metric_name         = var.alarms[count.index].metric_name
  namespace           = var.alarms[count.index].namespace
  comparison_operator = var.alarms[count.index].comparison_operator
  threshold           = var.alarms[count.index].threshold
  evaluation_periods  = var.alarms[count.index].evaluation_periods
  datapoints_to_alarm = var.alarms[count.index].datapoints_to_alarm
  period              = var.alarms[count.index].period
  statistic           = var.alarms[count.index].statistic

  # Define dimensions based on the count index -
  # first alarm will not have a null variantName
  dimensions = count.index == 0 ? {
    EndpointName = aws_sagemaker_endpoint.sagemaker_endpoint.name
    } : {
    EndpointName = aws_sagemaker_endpoint.sagemaker_endpoint.name,
    VariantName  = var.variant_name
  }

  # Conditionally add SNS topics as alarm actions & lazy eval
  alarm_actions = concat(
    var.alarms[count.index].alarm_actions != null ? var.alarms[count.index].alarm_actions : [],
    var.alarms[count.index].sns_topic_name != null ? [
      lookup(
        { for idx, alarm in local.alarms_with_sns :
          alarm.sns_topic_name => aws_sns_topic.sns_topic[idx].arn
        },
        var.alarms[count.index].sns_topic_name,
        null
      )
    ] : []
  )

}

To sagemaker_deployment/main.tf with the necessary changes to reflect the correct alarm actions being implemented, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants