This started as a quick bugfix but grew into a larger piece of work to simplify how the alarms and autoscaling work, since the prior implementation was confusing and hard to debug.
The original bug: scale-down based on backlog was not working at all; instead, scale-down by CPU was being applied (which is why the bug hadn't been noticed). This broke when Llama was added, because it uses ~7% CPU at rest against a 5% CPU threshold, so even with an empty backlog the service never scaled down.
In the new implementation, scaling is split between logic for the 0-1 instance range and logic for anything above 1 instance. The rules no longer overlap: backlog drives scale up/down between 0 and 1 only, while CPU/GPU utilisation drives scale up/down above 1 only. Furthermore, where possible a single alarm triggers an action in the alarm state and the opposing action in the non-alarm (OK) state; this helps ensure consistency, as there is only one rule to edit instead of two (in the end this could only be done for the backlog alarm, not the others).
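To make the split regimes concrete, here is a minimal sketch of the decision logic in plain Python. This is an illustration, not the actual alarm configuration: the function name, thresholds, and signature are all hypothetical, and the real behaviour lives in cloud alarms/policies rather than application code.

```python
def desired_action(instances: int, backlog: int,
                   cpu_pct: float, gpu_pct: float,
                   cpu_high: float = 80.0, gpu_high: float = 80.0,
                   cpu_low: float = 10.0, gpu_low: float = 10.0) -> int:
    """Return +1 (scale up), -1 (scale down), or 0 (no change).

    Hypothetical thresholds; the point is that the two regimes do not
    overlap: backlog governs the 0-1 range, CPU/GPU governs 1+.
    """
    if instances == 0:
        # 0 -> 1: only the backlog can wake the service up.
        return 1 if backlog > 0 else 0
    if instances == 1:
        if cpu_pct > cpu_high or gpu_pct > gpu_high:
            return 1   # enter the 1+ regime
        if backlog == 0:
            # Backlog alarm in OK state: scale to zero. Note CPU is
            # ignored here, so a model idling at ~7% CPU still scales down.
            return -1
        return 0
    # instances >= 2: CPU/GPU utilisation only.
    if cpu_pct > cpu_high or gpu_pct > gpu_high:
        return 1
    if cpu_pct < cpu_low and gpu_pct < gpu_low:
        return -1
    return 0
```

Under this logic the original bug scenario resolves correctly: one instance idling at ~7% CPU with an empty backlog returns -1 (scale to zero), because the 0-1 regime consults the backlog rather than the CPU threshold.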