cloudonaut
Posted March 21, 2023

In recent months, I was reminded once again that EC2 spot capacity is not always available. For years, I have been looking for a safety net for my spot-based Auto Scaling Groups (ASGs): if spot capacity is unavailable, launch on-demand EC2 instances and replace them with spot instances as soon as spot capacity is back. After many proofs of concept, I want to share my approach to the problem.

/images/2023/03/safety-net.jpg

I assume your existing ASG is configured to spread the load across as many availability zones and instance types as possible. I also encourage you to enable Capacity Rebalancing to handle spot interruptions. On top of that, add the following resources to implement the on-demand safety net:

- A fallback ASG to launch on-demand EC2 instances
- Two step scaling policies to scale the fallback ASG up and down
- Two CloudWatch alarms to trigger the scaling policies

Configure existing ASG

Enable your existing ASG to emit the CloudWatch metrics GroupInServiceInstances and GroupDesiredCapacity. In CloudFormation:

    SpotAutoScalingGroup:
      Type: 'AWS::AutoScaling::AutoScalingGroup'
      Properties:
        # [...]
        CapacityRebalance: true
        MaxSize: 10
        MinSize: 2
        MixedInstancesPolicy:
          # [...]
          InstancesDistribution:
            OnDemandAllocationStrategy: prioritized
            OnDemandBaseCapacity: 0
            OnDemandPercentageAboveBaseCapacity: 0
            SpotAllocationStrategy: 'capacity-optimized-prioritized'
        MetricsCollection:
        - Granularity: 1Minute
          Metrics:
          - GroupInServiceInstances
          - GroupDesiredCapacity

Configure additional fallback ASG

Add a new ASG to spin up on-demand capacity. Use the same launch template/configuration as your spot ASG.

    FallbackAutoScalingGroup:
      Type: 'AWS::AutoScaling::AutoScalingGroup'
      Properties:
        # [...]
        MetricsCollection:
        - Granularity: 1Minute
          Metrics:
          - GroupInServiceInstances
          - GroupDesiredCapacity
        MaxSize: 10 # set this to the same value as your spot max size
        MinSize: 0

Create CloudWatch alarms to trigger auto-scaling

The trick is to use the following formula to calculate the number of instances that need to be added to or removed from the fallback ASG:

    result = desired spot - running spot - desired fallback

The following table helps you to understand the formula with some examples:

| example | desired spot | running spot | desired fallback | result |
|---|---|---|---|---|
| all good, spot capacity is available | 4 | 4 | 0 | 0 |
| spot capacity is missing | 4 | 3 | 0 | 1 |
| spot capacity is missing, but fallback capacity is already started | 4 | 3 | 1 | 0 |
| spot capacity is available; fallback capacity can be removed | 4 | 4 | 1 | -1 |

The following logic is needed to work with the result of the formula:

- If result > 0: increase the desired capacity of the fallback ASG by result.
- Else if result < 0: decrease the desired capacity of the fallback ASG by the absolute value of result.
- Else: do nothing.

The logic can be implemented with CloudWatch alarms and step scaling policies. CloudWatch alarms trigger the step scaling policies to scale the fallback ASG up or down. To reduce noise caused by auto-scaling activities in the spot ASG, I configured the alarms to fire only if the formula result is positive (scale up) or negative (scale down) three times in a row.

The following two CloudWatch alarms are mostly identical, except for the alarm action and the ComparisonOperator.

    FallbackScaleUpAlarm:
      Type: 'AWS::CloudWatch::Alarm'
      Properties:
        AlarmActions:
        - !Ref FallbackScaleUp
        ComparisonOperator: GreaterThanThreshold
        EvaluationPeriods: 3 # if for three times in a row...
        Threshold: 0 # ...the formula result is > 0, trigger alarm
        TreatMissingData: notBreaching
        Metrics:
        - Id: running # get the value for running spot
          Label: running
          MetricStat:
            Metric:
              Namespace: 'AWS/AutoScaling'
              MetricName: GroupInServiceInstances
              Dimensions:
              - Name: AutoScalingGroupName
                Value: !Ref SpotAutoScalingGroup
            Period: 60
            Stat: Maximum
          ReturnData: false
        - Id: desired # get the value for desired spot
          Label: desired
          MetricStat:
            Metric:
              Namespace: 'AWS/AutoScaling'
              MetricName: GroupDesiredCapacity
              Dimensions:
              - Name: AutoScalingGroupName
                Value: !Ref SpotAutoScalingGroup
            Period: 60
            Stat: Maximum
          ReturnData: false
        - Id: desiredfallback # get the value for desired fallback
          Label: desiredfallback
          MetricStat:
            Metric:
              Namespace: 'AWS/AutoScaling'
              MetricName: GroupDesiredCapacity
              Dimensions:
              - Name: AutoScalingGroupName
                Value: !Ref FallbackAutoScalingGroup
            Period: 60
            Stat: Maximum
          ReturnData: false
        - Expression: 'desired-running-desiredfallback' # this is the formula presented earlier
          Id: e1
          Label: 'fallback'
          ReturnData: true
    FallbackScaleDownAlarm:
      Type: 'AWS::CloudWatch::Alarm'
      Properties:
        AlarmActions:
        - !Ref FallbackScaleDown
        ComparisonOperator: LessThanThreshold
        EvaluationPeriods: 3 # if for three times in a row...
        Threshold: 0 # ...the formula result is < 0, trigger alarm
        TreatMissingData: notBreaching
        Metrics: # [...] same as in FallbackScaleUpAlarm

In an ideal world, we could use the result of the formula to change the desired capacity directly. Remember, the formula calculates the number of instances that need to be added to (positive values) or removed from (negative values) the fallback ASG. Unfortunately, we must take a slight detour via a step scaling policy:

1. The CloudWatch alarm triggers the step scaling policy with the formula result.
2. The step scaling policy translates the received value into a change in capacity (adjustment)…
3. …and updates the desired count of the ASG.
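To make the formula and the three-in-a-row condition concrete, here is a minimal Python sketch. It is only a local model and does not call AWS; the names `Sample`, `formula`, and `alarm_fires` are hypothetical and not part of any AWS SDK, and the handling of missing data is a simplification of what CloudWatch actually does.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One per-minute reading of the three metrics (hypothetical helper type)."""
    desired_spot: int
    running_spot: int
    desired_fallback: int


def formula(sample: Sample) -> int:
    """desired spot - running spot - desired fallback."""
    return sample.desired_spot - sample.running_spot - sample.desired_fallback


def alarm_fires(samples: Sequence[Sample], scale_up: bool, periods: int = 3) -> bool:
    """Model EvaluationPeriods: the last `periods` results must all be on the
    breaching side of the threshold 0."""
    if len(samples) < periods:
        return False  # simplification: too few data points never trigger the alarm
    recent = [formula(s) for s in samples[-periods:]]
    if scale_up:
        return all(r > 0 for r in recent)
    return all(r < 0 for r in recent)


# The four rows of the example table, in order.
table = [Sample(4, 4, 0), Sample(4, 3, 0), Sample(4, 3, 1), Sample(4, 4, 1)]
print([formula(s) for s in table])  # [0, 1, 0, -1]
```

A single minute with a missing spot instance does not trigger the scale-up alarm in this model; only three consecutive positive results do, which mirrors the intent of EvaluationPeriods: 3.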
You can configure how the step scaling policy transforms the value from CloudWatch into a change in capacity by defining step adjustments. A step is defined by a lower bound, an upper bound, and a change in capacity. I use the following steps to translate from the formula result to a change in desired capacity:

| policy | range | change in desired capacity |
|---|---|---|
| up | 0 <= result < 2 | +1 |
| up | 2 <= result < 3 | +2 |
| up | 3 <= result < 4 | +3 |
| up | 4 <= result < 5 | +4 |
| up | 5 <= result < 10 | +5 |
| up | 10 <= result < 25 | +10 |
| up | 25 <= result < +infinity | +25 |
| down | 0 >= result > -2 | -1 |
| down | -2 >= result > -3 | -2 |
| down | -3 >= result > -4 | -3 |
| down | -4 >= result > -5 | -4 |
| down | -5 >= result > -infinity | -5 |

Because the alarms only fire when the result is strictly above or below 0, a result of exactly 0 never triggers either policy. You can define up to 20 adjustments per step scaling policy.

    FallbackScaleUp:
      Type: 'AWS::AutoScaling::ScalingPolicy'
      Properties:
        AdjustmentType: ChangeInCapacity
        AutoScalingGroupName: !Ref FallbackAutoScalingGroup
        EstimatedInstanceWarmup: 300
        MetricAggregationType: Average
        PolicyType: StepScaling
        StepAdjustments: # the lower bound is inclusive and the upper bound is exclusive
        - MetricIntervalLowerBound: 0
          MetricIntervalUpperBound: 2
          ScalingAdjustment: 1
        - MetricIntervalLowerBound: 2
          MetricIntervalUpperBound: 3
          ScalingAdjustment: 2
        - MetricIntervalLowerBound: 3
          MetricIntervalUpperBound: 4
          ScalingAdjustment: 3
        - MetricIntervalLowerBound: 4
          MetricIntervalUpperBound: 5
          ScalingAdjustment: 4
        - MetricIntervalLowerBound: 5
          MetricIntervalUpperBound: 10
          ScalingAdjustment: 5
        - MetricIntervalLowerBound: 10
          MetricIntervalUpperBound: 25
          ScalingAdjustment: 10
        - MetricIntervalLowerBound: 25
          ScalingAdjustment: 25
    FallbackScaleDown:
      Type: 'AWS::AutoScaling::ScalingPolicy'
      Properties:
        AdjustmentType: ChangeInCapacity
        AutoScalingGroupName: !Ref FallbackAutoScalingGroup
        EstimatedInstanceWarmup: 300
        MetricAggregationType: Average
        PolicyType: StepScaling
        StepAdjustments: # the lower bound is exclusive and the upper bound is inclusive
        - MetricIntervalUpperBound: 0
          MetricIntervalLowerBound: -2
          ScalingAdjustment: -1
        - MetricIntervalUpperBound: -2
          MetricIntervalLowerBound: -3
          ScalingAdjustment: -2
        - MetricIntervalUpperBound: -3
          MetricIntervalLowerBound: -4
          ScalingAdjustment: -3
        - MetricIntervalUpperBound: -4
          MetricIntervalLowerBound: -5
          ScalingAdjustment: -4
        - MetricIntervalUpperBound: -5
          ScalingAdjustment: -5

Summary

The following graph shows the fallback in action:

/images/2023/03/fallback-in-action.png

The red line shows the desired spot, the orange line shows the running spot, and the green line shows the running fallback.

- 9:25 two spot instances are desired and running (desired spot = 2; running spot = 2)
- 9:27 one additional spot instance is requested (desired spot = 3)
- 9:32 spot capacity not available; one fallback instance is running (desired spot = 3; running spot = 2; running fallback = 1)
- 9:35 one additional spot instance is requested (desired spot = 4)
- 9:40 spot capacity not available; two fallback instances are running (desired spot = 4; running spot = 2; running fallback = 2)

As you can see, it takes around 5 minutes for on-demand capacity to replace the missing spot capacity. The delay comes from two sources: the 3 x 1-minute evaluation configured in the CloudWatch alarms, and the time an EC2 instance needs to start before it shows up in the GroupInServiceInstances metric. You could remove up to 2 minutes of delay by adjusting the CloudWatch alarms to wait for only one or two threshold violations before triggering the scaling action.
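As a recap of how the formula result maps to a capacity change, the following Python sketch models the step adjustments of the two scaling policies. It is a local model only and makes no AWS calls; `capacity_change` and the two step lists are hypothetical names, with the bounds copied from the FallbackScaleUp and FallbackScaleDown policies.

```python
import math

# (lower bound, upper bound, adjustment), copied from the two step scaling policies.
SCALE_UP_STEPS = [
    (0, 2, 1), (2, 3, 2), (3, 4, 3), (4, 5, 4),
    (5, 10, 5), (10, 25, 10), (25, math.inf, 25),
]
SCALE_DOWN_STEPS = [
    (-2, 0, -1), (-3, -2, -2), (-4, -3, -3), (-5, -4, -4), (-math.inf, -5, -5),
]


def capacity_change(result: float) -> int:
    """Return the change in desired fallback capacity for a formula result."""
    if result > 0:
        # Scale up: lower bound inclusive, upper bound exclusive.
        for lower, upper, adjustment in SCALE_UP_STEPS:
            if lower <= result < upper:
                return adjustment
    elif result < 0:
        # Scale down: lower bound exclusive, upper bound inclusive.
        for lower, upper, adjustment in SCALE_DOWN_STEPS:
            if lower < result <= upper:
                return adjustment
    return 0  # result == 0: neither alarm fires, nothing changes


# Replaying the two shortfalls from the graph:
print(capacity_change(3 - 2 - 0))  # 9:32, desired fallback still 0 -> 1
print(capacity_change(4 - 2 - 1))  # 9:40, desired fallback already 1 -> 1
```

Note that larger results are grouped into coarser steps: a result of 7 falls into the 5 <= result < 10 step, so that single adjustment adds 5 instances rather than 7.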