Issue with AWS ECS Auto-Scaling and Binpack Task Placement Strategy: Tasks Not Shifting Back After Scale-In

0

In AWS ECS, I use auto-scaling with the binpack task placement strategy. When the service scales out, new instances are attached to the ECS cluster and tasks are placed on them. After a scale-in event, however, some tasks remain spread across those instances instead of shifting back onto fewer instances as I expected. How can I resolve this issue?

2 Answers
1
Accepted Answer

Hi Tharunkumar,

Please try the solution below; I hope it helps resolve your issue.

Implement an ECS Task Rebalancer:

  1. Create a Lambda Function: This function will check the task placement and stop tasks that need to be redistributed.

  2. Invoke Lambda Function Periodically: Use CloudWatch Events to trigger the Lambda function at regular intervals.

  3. CloudFormation Template: Use a CloudFormation template to create the Lambda function and set up the CloudWatch Event rule.

Lambda Function (Python)

This function lists tasks in your ECS cluster, groups them by instance, and stops tasks from under-utilized instances:


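import boto3

ecs_client = boto3.client('ecs')

def lambda_handler(event, context):
    cluster_name = 'your-cluster-name'
    service_name = 'your-service-name'

    # List all tasks of the service (list_tasks returns at most 100 ARNs per call, so paginate)
    task_arns = []
    paginator = ecs_client.get_paginator('list_tasks')
    for page in paginator.paginate(cluster=cluster_name, serviceName=service_name):
        task_arns.extend(page['taskArns'])

    # describe_tasks rejects an empty task list, so return early
    if not task_arns:
        return {'statusCode': 200, 'body': 'No running tasks'}

    # Describe tasks in batches of 100 (the describe_tasks limit) and group them by instance
    tasks_by_instance = {}
    for i in range(0, len(task_arns), 100):
        batch = ecs_client.describe_tasks(cluster=cluster_name, tasks=task_arns[i:i + 100])['tasks']
        for task in batch:
            instance_id = task.get('containerInstanceArn')  # Not present for Fargate tasks
            if instance_id:
                tasks_by_instance.setdefault(instance_id, []).append(task['taskArn'])

    # Example logic: stop tasks on under-utilized instances so the service
    # scheduler replaces them according to the binpack strategy
    stopped = 0
    if len(tasks_by_instance) > 1:  # With a single instance there is nothing to consolidate
        for instance_id, arns in tasks_by_instance.items():
            if len(arns) == 1:  # Adjust this threshold based on your binpack strategy
                ecs_client.stop_task(cluster=cluster_name, task=arns[0], reason='Rebalancing for binpack')
                stopped += 1

    return {
        'statusCode': 200,
        'body': f'Rebalanced tasks: stopped {stopped}'
    }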
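
The example logic above stops every task that matches the threshold in the same run, which could briefly take the service below its desired count. A more conservative selection step might look like the sketch below. The names `select_tasks_to_stop`, `threshold`, and `max_stops` are hypothetical and not part of any AWS API; the function only decides which task ARNs to pass to `stop_task`, and it assumes the same `tasks_by_instance` dictionary built in the Lambda function.

```python
from typing import Dict, List


def select_tasks_to_stop(
    tasks_by_instance: Dict[str, List[str]],
    threshold: int = 1,
    max_stops: int = 1,
) -> List[str]:
    """Pick task ARNs to stop from the least-loaded container instances."""
    if len(tasks_by_instance) < 2:
        return []  # A single instance has nothing to consolidate onto

    # Visit the emptiest instances first; the ARN is a tie-breaker so runs are deterministic
    ordered = sorted(tasks_by_instance.items(), key=lambda item: (len(item[1]), item[0]))

    # Never touch the busiest instance, so at least one instance keeps its tasks
    candidates = ordered[:-1]

    selected: List[str] = []
    for _instance_arn, task_arns in candidates:
        if len(task_arns) > threshold:
            continue  # Instance is busy enough; leave it alone
        for task_arn in task_arns:
            if len(selected) >= max_stops:
                return selected  # Cap the number of tasks stopped per invocation
            selected.append(task_arn)
    return selected
```

With `max_stops=1` and a 5-minute schedule, at most one task is moved per run, which trades consolidation speed for availability. Both values are assumptions to tune for your service.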

CloudFormation Template

This template sets up the Lambda function and the CloudWatch Event rule to trigger it periodically:


AWSTemplateFormatVersion: '2010-09-09'
Resources:
  MyLambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: LambdaExecutionPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                  - ecs:ListTasks
                  - ecs:DescribeTasks
                  - ecs:StopTask
                Resource: '*'

  MyLambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: MyRebalanceFunction
      Handler: index.lambda_handler
      Role: !GetAtt MyLambdaExecutionRole.Arn
      Code:
        ZipFile: |
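          import boto3

          ecs_client = boto3.client('ecs')

          def lambda_handler(event, context):
              cluster_name = 'your-cluster-name'
              service_name = 'your-service-name'

              # List all tasks of the service (list_tasks returns at most 100 ARNs per call, so paginate)
              task_arns = []
              paginator = ecs_client.get_paginator('list_tasks')
              for page in paginator.paginate(cluster=cluster_name, serviceName=service_name):
                  task_arns.extend(page['taskArns'])

              # describe_tasks rejects an empty task list, so return early
              if not task_arns:
                  return {'statusCode': 200, 'body': 'No running tasks'}

              # Describe tasks in batches of 100 (the describe_tasks limit) and group them by instance
              tasks_by_instance = {}
              for i in range(0, len(task_arns), 100):
                  batch = ecs_client.describe_tasks(cluster=cluster_name, tasks=task_arns[i:i + 100])['tasks']
                  for task in batch:
                      instance_id = task.get('containerInstanceArn')  # Not present for Fargate tasks
                      if instance_id:
                          tasks_by_instance.setdefault(instance_id, []).append(task['taskArn'])

              # Example logic: stop tasks on under-utilized instances so the service
              # scheduler replaces them according to the binpack strategy
              stopped = 0
              if len(tasks_by_instance) > 1:  # With a single instance there is nothing to consolidate
                  for instance_id, arns in tasks_by_instance.items():
                      if len(arns) == 1:  # Adjust this threshold based on your binpack strategy
                          ecs_client.stop_task(cluster=cluster_name, task=arns[0], reason='Rebalancing for binpack')
                          stopped += 1

              return {
                  'statusCode': 200,
                  'body': f'Rebalanced tasks: stopped {stopped}'
              }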
      Runtime: python3.12
      Timeout: 60  # The default of 3 seconds may be too short for several ECS API calls

  CloudWatchEventRule:
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: rate(5 minutes)
      Targets:
        - Arn: !GetAtt MyLambdaFunction.Arn
          Id: "TargetFunctionV1"

  LambdaInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref MyLambdaFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt CloudWatchEventRule.Arn

The following AWS documentation covers the services involved:

1. Lambda

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-function.html

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-permission.html

2. AWS IAM

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-iam-role.html

3. Amazon CloudWatch Events

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-events-rule.html

4. Amazon ECS

https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ListTasks.html

https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_DescribeTasks.html

https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_StopTask.html

EXPERT
answered a month ago
EXPERT
reviewed a month ago
0

Are you using a Capacity Provider for the ASG? If so, do you have the Managed Termination Protection feature enabled? It prevents instances from being scaled in as long as any replica tasks are running on them.

If you want tasks to be killed and replaced on other instances so that they binpack better, disable Managed Termination Protection, enable Managed Draining instead, and set the target value to 100.
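
As a rough illustration of those settings, the sketch below builds the arguments for the ECS `UpdateCapacityProvider` API as exposed by boto3's `update_capacity_provider`. It assumes that the "target value" refers to the managed scaling `targetCapacity` and that managed scaling is enabled. The helper names `build_capacity_provider_update` and `apply_update`, and the capacity provider name in the usage note, are placeholders of my own.

```python
from typing import Any, Dict


def build_capacity_provider_update(name: str, target_capacity: int = 100) -> Dict[str, Any]:
    """Build keyword arguments for the ECS update_capacity_provider call."""
    if not 1 <= target_capacity <= 100:
        raise ValueError("target_capacity must be between 1 and 100")
    return {
        "name": name,
        "autoScalingGroupProvider": {
            # Assumption: "target value" means the managed scaling target capacity
            "managedScaling": {"status": "ENABLED", "targetCapacity": target_capacity},
            # Allow the ASG to terminate instances that still run tasks
            "managedTerminationProtection": "DISABLED",
            # Drain tasks from an instance before it is terminated
            "managedDraining": "ENABLED",
        },
    }


def apply_update(ecs_client: Any, name: str) -> Dict[str, Any]:
    """Send the update using an ECS client such as boto3.client('ecs')."""
    return ecs_client.update_capacity_provider(**build_capacity_provider_update(name))
```

A call could look like `apply_update(boto3.client("ecs"), "your-capacity-provider-name")`. Check the settings against your own capacity provider before applying them, since they change how instances are terminated.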

AWS
EXPERT
answered a month ago
  • I did this, but it is not working.