Issue with AWS ECS Auto-Scaling and Binpack Task Placement Strategy: Tasks Not Shifting Back After Scale-In

In Amazon ECS, I use auto scaling with the binpack task placement strategy. When the service scales out, new instances join the cluster and tasks are placed on them. After a scale-in event, however, some tasks remain spread across several instances instead of being consolidated onto fewer instances as expected. How can I resolve this issue?

Tharunkumar

asked a month ago, 307 views

2 Answers

Accepted Answer

Hi Tharunkumar,

Please try the solution below; I hope it helps resolve your issue.

Implement an ECS Task Rebalancer:

Create a Lambda Function: This function will check the task placement and stop tasks that need to be redistributed.
Invoke Lambda Function Periodically: Use CloudWatch Events to trigger the Lambda function at regular intervals.
CloudFormation Template: Use a CloudFormation template to create the Lambda function and set up the CloudWatch Event rule.

Lambda Function (Python)

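ECS applies the placement strategy only when a task is launched; it does not move tasks that are already running. This function lists the tasks in your ECS service, groups them by container instance, and stops tasks on under-utilized instances so that the service scheduler relaunches them according to the binpack strategy: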


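import boto3

ecs_client = boto3.client('ecs')

def lambda_handler(event, context):
    cluster_name = 'your-cluster-name'
    service_name = 'your-service-name'

    # List every task in the service (list_tasks returns at most 100 ARNs per page)
    task_arns = []
    paginator = ecs_client.get_paginator('list_tasks')
    for page in paginator.paginate(cluster=cluster_name, serviceName=service_name):
        task_arns.extend(page['taskArns'])

    # Describe tasks in batches of 100 (the describe_tasks limit);
    # with no tasks, no call is made, since describe_tasks rejects an empty list
    tasks_details = []
    for start in range(0, len(task_arns), 100):
        batch = task_arns[start:start + 100]
        tasks_details.extend(ecs_client.describe_tasks(cluster=cluster_name, tasks=batch)['tasks'])

    # Group tasks by container instance (Fargate tasks have none and are skipped)
    tasks_by_instance = {}
    for task in tasks_details:
        instance_arn = task.get('containerInstanceArn')
        if instance_arn:
            tasks_by_instance.setdefault(instance_arn, []).append(task['taskArn'])

    # Example logic: stop one task per run from an under-utilized instance so the
    # service scheduler relaunches it under the binpack strategy
    stopped = 0
    if len(tasks_by_instance) > 1:  # Nothing to consolidate on a single instance
        for instance_arn, arns in tasks_by_instance.items():
            if len(arns) == 1:  # Adjust this threshold based on your binpack strategy
                ecs_client.stop_task(cluster=cluster_name, task=arns[0],
                                     reason='Rebalancing for binpack placement')
                stopped += 1
                break  # One task per run avoids stopping the whole service at once

    return {
        'statusCode': 200,
        'body': f'Rebalanced tasks: stopped {stopped}'
    }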
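The "adjust this threshold" step is the part most likely to need tuning, so it may help to separate the selection logic from the AWS calls and test it on its own. The following is only a sketch: `select_tasks_to_stop` and its `max_tasks_to_move` parameter are hypothetical names, not part of boto3 or ECS, and the helper assumes the busiest instance is the one you want to consolidate onto. It does not check whether that instance actually has spare CPU or memory.

```python
from typing import Dict, List


def select_tasks_to_stop(tasks_by_instance: Dict[str, List[str]],
                         max_tasks_to_move: int = 1) -> List[str]:
    """Pick task ARNs to stop, starting with the least-loaded instances.

    Hypothetical helper: tasks_by_instance maps a container instance ARN
    to the task ARNs running on it, as built in the Lambda function.
    """
    if len(tasks_by_instance) < 2:
        return []  # A single instance leaves nothing to consolidate

    # Visit instances from fewest tasks to most; ties are broken by ARN
    ordered = sorted(tasks_by_instance.items(),
                     key=lambda item: (len(item[1]), item[0]))

    selected: List[str] = []
    # Skip the busiest instance: it is treated as the consolidation target
    for _instance_arn, arns in ordered[:-1]:
        for arn in arns:
            if len(selected) >= max_tasks_to_move:
                return selected
            selected.append(arn)
    return selected


if __name__ == "__main__":
    placement = {
        "inst-a": ["t1", "t2", "t3"],
        "inst-b": ["t4"],
        "inst-c": ["t5", "t6"],
    }
    print(select_tasks_to_stop(placement, max_tasks_to_move=2))
```

Keeping `max_tasks_to_move` small limits how many tasks are interrupted per scheduled run, at the cost of slower consolidation.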

CloudFormation Template

This template sets up the Lambda function and the CloudWatch Event rule to trigger it periodically:


AWSTemplateFormatVersion: '2010-09-09'
Resources:
  MyLambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: LambdaExecutionPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                  - ecs:ListTasks
                  - ecs:DescribeTasks
                  - ecs:StopTask
                Resource: '*'

  MyLambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: MyRebalanceFunction
      Handler: index.lambda_handler
      Role: !GetAtt MyLambdaExecutionRole.Arn
      Code:
        ZipFile: |
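          import boto3

          ecs_client = boto3.client('ecs')

          def lambda_handler(event, context):
              cluster_name = 'your-cluster-name'
              service_name = 'your-service-name'

              # List every task in the service (list_tasks returns at most 100 ARNs per page)
              task_arns = []
              paginator = ecs_client.get_paginator('list_tasks')
              for page in paginator.paginate(cluster=cluster_name, serviceName=service_name):
                  task_arns.extend(page['taskArns'])

              # Describe tasks in batches of 100 (the describe_tasks limit);
              # with no tasks, no call is made, since describe_tasks rejects an empty list
              tasks_details = []
              for start in range(0, len(task_arns), 100):
                  batch = task_arns[start:start + 100]
                  tasks_details.extend(ecs_client.describe_tasks(cluster=cluster_name, tasks=batch)['tasks'])

              # Group tasks by container instance (Fargate tasks have none and are skipped)
              tasks_by_instance = {}
              for task in tasks_details:
                  instance_arn = task.get('containerInstanceArn')
                  if instance_arn:
                      tasks_by_instance.setdefault(instance_arn, []).append(task['taskArn'])

              # Example logic: stop one task per run from an under-utilized instance so the
              # service scheduler relaunches it under the binpack strategy
              stopped = 0
              if len(tasks_by_instance) > 1:  # Nothing to consolidate on a single instance
                  for instance_arn, arns in tasks_by_instance.items():
                      if len(arns) == 1:  # Adjust this threshold based on your binpack strategy
                          ecs_client.stop_task(cluster=cluster_name, task=arns[0],
                                               reason='Rebalancing for binpack placement')
                          stopped += 1
                          break  # One task per run avoids stopping the whole service at once

              return {
                  'statusCode': 200,
                  'body': f'Rebalanced tasks: stopped {stopped}'
              }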
      Runtime: python3.12
      Timeout: 60

  CloudWatchEventRule:
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: rate(5 minutes)
      Targets:
        - Arn: !GetAtt MyLambdaFunction.Arn
          Id: "TargetFunctionV1"

  LambdaInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref MyLambdaFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt CloudWatchEventRule.Arn

The following AWS documentation covers the services involved:

1. Lambda

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-function.html

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-permission.html

2. AWS IAM

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-iam-role.html

3. Amazon CloudWatch Events

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-events-rule.html

4. Amazon ECS

https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ListTasks.html

https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_DescribeTasks.html

https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_StopTask.html

EXPERT

Pandurangaswamy

answered a month ago

EXPERT

Thanniru Anil Kumar

reviewed a month ago

Are you using a Capacity Provider for the ASG? If so, do you have the Managed Termination Protection feature enabled? It prevents instances from being scaled in as long as any replica tasks are running on them.

If you want tasks to be stopped and replaced on other instances so that they binpack better, disable Managed Termination Protection, enable Managed Draining instead, and set the target capacity value to 100.
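As a rough illustration, these settings could be applied with boto3's `update_capacity_provider` call. This is a sketch under assumptions: it assumes the target of 100 refers to the capacity provider's managed scaling `targetCapacity`, the capacity provider name is a placeholder, and `build_capacity_provider_update` and `apply_update` are hypothetical helper names. The client is passed in so the request can be inspected before anything is sent to AWS.

```python
from typing import Any, Dict


def build_capacity_provider_update(name: str,
                                   target_capacity: int = 100) -> Dict[str, Any]:
    """Build keyword arguments for ecs.update_capacity_provider (hypothetical helper)."""
    if not 1 <= target_capacity <= 100:
        raise ValueError("target_capacity must be between 1 and 100")
    return {
        "name": name,  # Placeholder: your capacity provider's name
        "autoScalingGroupProvider": {
            "managedScaling": {
                "status": "ENABLED",
                "targetCapacity": target_capacity,
            },
            # Allow the ASG to terminate instances that still run tasks...
            "managedTerminationProtection": "DISABLED",
            # ...and let ECS drain those instances before they terminate
            "managedDraining": "ENABLED",
        },
    }


def apply_update(ecs_client: Any, name: str,
                 target_capacity: int = 100) -> Dict[str, Any]:
    """Send the update using an ECS client such as boto3.client('ecs')."""
    request = build_capacity_provider_update(name, target_capacity)
    return ecs_client.update_capacity_provider(**request)
```

A call might look like `apply_update(boto3.client("ecs"), "your-capacity-provider-name")`. If the Auto Scaling group itself still has scale-in protection set on existing instances, that would need to be removed separately.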

EXPERT

Shahad_C

answered a month ago

Tharunkumar
a month ago
I did this, but it is not working.
