Pods in our EKS cluster cannot initialize because aws-node keeps restarting, possibly pointing at CNI issues


We are running an EKS cluster (Kubernetes v1.27) with 4 nodes. One of the things we use the cluster for is hosting GitLab runners, which spawn new pods for new pipelines. These are our most actively created and destroyed pods, while our other pods remain running steadily. What we started to see is that our GitLab jobs time out after 10 minutes (our timeout setting) because a new pod cannot be scheduled and initialized on a node. After digging further into the issue, we noticed that the aws-node pods are having issues and restarting, and that during such a restart no new GitLab runner pods can be assigned to a node. We also see that the aws-node container inside the aws-node pod is the one restarting; it currently runs the image amazon-k8s-cni:v1.18.0-eksbuild.1. Along with the aws-node pod, we see ebs-csi-node restarting as well, and a container inside it, node-driver-registrar (csi-node-driver-registrar:v2.10.0-eks-1-29-7), goes into a CrashLoopBackOff state.
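For reference, this is roughly how we confirm the restart counts and image versions (label selectors may differ depending on how the add-ons were installed):

    kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
    kubectl get pods -n kube-system -l app=ebs-csi-node -o wide
    kubectl get daemonset aws-node -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'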

Some posts suggest that upgrading to the CNI version we are running caused issues like this and that downgrading solved it. I also see here (https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html) that running v1.18.2-eksbuild.1 is suggested. The issue comes and goes, and possibly manifests itself when there is higher demand for creating new pods, but it is not consistent.
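If we decide to move to the suggested version, this is roughly what we would run (the cluster name is a placeholder, and this assumes the VPC CNI is managed as an EKS add-on rather than self-managed manifests):

    aws eks describe-addon --cluster-name my-cluster --addon-name vpc-cni --query 'addon.addonVersion'
    aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni --addon-version v1.18.2-eksbuild.1 --resolve-conflicts PRESERVE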

We are not running more pods than the maximum (110 per node), and we have the following settings for our VPC CNI: ENABLE_PREFIX_DELEGATION = "true" and WARM_PREFIX_TARGET = "1".
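For completeness, these values are set (and can be verified) on the aws-node daemonset, roughly like this; if the CNI is managed as an EKS add-on, the same values can instead be passed through the add-on configuration:

    kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true WARM_PREFIX_TARGET=1
    kubectl get daemonset aws-node -n kube-system -o jsonpath='{.spec.template.spec.containers[0].env}'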

IlyaK
asked 2 months ago · 712 views
1 Answer

Hi,

Sharing some basic pointers:

  1. For the aws-node pod that you say is restarting, does the RESTARTS counter also increment when you check with kubectl get pod <name> -n kube-system? You can also check the pod events with kubectl describe pod <name> -n kube-system for clues. Are probes failing? Are there autoscaling events? (See the command sketch after this list.)

  2. Try the same for the ebs-csi pods, though the reason for that pod being in a CrashLoopBackOff state may be different. Since you mentioned that the issue is more visible during high load, maybe some API rate-limit restrictions are kicking in?

  3. kubectl get events -n kube-system would give further insight into cluster events from the last 60 minutes.
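A rough sequence for the checks above (pod names are placeholders; the previous container logs often point at the root cause, e.g. IPAM or API throttling errors in the aws-node logs):

    kubectl get pods -n kube-system -o wide | grep -E 'aws-node|ebs-csi-node'
    kubectl describe pod <aws-node-pod> -n kube-system
    kubectl logs <aws-node-pod> -n kube-system -c aws-node --previous
    kubectl get events -n kube-system --sort-by=.lastTimestamp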

--Syd

Syd
answered a month ago