Pods in our EKS cluster cannot initialize because aws-node keeps restarting, possibly pointing at CNI issues


We are running an EKS cluster (Kubernetes v1.27) with 4 nodes. One of the things we use the cluster for is hosting GitLab runners, which spawn new pods for new pipelines. These are our most actively created and destroyed pods, while our other pods remain running steadily. What we started to see is that our GitLab jobs time out after 10 minutes (our timeout setting) because a new pod cannot be scheduled and initialized on a node. After digging further into the issue, we noticed that the aws-node pods are having issues and restarting, and that during such a restart no new GitLab runner pods can be assigned to a node. We also see that the aws-node container inside the aws-node pod is the one restarting; it currently runs the image amazon-k8s-cni:v1.18.0-eksbuild.1. Along with the aws-node pod, we see ebs-csi-node restarting as well, and a container inside it, node-driver-registrar (csi-node-driver-registrar:v2.10.0-eks-1-29-7), goes into a CrashLoopBackOff state.
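For reference, this is roughly how we confirm the restart counts and image versions (label selectors may differ depending on how the add-ons were installed):

    kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
    kubectl get pods -n kube-system -l app=ebs-csi-node -o wide
    kubectl get daemonset aws-node -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'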

Some posts suggest that upgrading to the CNI version we are running caused issues like this and that downgrading solved it. I also see here (https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html) that running v1.18.2-eksbuild.1 is suggested. The issue comes and goes, and possibly manifests itself when there is higher demand for creating new pods, but it is not consistent.
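If we decide to move to the suggested version, this is roughly what we would run (the cluster name is a placeholder, and this assumes the VPC CNI is managed as an EKS add-on rather than self-managed manifests):

    aws eks describe-addon --cluster-name my-cluster --addon-name vpc-cni --query 'addon.addonVersion'
    aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni --addon-version v1.18.2-eksbuild.1 --resolve-conflicts PRESERVE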

We are not running more pods than the maximum (110 per node), and we have the following settings for our VPC CNI: ENABLE_PREFIX_DELEGATION = "true" and WARM_PREFIX_TARGET = "1".
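For completeness, these values are set (and can be verified) on the aws-node daemonset, roughly like this; if the CNI is managed as an EKS add-on, the same values can instead be passed through the add-on configuration:

    kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true WARM_PREFIX_TARGET=1
    kubectl get daemonset aws-node -n kube-system -o jsonpath='{.spec.template.spec.containers[0].env}'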

IlyaK
asked 2 months ago · 712 views
1 Answer

Hi,

Sharing some basic pointers:

  1. For the aws-node pod that you say is restarting, does the RESTARTS counter also increment when you check with kubectl get pod <name> -n kube-system? You can also check the pod events with kubectl describe pod <name> -n kube-system for clues. Are probes failing? Are there autoscaling events? (See the command sketch after this list.)

  2. Try the same for the ebs-csi pods, though the reason for that pod being in a CrashLoopBackOff state may be different. Since you mentioned that the issue is more visible during high load, maybe some API rate-limit restrictions are kicking in?

  3. kubectl get events -n kube-system would give further insight into cluster events from the last 60 minutes.
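A rough sequence for the checks above (pod names are placeholders; the previous container logs often point at the root cause, e.g. IPAM or API throttling errors in the aws-node logs):

    kubectl get pods -n kube-system -o wide | grep -E 'aws-node|ebs-csi-node'
    kubectl describe pod <aws-node-pod> -n kube-system
    kubectl logs <aws-node-pod> -n kube-system -c aws-node --previous
    kubectl get events -n kube-system --sort-by=.lastTimestamp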

--Syd

Syd
answered a month ago