Cilium stuck in CrashLoopBackOff after restarting AWS instances

I completed section 3 labs without a problem following the provided material. I used AWS to spin up the virtual machines.

I stopped the AWS EC2 instances as they were expensive to run when I wasn't using them for labs. However, once I came back to work on section 4 labs, I discovered in lab 4.1 that my cluster was failing the upgrade health check.

On investigating this, I saw the Cilium-related pods on my worker node constantly in a CrashLoopBackOff state, with the restart count climbing. Over time, even kube-proxy on the worker node entered the CrashLoopBackOff state.

I've tried to fix this by removing the worker node from the cluster, removing all the Cilium-related directories and state on it, and then re-adding the worker node to the cluster (roughly the sequence below). However, the problem persists.
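
For reference, the removal and re-add was roughly this sequence (the Cilium paths are my best guess at the relevant directories):

    # On the cp: drain and remove the worker node
    kubectl drain ip-172-31-15-167 --ignore-daemonsets --delete-emptydir-data
    kubectl delete node ip-172-31-15-167

    # On the worker: reset kubeadm state and clear leftover CNI/Cilium data
    sudo kubeadm reset -f
    sudo rm -rf /etc/cni/net.d /var/run/cilium

    # On the cp: print a fresh join command, then run it on the worker with sudo
    sudo kubeadm token create --print-join-command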

Given I've just started the course, I'm not sure what I should do next to fix this. Has anyone else had this issue?

Comments

  • I've attached some of the output from describing pods and system event logs.

  • chrispokorni Posts: 2,518

    Hi @jchen4520,

    Are your EC2 instances preserving their private IP addresses between reboots?
    What are the sizes of your instances - CPU, RAM, disk?
    Is the SecurityGroup protecting your VMs allowing all ingress (inbound) traffic - all protocols, from all sources, to all destination ports?
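
    If you want to double-check from the command line, something like this shows the private IP and the inbound rules - a sketch, assuming the AWS CLI is configured; fill in your own instance and group IDs:

    # Private IP(s) of an instance - should stay the same across stop/start
    aws ec2 describe-instances --instance-ids <instance-id> \
      --query 'Reservations[].Instances[].PrivateIpAddress'

    # Inbound rules of the SecurityGroup protecting the VMs
    aws ec2 describe-security-groups --group-ids <sg-id> \
      --query 'SecurityGroups[].IpPermissions'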

    Regards,
    -Chris

  • jchen4520 Posts: 5

    Hi @chrispokorni,

    As far as I can tell they are preserving their private IP addresses between reboots, although the first time I tried to troubleshoot (I ran hostname -i on the worker) it returned two IP addresses: the original IP and a 192.XX.XX.XX address. When I ran it again a few minutes later it returned only the original IP, but I wasn't sure if that was expected as part of AWS restarting the instance.
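
    If it helps, I can dump the full interface list next time with something like:

    # List every IPv4 address and the interface it belongs to
    ip -4 addr show

    # hostname -I lists all assigned addresses (unlike -i, which resolves the hostname)
    hostname -I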

    The instances are m7i-flex.large: 2 vCPU, 8 GiB memory. I believe the storage is 20 GiB. They're running ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-20250821.

    I also checked the security group rules: inbound traffic allows all ports, all protocols, from source 0.0.0.0/0, and outbound traffic allows all ports, all protocols, to destination 0.0.0.0/0.

  • chrispokorni Posts: 2,518

    Hi @jchen4520,

    Your EC2 instances seem to be adequately sized, and the OS and SG are as recommended.
    It is important that the EC2 instances preserve their original private IP addresses, the ones in the 172 range. So if the cp instance's original private IP was 172.x.y.6, it should have the same 172.x.y.6 private IP after a reboot.
    A 192 IP address was assigned to a new interface on each EC2 instance when the Cilium plugin initialized the CNI network. If this IP is missing after a reboot, the CNI plugin is no longer operating as expected.
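
    A quick way to confirm whether the Cilium interface and agent are healthy is shown below - this is a sketch, and the interface and DaemonSet names are the usual Cilium defaults:

    # On each node: Cilium normally creates a cilium_host interface carrying a CNI-assigned IP
    ip addr show cilium_host

    # From the cp: check the agent pods and their reported status
    # (on newer Cilium releases the in-pod CLI may be named cilium-dbg instead of cilium)
    kubectl -n kube-system get pods -l k8s-app=cilium -o wide
    kubectl -n kube-system exec ds/cilium -- cilium status --brief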

    What are the entries of the hosts files on each EC2 instance?
    What are the outputs of the following commands:

    kubectl get nodes -o wide
    kubectl events -A
    kubectl -n kube-system describe pod <pods-in-CrashLoopBackOff>
    

    Also, assuming the name of the etcd pod is etcd-cp:

    kubectl -n kube-system exec etcd-cp -- \
      cat /etc/kubernetes/pki/apiserver.crt | \
      openssl x509 -noout -text | \
      grep -A1 -i alt
    

    Regards,
    -Chris

  • jchen4520 Posts: 5

    Hosts file for control plane:

    172.31.15.6 k8scp
    172.31.15.6 cp
    127.0.0.1 localhost
    
    # The following lines are desirable for IPv6 capable hosts
    ::1 ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    ff02::3 ip6-allhosts
    

    The hosts file for the worker is, line by line, the exact same file.

    1.

    kubectl get nodes -o wide
    NAME               STATUS                     ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
    cp                 Ready,SchedulingDisabled   control-plane   24d   v1.32.1   172.31.15.6     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-aws   containerd://1.7.27
    ip-172-31-15-167   Ready                      <none>          9d    v1.32.1   172.31.15.167   <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-aws   containerd://1.7.27
    

    2 & 3 I will attach as text files since the output is large. For 2, I only included the events from the most recent reboot, as otherwise it would be even longer.

    For etcd-cp, the command errored out (I think because cat is not available in the etcd-cp container), but I ran openssl directly on the cp and got this:

     sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 -i alt
                X509v3 Subject Alternative Name:
                    DNS:cp, DNS:k8scp, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, IP Address:10.96.0.1, IP Address:172.31.15.6
    
    
  • chrispokorni Posts: 2,518

    Hi @jchen4520,

    The first concerning item is the cp node's state. The SchedulingDisabled condition should be cleared first on the cp, to allow the cluster to launch all necessary components on the cp node.
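
    Clearing it is a single command (using the cp node name from your earlier output):

    kubectl uncordon cp
    kubectl get nodes    # cp should now report Ready without SchedulingDisabled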

    Did you successfully drain the cp node, as instructed in lab 4.1 step 7? Did you have a chance to upgrade kubeadm to v1.33.1 per step 9? Did you perhaps reboot the cp node before reaching step 16?

    If the upgrade was not completed and the cp node was never uncordoned, it keeps rebooting into the last recorded state. Even if you are not immediately upgrading the cp node, I recommend uncordoning it (step 16), then watching the cluster's state and the kube-system namespace workloads stabilize before attempting another upgrade. Make sure you complete the upgrade of all components on both nodes - cp and worker - before shutting down or rebooting the VMs. Essentially, ensure your cluster is fully operational before shutting it down.
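
    For reference, the usual cp upgrade flow from the lab looks roughly like this - the exact package pin is an assumption based on the v1.33.1 target mentioned above, so check the lab for the precise version string:

    # Upgrade kubeadm first
    sudo apt-mark unhold kubeadm
    sudo apt-get update && sudo apt-get install -y kubeadm=1.33.1-1.1
    sudo apt-mark hold kubeadm

    # Plan and apply the control plane upgrade
    sudo kubeadm upgrade plan
    sudo kubeadm upgrade apply v1.33.1

    # Upgrade kubelet and kubectl, restart the kubelet, then uncordon
    sudo apt-mark unhold kubelet kubectl
    sudo apt-get install -y kubelet=1.33.1-1.1 kubectl=1.33.1-1.1
    sudo apt-mark hold kubelet kubectl
    sudo systemctl daemon-reload && sudo systemctl restart kubelet
    kubectl uncordon cp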

    The second concern is the disk space warning issued on each EC2 instance. Ensure that each instance has at least 20 GB of disk, and perhaps increase to 30 GB to be on the safe side. Also, make sure both EC2 instances are inside the same SG, not in separate (distinct) SGs.
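
    A quick way to check the available space on each instance:

    df -h /
    sudo du -sh /var/lib/containerd    # container images usually account for most of the usage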

    Regards,
    -Chris

  • jchen4520 Posts: 5

    Hi Chris, thanks for the help. I tried uncordoning the cp, but it didn't seem to help; the CrashLoopBackOff state persisted. I ended up remaking the instances from scratch and had no problems completing Lab 4. Hopefully it doesn't happen again the next time I shut down the instances between labs. Thanks!
