Cilium stuck in CrashLoopBackOff after restarting AWS instances

I completed the section 3 labs without a problem, following the provided material. I used AWS to spin up the virtual machines.
I stopped the EC2 instances, as they were expensive to run when I wasn't using them for labs. However, when I came back to work on the section 4 labs, I discovered in lab 4.1 that my cluster was failing the upgrade health check.
On investigating this, I saw the Cilium-related pods on my worker node constantly in CrashLoopBackOff, with an ever-increasing restart count. Over time, even the kube-proxy pod on the worker node entered CrashLoopBackOff as well.
I've tried to fix this by removing the worker node from the cluster, removing all the Cilium-related directories and state there, and then re-adding the worker node to the cluster (roughly the sequence sketched below). However, the problem persists.
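For reference, the sequence I used was roughly the following (worker node name as it appears in the outputs later in this thread; the <token> and <hash> placeholders come from running kubeadm token create --print-join-command on the cp):
# On the cp: drain and remove the worker
kubectl drain ip-172-31-15-167 --ignore-daemonsets --delete-emptydir-data
kubectl delete node ip-172-31-15-167
# On the worker: reset kubeadm state and clear CNI leftovers
sudo kubeadm reset -f
sudo rm -rf /etc/cni/net.d
# On the worker: rejoin the cluster
sudo kubeadm join k8scp:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>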
Given I've just started the course, I'm not sure what I should do next to fix this. Has anyone else had this issue?
Comments
I've attached some of the output from describing pods and system event logs.
Hi @jchen4520,
Are your EC2 instances preserving their private IP addresses between reboots?
What are the sizes of your instances - CPU, RAM, disk?
Is the SecurityGroup protecting your VMs allowing all ingress (inbound) traffic - all protocols, from all sources, to all port destinations?
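To check the first two items, something like this on each instance should show the relevant details (a minimal sketch; standard iproute2 and coreutils commands):
ip -4 addr show    # private IPs as seen by the OS; compare before and after a stop/start
nproc; free -h; df -h /    # CPU count, memory, and root disk size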
Regards,
-Chris
Hi @chrispokorni,
As far as I can tell they are preserving their private IP addresses between reboots - although I noticed the first time I tried to troubleshoot (ran
hostname -i
on the worker) there were two IP addresses: the original IP and a 192.XX.XX.XX IP. When I ran it again a few minutes later it returned only the original IP, but I wasn't sure if that was expected as part of AWS restarting the instance.
The instances are m7i-flex.large: 2 vCPU, 8 GiB memory. I believe the storage is 20 GiB. They run ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-20250821.
I also checked the security group rules: inbound allows all ports, all protocols, from source 0.0.0.0/0; outbound likewise allows all ports, all protocols, to destination 0.0.0.0/0.
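If it happens again I plan to check which interface owns the extra address with something like the following (cilium_host being the interface name Cilium normally creates - an assumption on my part):
ip -4 addr show              # every interface with its IPv4 addresses
ip -4 addr show cilium_host  # Cilium's host-side interface, if present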
Hi @jchen4520,
Your EC2 instances seem to be adequately sized, and the OS and SG are as recommended.
It is important that the EC2 instances preserve their original private IP addresses, the ones in the 172 range. So if the cp instance's original private IP was 172.x.y.6, after a reboot it should have the same 172.x.y.6 private IP.
A 192 IP address was assigned to a new interface on each EC2 instance when the Cilium plugin initialized the CNI network. If this IP is missing after a reboot, the CNI plugin no longer operates as expected.
What are the entries of the
hosts
files on each EC2 instance?
What are the outputs of the following commands:
kubectl get nodes -o wide
kubectl events -A
kubectl -n kube-system describe pod <pods-in-CrashLoopBackOff>
Also, assuming the name of the etcd pod is
etcd-cp
:
kubectl -n kube-system exec etcd-cp -- \
  cat /etc/kubernetes/pki/apiserver.crt | \
  openssl x509 -noout -text | \
  grep -A1 -i alt
Regards,
-Chris
Hosts file for control plane:
172.31.15.6 k8scp
172.31.15.6 cp
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
The hosts file on the worker is line-for-line identical.
1.
kubectl get nodes -o wide
NAME               STATUS                     ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
cp                 Ready,SchedulingDisabled   control-plane   24d   v1.32.1   172.31.15.6     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-aws   containerd://1.7.27
ip-172-31-15-167   Ready                      <none>          9d    v1.32.1   172.31.15.167   <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-aws   containerd://1.7.27
2 & 3 I will attach as text files, as the output is large. For 2 I only included the events from the most recent reboot, as otherwise it would be even longer.
For etcd-cp, the command errored out (I think because cat is not available in the etcd container?), but I ran it directly on the cp and got this:
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 -i alt
X509v3 Subject Alternative Name:
    DNS:cp, DNS:k8scp, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, IP Address:10.96.0.1, IP Address:172.31.15.6
Hi @jchen4520,
The first concerning item is the cp node's state. The
SchedulingDisabled
condition should be fixed first, to allow the cluster to launch all necessary components on the cp node.
Did you successfully drain the cp node, as instructed in lab 4.1 step 7? Did you have a chance to upgrade kubeadm to v1.33.1 per step 9? Did you perhaps reboot the cp node before reaching step 16?
If the upgrade was not completed and the cp node was never uncordoned, it keeps rebooting into the last recorded state. Even if you are not immediately upgrading the cp node, I recommend uncordoning it (step 16; see the sketch below) and watching the cluster's state and the kube-system namespace workload stabilize before attempting another upgrade of the cluster. Make sure you complete the upgrade of all components on both nodes - cp and worker - before shutting down or rebooting the VMs. Essentially, ensure your cluster is fully operational before shutting it down.
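A minimal sketch of that order, using the node name from your earlier output (the exact kubeadm package revision depends on the repository configured in the lab - check with apt-cache madison):
# Uncordon the cp so workloads can schedule there again (step 16)
kubectl uncordon cp
# Watch kube-system settle
kubectl -n kube-system get pods -w
# Only once everything is Running and Ready, resume the upgrade
apt-cache madison kubeadm
sudo apt-get install -y kubeadm=1.33.1-1.1   # revision assumed; use the one madison reports
sudo kubeadm upgrade plan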
The second concern is the disk space warning issued on each EC2 instance. Ensure that each instance has at least 20 GB of disk (a quick check is sketched below); perhaps increase to 30 GB to be on the safe side. Also, make sure both EC2 instances are inside the same SG, not in separate (distinct) SGs.
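To see how tight the disk actually is, something like this on each instance (the du paths are the usual space consumers on these nodes; adjust as needed):
df -h /                                    # overall root filesystem usage
sudo du -sh /var/lib/containerd /var/log   # container images and logs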
Regards,
-Chris
Hi Chris, thanks for the help. I tried uncordoning the cp, but it didn't seem to help, as the CrashLoopBackOff state persisted. I ended up remaking the instances from scratch and had no problems completing lab 4. Hopefully it doesn't happen again the next time I shut down the instances between labs. Thanks!