Cilium Crash-Loop on worker nodes

Hi,
I recently set up my control-plane node. It works as intended. Everything OK!
I added Cilium as my CNI network plugin. OK!
I then joined my worker nodes, worker1 and worker2. OK!
Cilium initializes on them, and then... Init:CrashLoopBackOff!
If Cilium on my control-plane works, why would my worker nodes fail?
Both workers are a carbon copy of the cp node, taken before I initialized the network.
Feel free to ask questions, or offer machine-breaking solutions. I use snapshots to reset my VMs if anything goes wrong. I am quite stuck and unsure how to troubleshoot this situation. I believe there might be a simple connectivity issue between my workers and the cp.
More information:
Using VirtualBox, I created my own control plane with two worker nodes.
They use NAT for direct internet access and a host-only adapter for inter-node communication. Host-only adapter interfaces:
- cp: 192.168.100.12
- worker1: 192.168.100.21
- worker2: 192.168.100.22
ping cp ↔ worker1 ↔ worker2 OK
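Plain ICMP works, but I have not yet checked the API port itself. From a worker, something along these lines (assuming netcat is installed; 6443 is the kubeadm default) should confirm TCP reachability of the API server over the host-only network:
nc -vz 192.168.100.12 6443
nc -vz k8scp 6443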
All nodes use:
- kubectl, kubeadm, kubelet v1.32.2
- containerd as the runtime
- containerd configured for the systemd cgroup driver: OK
- sandbox image updated from pause:3.8 to pause:3.10: OK (containerd config snippet below, after the /etc/hosts dumps)
- hostname set on each node:
cp:
cat /etc/hosts
127.0.0.1 localhost
192.168.100.12 k8scp cp
cat /etc/hostname
cp

worker1:
cat /etc/hosts
127.0.0.1 localhost
192.168.100.12 k8scp
192.168.100.21 worker1
cat /etc/hostname
worker1

worker2:
cat /etc/hosts
127.0.0.1 localhost
192.168.100.12 k8scp
192.168.100.22 worker2
cat /etc/hostname
worker2
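For completeness, the containerd changes mentioned above (systemd cgroup driver, pause image bump) live in /etc/containerd/config.toml; on containerd 1.7.x the relevant lines look roughly like this (exact section paths can vary between containerd versions):
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "registry.k8s.io/pause:3.10"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
followed by sudo systemctl restart containerd on every node.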
I used config files to init my cluster and join my nodes:
- cp: init-defaults.yaml (I had to save it as .txt to upload it to the forum)
- worker1: init-defaults.yaml
Notice "unsafeSkipCAVerification: true" in the join config. The host-only network cannot be reached from the internet, so I consider it safe.
Comments
Here is my status before joining worker nodes:
kubectl get pods -A
NAMESPACE     NAME                              READY   STATUS    RESTARTS      AGE
kube-system   cilium-envoy-dsjhg                1/1     Running   2 (13m ago)   5d19h
kube-system   cilium-f559g                      1/1     Running   2 (13m ago)   5d19h
kube-system   cilium-operator-f4589588d-w9jpf   1/1     Running   2 (13m ago)   5d19h
kube-system   coredns-668d6bf9bc-kkpgn          1/1     Running   2 (13m ago)   5d21h
kube-system   coredns-668d6bf9bc-nl4mk          1/1     Running   2 (13m ago)   5d21h
kube-system   etcd-cp                           1/1     Running   3 (13m ago)   5d21h
kube-system   kube-apiserver-cp                 1/1     Running   3 (13m ago)   5d21h
kube-system   kube-controller-manager-cp        1/1     Running   3 (13m ago)   5d21h
kube-system   kube-proxy-vr2js                  1/1     Running   3 (13m ago)   5d21h
kube-system   kube-scheduler-cp                 1/1     Running   3 (13m ago)   5d21h

Here is my status after joining worker node 1 (notice the Init:0/6; it will crash-loop or show Init:Error):
kubectl get pods -A
NAMESPACE     NAME                              READY   STATUS     RESTARTS      AGE
kube-system   cilium-envoy-dsjhg                1/1     Running    2 (18m ago)   5d19h
kube-system   cilium-envoy-qt4j2                1/1     Running    0             3m53s
kube-system   cilium-f559g                      1/1     Running    2 (18m ago)   5d19h
kube-system   cilium-operator-f4589588d-w9jpf   1/1     Running    2 (18m ago)   5d19h
kube-system   cilium-vlpq5                      0/1     Init:0/6   2 (60s ago)   3m53s
kube-system   coredns-668d6bf9bc-kkpgn          1/1     Running    2 (18m ago)   5d21h
kube-system   coredns-668d6bf9bc-nl4mk          1/1     Running    2 (18m ago)   5d21h
kube-system   etcd-cp                           1/1     Running    3 (18m ago)   5d21h
kube-system   kube-apiserver-cp                 1/1     Running    3 (18m ago)   5d21h
kube-system   kube-controller-manager-cp        1/1     Running    3 (18m ago)   5d21h
kube-system   kube-proxy-dtvrn                  1/1     Running    0             3m53s
kube-system   kube-proxy-vr2js                  1/1     Running    3 (18m ago)   5d21h
kube-system   kube-scheduler-cp                 1/1     Running    3 (18m ago)   5d21h

UPDATE:
kube-system   cilium-vlpq5   0/1   Init:Error   2 (60s ago)   3m53s
Here is the description of the failing pod!
Notice the event: Warning  BackOff  57s (x14 over 6m58s)  kubelet  Back-off restarting failed container config in pod cilium-vlpq5_kube-system(29d95fb8-38eb-4d33-9091-287827c2e73b)
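To go one step deeper than describe, the failing init container's own logs can be pulled directly (it is the container named config, per the event above):
kubectl -n kube-system logs cilium-vlpq5 -c config
kubectl -n kube-system logs cilium-vlpq5 -c config --previous   # output of the last crashed attempt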
Hi @fasmy,
Is this content related in any way to the LFS253 - Containers Fundamentals course? This is the forum for LFS253...
Let me know if this was not your intent, and also which course it relates to, so I can move this entire thread to the forum it belongs to.
Regards,
-Chris
I am very sorry about that. It is related to LFS258; I think I'll update my shortcuts now!
Hi @fasmy,
Thank you for confirming the forum. I moved this thread to the LFS258 forum.
By taking a quick look at the describe details, I see a few timeout errors.
It is not yet clear what causes them, so we'd need to evaluate the state of your cluster overall. Please provide the following:
- VM OS distribution and release,
- CPU count,
- RAM,
- disk size (whether dynamically allocated or fully pre-allocated),
- how many network interfaces and types (nat, bridge, host, ...),
- IP addresses assigned to each interface,
- type of VM - local hypervisor or cloud,
- how firewalls are set up by the hypervisor, or by the cloud VPC with firewalls/SGs, etc.
In addition, please provide outputs of:
- kubectl get nodes -o wide
- kubectl get pods -A -o wide
- kubectl get svc,ep -A -o wide
Assuming you followed the instructions as provided, that the kubeadm-config.yaml manifest was copied into /root/, and that it was used to initialize the cluster according to the lab guide, what is the value of the podSubnet parameter:
- grep podSubnet /root/kubeadm-config.yaml
Update the path to the cilium-cni.yaml manifest if needed, and provide the value of the cluster-pool-ipv4-cidr parameter, again assuming that the manifest was used to launch the Cilium CNI plugin according to the lab guide:
- grep cluster-pool-ipv4-cidr /home/student/LFS258/SOLUTIONS/s_03/cilium-cni.yaml
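For reference, if the lab manifests are in place, those greps should each return a single line of roughly this shape (the placeholder CIDR is illustrative; what matters is that the two values match each other and do not overlap with your node network 192.168.100.0/24 or the default service range 10.96.0.0/12):
grep podSubnet /root/kubeadm-config.yaml
    podSubnet: <pod-network-cidr>
grep cluster-pool-ipv4-cidr /home/student/LFS258/SOLUTIONS/s_03/cilium-cni.yaml
    cluster-pool-ipv4-cidr: "<pod-network-cidr>"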
Regards,
-Chris
- VM OS distribution and release: Oracle VirtualBox → ubuntu-24.04.1-live-server-amd64.iso
- CPU count: 2
- RAM: 4096 MB
- disk size: 12 GB, Dynamically allocated differencing storage
- network interfaces and types:
VirtualBox config:
- Adapter 1: default NAT (port forwarding: kube-API 6443→6443, SSH 222X→22, where X is {cp=2, worker1=3, worker2=4})
- Adapter 2: host-only adapter (created under VirtualBox Networks) on 192.168.100.1/24, DHCP disabled (CLI equivalent sketched below)
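I created the host-only network in the VirtualBox GUI; the CLI equivalent would be roughly this (vboxnet0 being whatever name VirtualBox assigns; repeat the modifyvm line for each VM):
VBoxManage hostonlyif create
VBoxManage hostonlyif ipconfig vboxnet0 --ip 192.168.100.1 --netmask 255.255.255.0
VBoxManage modifyvm cp --nic2 hostonly --hostonlyadapter2 vboxnet0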
cp, worker1, and worker2 interfaces (before cilium):
- lo (loopback) 127.0.0.1/8
- enp0s3 (NAT) 10.0.2.15/24. Note: cp, worker1, and worker2 all get the same address (see the kubelet note below).
- enp0s8 (Host-Only) 192.168.100.XX
- cp → 12 (new cp; previously 11)
- worker1 → 21
- worker2 → 22
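Since every VM gets the same 10.0.2.15 on the NAT adapter, I double-checked that the kubelet advertises the host-only address. If it ever picked the NAT address instead, my understanding is that it could be pinned with something like this (file path as used by the Ubuntu kubeadm packages; worker1 shown):
# /etc/default/kubelet
KUBELET_EXTRA_ARGS=--node-ip=192.168.100.21
followed by sudo systemctl restart kubelet.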
cp interfaces (after cilium):
- Please find the ip a dump in the above-provided file cp-ip-tables.txt
- Note: Only cp has new interfaces. No changes on the worker nodes.
- Cilium uses IPv6! I didn't configure my network to use IPv6. The issue most likely lies around here. I might have to rebuild my VMs (from snapshot), but before starting over I would appreciate confirmation that I am not misguided (a quick check is sketched below).
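Before rebuilding everything, I suppose I can at least check whether Cilium actually has IPv6 enabled, for example (assuming the default cilium-config ConfigMap name):
kubectl -n kube-system get configmap cilium-config -o yaml | grep -i ipv6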
IP addresses assigned to each interface
I believe I had issues understanding the intricacies of Kubernetes networking. As I understand it, there are three address ranges that we must set up and keep from colliding, but I couldn't name them all. I know I used 192.168.100.X to connect my nodes to the kube-API server, as seen in init-defaults.txt (file above), but I do wonder where I could configure the other two ranges; in the network plugin, I suppose.
- type of VM: local hypervisor (VirtualBox)
- how are firewalls set up?: None! I have no firewalls; I double-checked.
- kubectl get nodes -o wide
- NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
- cp Ready control-plane 7d v1.32.2 192.168.100.12 <none> Ubuntu 24.04.1 LTS 6.8.0-54-generic containerd://1.7.24
- worker1 NotReady <none> 27h v1.32.2 192.168.100.21 <none> Ubuntu 24.04.1 LTS 6.8.0-54-generic containerd://1.7.24
- kubectl get pods -A -o wide
- NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
- kube-system cilium-envoy-dsjhg 1/1 Running 3 (6h27m ago) 6d23h 192.168.100.12 cp <none> <none>
- kube-system cilium-envoy-qt4j2 1/1 Running 1 (6h34m ago) 27h 192.168.100.21 worker1 <none> <none>
- kube-system cilium-f559g 1/1 Running 3 (6h27m ago) 6d23h 192.168.100.12 cp <none> <none>
- kube-system cilium-operator-f4589588d-w9jpf 1/1 Running 3 (6h27m ago) 6d23h 192.168.100.12 cp <none> <none>
- kube-system cilium-vlpq5 0/1 Init:CrashLoopBackOff 14 (6h1m ago) 27h 192.168.100.21 worker1 <none> <none>
- kube-system coredns-668d6bf9bc-kkpgn 1/1 Running 3 (6h27m ago) 7d 10.0.0.194 cp <none> <none>
- kube-system coredns-668d6bf9bc-nl4mk 1/1 Running 3 (6h27m ago) 7d 10.0.0.209 cp <none> <none>
- kube-system etcd-cp 1/1 Running 4 (6h27m ago) 7d 192.168.100.12 cp <none> <none>
- kube-system kube-apiserver-cp 1/1 Running 4 (6h27m ago) 7d 192.168.100.12 cp <none> <none>
- kube-system kube-controller-manager-cp 1/1 Running 4 (6h27m ago) 7d 192.168.100.12 cp <none> <none>
- kube-system kube-proxy-dtvrn 1/1 Running 1 (6h34m ago) 27h 192.168.100.21 worker1 <none> <none>
- kube-system kube-proxy-vr2js 1/1 Running 4 (6h27m ago) 7d 192.168.100.12 cp <none> <none>
- kube-system kube-scheduler-cp 1/1 Running 4 (6h27m ago) 7d 192.168.100.12 cp <none> <none>
- kubectl get svc,ep -A -o wide
- NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
- default service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 7d <none>
- kube-system service/cilium-envoy ClusterIP None <none> 9964/TCP 6d23h k8s-app=cilium-envoy
- kube-system service/hubble-peer ClusterIP 10.107.23.234 <none> 443/TCP 6d23h k8s-app=cilium
- kube-system service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 7d k8s-app=kube-dns
- NAMESPACE NAME ENDPOINTS AGE
- default endpoints/kubernetes 192.168.100.12:6443 7d
- kube-system endpoints/cilium-envoy 192.168.100.12:9964,192.168.100.21:9964 6d23h
- kube-system endpoints/hubble-peer 192.168.100.12:4244 6d23h
- kube-system endpoints/kube-dns 10.0.0.194:53,10.0.0.209:53,10.0.0.194:53 + 3 more... 7d
There is no podSubnet in my config file. I used the config with kubeadm init --config ~/config/init-defaults.yaml.
Find above: init-defaults.txt
- grep podSubnet ~/config/init-defaults.yaml
- # Nothing
I did configure controlPlaneEndpoint: k8scp:6443,
and the InitConfiguration's localAPIEndpoint {advertiseAddress: 192.168.100.12, bindPort: 6443}.
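If I understand the question correctly, a podSubnet would normally live in the ClusterConfiguration's networking block, something like this (the CIDR below is purely illustrative, not what my file contains):
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
controlPlaneEndpoint: k8scp:6443
networking:
  serviceSubnet: 10.96.0.0/12     # kubeadm default
  podSubnet: 10.200.0.0/16        # illustrative only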
I followed Cilium's default installation (Generic):
https://docs.cilium.io/en/stable/gettingstarted/k8s-install-default/
Sorry for the piecemeal answer; the forum's spam protection kept going off on me, preventing me from posting. It was triggered by Cilium's command lines.
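In short, the quick install from that page boils down to the cilium CLI, something like this (version pinned to match the image seen in the pod describe; exact flags are on the docs page above):
cilium install --version 1.17.1
cilium status --wait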
As far as I understand:
My nodes communicate IP to IP: 192.168.100.12 (cp), 192.168.100.21 (worker1).
Within this network, Kubernetes exposes a virtual service address for the API server, 10.96.0.1, on port 443 (instead of 6443!). But for some reason, Cilium cannot reach the API server through that address from a worker node. The Cilium pod on cp works:
- kubectl get pods -n kube-system -o wide
- NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
- ... # ↓↓↓ cp cilium pod ↓↓↓
- cilium-f559g 1/1 Running 6 (14m ago) 7d21h 192.168.100.12 cp <none> <none>
- ... # ↓↓↓ worker1 cilium pod ↓↓↓
- cilium-x6v75 0/1 Init:CrashLoopBackOff 9 (113s ago) 33m 192.168.100.21
- ...
- kubectl describe pod -n kube-system cilium-x6v75
- ...
- Node: worker1/192.168.100.21
- ...
- Status: Pending
- IP: 192.168.100.21
- IPs:
- IP: 192.168.100.21
- Controlled By: DaemonSet/cilium
- Init Containers:
- config:
- Container ID: containerd://4e3a2d50a0a18b79f60d4730ab4572bd23390407f4c9d28055ada37d15047ae6
- Image: quay.io/cilium/cilium:v1.17.1@sha256:8969bfd9c87cbea91e40665f8ebe327268c99d844ca26d7d12165de07f702866
- Image ID: quay.io/cilium/cilium@sha256:8969bfd9c87cbea91e40665f8ebe327268c99d844ca26d7d12165de07f702866
- Port: <none>
- Host Port: <none>
- Command:
- cilium-dbg
- build-config
- State: Waiting
- Reason: CrashLoopBackOff
- Last State: Terminated
- Reason: Error
- Message: Running
- 2025/03/09 16:13:35 INFO Starting hive
- time="2025-03-09T16:13:35.42543063Z" level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
- time="2025-03-09T16:14:10.447745411Z" level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
- time="2025-03-09T16:14:40.46315002Z" level=error msg="Unable to contact k8s api-server" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" ipAddr="https://10.96.0.1:443" subsys=k8s-client
- 2025/03/09 16:14:40 ERROR Start hook failed function="client.(*compositeClientset).onStart (k8s-client)" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout"
- 2025/03/09 16:14:40 ERROR Start failed error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" duration=1m5.037880326s
- 2025/03/09 16:14:40 INFO Stopping
- Error: Build config failed: failed to start: Get "https://10.96.0.1:443/api/v1/namespaces/kube-system": dial tcp 10.96.0.1:443: i/o timeout
- ...
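Given that i/o timeout, my next step is, I suppose, to test from worker1 whether the virtual address is reachable at all, and whether kube-proxy has installed a translation rule for it (curl with -k since we are not passing the cluster CA):
curl -k https://192.168.100.12:6443/version     # direct path to the API server
curl -k https://10.96.0.1:443/version           # service VIP path the init container uses
sudo iptables-save | grep 10.96.0.1             # kube-proxy's NAT rule for the VIP, if any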
Hi @fasmy,
The describe of the cilium pod shows several timeouts. However, from your cluster's details it seems you have underprovisioned your VirtualBox VMs. For an operational cluster on VBox, each VM needs 2 CPU cores, 8 GB RAM, and a 20+ GB vdisk, fully provisioned. A dynamically provisioned disk causes your kubelets to panic, and as a result they do not deploy critical workloads on the nodes.
However, I cannot make any comments on the cluster bootstrapping and the Cilium CNI installation. From your detailed notes it seems that you deviated from the lab guide and disregarded the installation, configuration and bootstrapping steps provided to you. A custom install is outside of the course's scope, and troubleshooting such a custom installation is out of the scope for the course's support staff. This forum is dedicated to the content of the course, both lectures and lab exercises. I highly recommend following the lab guide in order to maximize your chances of successfully completing the lab exercises.
Regards,
-Chris
Understood! I'll do the course install.
I am glad I did a full install based on the official, up-to-date documentation, as I learned a lot troubleshooting it on my own (Linux, Kubernetes). However, I am indeed getting a bit tired.
I know this very last hurdle has everything to do with Cilium. I used another CNI and it works like a charm.
I wanted to ensure I could do it on my own, with whatever setup I might need. Tutorials made to work the first time often leave you stuck in practice. Now I know... and I did it!