Cilium Crash-Loop on worker nodes

fasmy
fasmy Posts: 26
edited March 7 in LFS258 Class Forum

Hi,

I recently set up my control-plane node. It works as intended. Everything OK!

I added Cilium as my CNI network plugin. OK!

I then joined my worker nodes, worker1 and worker2. OK!

Cilium initializes on them, and then... Init:CrashLoopBackOff!

If Cilium works on my control plane, why would my worker nodes fail?
Both workers are carbon copies of the cp node, taken before I initialized the network.

Feel free to ask questions or offer machine-breaking solutions; I use snapshots to reset my VMs if anything goes wrong. I am quite stuck and unsure how to troubleshoot this situation. I believe there might be a simple connectivity issue between my workers and the cp.


More information:

Using VirtualBox, I created my own control plane with two worker nodes.
They use NAT for direct internet access and a host-only adapter for inter-node communication. Host-only adapter interfaces:

  • cp: 192.168.100.12
  • worker1: 192.168.100.21
  • worker2: 192.168.100.22

ping cp ↔ worker1 ↔ worker2 OK
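
With two adapters per VM, it matters which address the kubelet advertises as the node IP (it can otherwise pick the NAT one). A sketch of pinning it to the host-only address, assuming the standard Ubuntu/kubeadm environment file (adjust the IP per node):

    # /etc/default/kubelet
    KUBELET_EXTRA_ARGS=--node-ip=192.168.100.12   # .21 on worker1, .22 on worker2
    # then: sudo systemctl daemon-reload && sudo systemctl restart kubelet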

All nodes use:

  • kubectl, kubeadm, kubelet v1.32.2
  • containerd for the runtime.

    • systemd cgroup driver enabled: OK
    • pause image updated from pause:3.8 to pause:3.10: OK (see the containerd snippet after this list)
  • hostname set

    • cp cat /etc/hosts
      127.0.0.1 localhost
      192.168.100.12 k8scp cp
      cat /etc/hostname
      cp

    • worker1 cat /etc/hosts
      127.0.0.1 localhost
      192.168.100.12 k8scp
      192.168.100.21 worker1
      cat /etc/hostname
      worker1

    • worker2 cat /etc/hosts
      127.0.0.1 localhost
      192.168.100.12 k8scp
      192.168.100.22 worker2
      cat /etc/hostname
      worker2
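
For reference, here is the containerd side of the two changes noted above, as a sketch of the relevant lines in /etc/containerd/config.toml (section names per containerd 1.7's default CRI config; my actual file may differ slightly):

    [plugins."io.containerd.grpc.v1.cri"]
      sandbox_image = "registry.k8s.io/pause:3.10"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
      SystemdCgroup = true
    # then: sudo systemctl restart containerd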

I used config files to initialize my cluster and join my nodes:

  • cp: init-defaults.yaml (I had to save it as .txt to upload it to the forum)
  • worker1: init-defaults.yaml
    Notice: "unsafeSkipCAVerification: true". The host-only network cannot be reached from the internet, so it's safe!
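
The join side boils down to roughly this, a sketch reconstructed from the notes above (kubeadm v1beta4 API; the real token and CA hash are elided):

    apiVersion: kubeadm.k8s.io/v1beta4
    kind: JoinConfiguration
    discovery:
      bootstrapToken:
        apiServerEndpoint: k8scp:6443
        token: <bootstrap-token>           # elided
        unsafeSkipCAVerification: true     # host-only network is not internet-reachable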

Comments

  • fasmy
    fasmy Posts: 26
    edited March 7

    Here is my status before joining worker nodes:
    kubectl get pods -A

    NAMESPACE     NAME                              READY   STATUS    RESTARTS      AGE
    kube-system   cilium-envoy-dsjhg                1/1     Running   2 (13m ago)   5d19h
    kube-system   cilium-f559g                      1/1     Running   2 (13m ago)   5d19h
    kube-system   cilium-operator-f4589588d-w9jpf   1/1     Running   2 (13m ago)   5d19h
    kube-system   coredns-668d6bf9bc-kkpgn          1/1     Running   2 (13m ago)   5d21h
    kube-system   coredns-668d6bf9bc-nl4mk          1/1     Running   2 (13m ago)   5d21h
    kube-system   etcd-cp                           1/1     Running   3 (13m ago)   5d21h
    kube-system   kube-apiserver-cp                 1/1     Running   3 (13m ago)   5d21h
    kube-system   kube-controller-manager-cp        1/1     Running   3 (13m ago)   5d21h
    kube-system   kube-proxy-vr2js                  1/1     Running   3 (13m ago)   5d21h
    kube-system   kube-scheduler-cp                 1/1     Running   3 (13m ago)   5d21h

    Here is my status after joining worker node 1 (notice the Init:0/6, it will crash-loop or Init:Error):
    kubectl get pods -A

    NAMESPACE     NAME                              READY   STATUS     RESTARTS      AGE
    kube-system   cilium-envoy-dsjhg                1/1     Running    2 (18m ago)   5d19h
    kube-system   cilium-envoy-qt4j2                1/1     Running    0             3m53s
    kube-system   cilium-f559g                      1/1     Running    2 (18m ago)   5d19h
    kube-system   cilium-operator-f4589588d-w9jpf   1/1     Running    2 (18m ago)   5d19h
    kube-system   cilium-vlpq5                      0/1     Init:0/6   2 (60s ago)   3m53s
    kube-system   coredns-668d6bf9bc-kkpgn          1/1     Running    2 (18m ago)   5d21h
    kube-system   coredns-668d6bf9bc-nl4mk          1/1     Running    2 (18m ago)   5d21h
    kube-system   etcd-cp                           1/1     Running    3 (18m ago)   5d21h
    kube-system   kube-apiserver-cp                 1/1     Running    3 (18m ago)   5d21h
    kube-system   kube-controller-manager-cp        1/1     Running    3 (18m ago)   5d21h
    kube-system   kube-proxy-dtvrn                  1/1     Running    0             3m53s
    kube-system   kube-proxy-vr2js                  1/1     Running    3 (18m ago)   5d21h
    kube-system   kube-scheduler-cp                 1/1     Running    3 (18m ago)   5d21h

    UPDATE:
    kube-system   cilium-vlpq5   0/1   Init:Error   2 (60s ago)   3m53s

  • fasmy
    fasmy Posts: 26

    Here is the description of the failing pod!
    Notice the event:

    Warning BackOff 57s (x14 over 6m58s) kubelet Back-off restarting failed container config in pod cilium-vlpq5_kube-system(29d95fb8-38eb-4d33-9091-287827c2e73b)
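
    To dig further, the failing init container's logs can be pulled directly (it is named config per the event above):

    kubectl -n kube-system logs cilium-vlpq5 -c config
    kubectl -n kube-system describe pod cilium-vlpq5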

  • fasmy
    fasmy Posts: 26

    I did not set up IPv6, so I wonder why Cilium created IPv6 interfaces...
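
    One way to confirm what Cilium thinks about IPv6 is its ConfigMap (assuming the default name cilium-config used by the standard install):

    kubectl -n kube-system get configmap cilium-config -o yaml | grep -i ipv6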

  • chrispokorni
    chrispokorni Posts: 2,449

    Hi @fasmy,

    Is this content related in any way to the LFS253 - Containers Fundamentals course? This is the forum for LFS253...

    Let me know if this was not your intent, and if so, which course it relates to, so I can move this entire thread to the forum it belongs to.

    Regards,
    -Chris

  • fasmy
    fasmy Posts: 26

    I am very sorry about that. It is related to LFS258; I think I'll update my shortcuts now!

  • chrispokorni
    chrispokorni Posts: 2,449

    Hi @fasmy,

    Thank you for confirming the forum. I moved this thread to the LFS258 forum.

    By taking a quick look at the describe details, I see a few timeout errors.
    It is not yet clear what causes them, so we'd need to evaluate the state of your cluster overall.

    Please provide the following:

    • VM OS distribution and release,
    • CPU count,
    • RAM,
    • disk size (whether dynamically allocated or fully pre-allocated),
    • how many network interfaces and types (nat, bridge, host, ...),
    • IP addresses assigned to each interface,
    • type of VM - local hypervisor or cloud,
    • how firewalls are set up by the hypervisor, or by the cloud VPC with firewalls/SGs, etc.

    In addition, please provide outputs of:

    kubectl get nodes -o wide
    kubectl get pods -A -o wide
    kubectl get svc,ep -A -o wide
    

    Assuming you followed the instructions as provided, that the kubeadm-config.yaml manifest was copied into /root/, and that it was used to initialize the cluster according to the lab guide, what is the value of the podSubnet parameter:

    grep podSubnet /root/kubeadm-config.yaml
    

    Update the path to the cilium-cni.yaml manifest if needed, and provide the value of the cluster-pool-ipv4-cidr parameter, again assuming that the manifest was used to launch the Cilium CNI plugin according to the lab guide:

    grep cluster-pool-ipv4-cidr /home/student/LFS258/SOLUTIONS/s_03/cilium-cni.yaml
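
    These two ranges are expected to line up with each other and must not overlap your node or service networks. For illustration only (use whatever your files actually contain), the pair would look something like:

    # kubeadm-config.yaml (illustrative value)
    networking:
      podSubnet: 10.1.0.0/16
    # cilium-cni.yaml, in the cilium-config ConfigMap data (illustrative value, kept in sync)
    cluster-pool-ipv4-cidr: "10.1.0.0/16"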
    

    Regards,
    -Chris

  • fasmy
    fasmy Posts: 26
    edited March 8
    • VM OS distribution and release: Oracle VirtualBox → ubuntu-24.04.1-live-server-amd64.iso
    • CPU count: 2
    • RAM: 4096 MB
    • disk size: 12 GB, Dynamically allocated differencing storage
    • how many network interfaces and types:
      VirtualBox config:

      • Adapter 1: default NAT (port forwarding: kubeAPI 6443→6443, ssh 222X→22, replace X with {cp=2, worker1=3, worker2=4})
      • Adapter 2: host-only adapter (created in VirtualBox Networks) on 192.168.100.1/24, DHCP disabled (see the VBoxManage sketch at the end of this post)

      cp, worker1, and worker2 interfaces (before cilium):

      • lo (loopback) 127.0.0.1/8
      • enp0s3 (NAT) 10.0.2.15/24 Note: cp, worker1, and worker2 use the same address.
      • enp0s8 (Host-Only) 192.168.100.XX
        • cp → 12 (the new cp; previously 11)
        • worker1 → 21
        • worker2 → 22

      cp interfaces (after cilium):

      • Please find the ip a dump in the above-provided file cp-ip-tables.txt
      • Note: only cp has new interfaces. No changes on the worker nodes.
      • Cilium uses IPv6! I didn't configure my network to use IPv6. The issue most likely lies around here. I might have to rebuild my VMs (from snapshot). But before starting over, I would appreciate confirmation that I am not misguided. ;)
    • IP addresses assigned to each interface
      I believe I had trouble understanding the intricacies of Kubernetes networking. As I understand it, there are three address ranges that must be set up without collisions, but I couldn't name them. I know I used 192.168.100.X to connect my nodes to the kube-apiserver, as seen in init-defaults.txt (file above), but I wonder where I could configure the other two ranges. In the network plugin, I suppose.

    • type of VM: local hypervisor (VirtualBox)

    • how are firewalls set up?: None! I have no firewalls; I double-checked.
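
    For completeness, I created the adapters in the VirtualBox GUI; the command-line equivalent would be roughly this sketch (VM name and the ssh host port are illustrative, per the table above):

    VBoxManage hostonlyif create                           # creates vboxnet0
    VBoxManage hostonlyif ipconfig vboxnet0 --ip 192.168.100.1 --netmask 255.255.255.0
    VBoxManage modifyvm "cp" --nic2 hostonly --hostonlyadapter2 vboxnet0
    VBoxManage modifyvm "cp" --natpf1 "kubeapi,tcp,,6443,,6443"
    VBoxManage modifyvm "cp" --natpf1 "ssh,tcp,,2222,,22"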
  • fasmy
    fasmy Posts: 26
    kubectl get nodes -o wide
        NAME      STATUS     ROLES           AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
        cp        Ready      control-plane   7d    v1.32.2   192.168.100.12   <none>        Ubuntu 24.04.1 LTS   6.8.0-54-generic   containerd://1.7.24
        worker1   NotReady   <none>          27h   v1.32.2   192.168.100.21   <none>        Ubuntu 24.04.1 LTS   6.8.0-54-generic   containerd://1.7.24
    
  • fasmy
    fasmy Posts: 26
    kubectl get pods -A -o wide
        NAMESPACE     NAME                              READY   STATUS                  RESTARTS        AGE     IP               NODE      NOMINATED NODE   READINESS GATES
        kube-system   cilium-envoy-dsjhg                1/1     Running                 3 (6h27m ago)   6d23h   192.168.100.12   cp        <none>           <none>
        kube-system   cilium-envoy-qt4j2                1/1     Running                 1 (6h34m ago)   27h     192.168.100.21   worker1   <none>           <none>
        kube-system   cilium-f559g                      1/1     Running                 3 (6h27m ago)   6d23h   192.168.100.12   cp        <none>           <none>
        kube-system   cilium-operator-f4589588d-w9jpf   1/1     Running                 3 (6h27m ago)   6d23h   192.168.100.12   cp        <none>           <none>
        kube-system   cilium-vlpq5                      0/1     Init:CrashLoopBackOff   14 (6h1m ago)   27h     192.168.100.21   worker1   <none>           <none>
        kube-system   coredns-668d6bf9bc-kkpgn          1/1     Running                 3 (6h27m ago)   7d      10.0.0.194       cp        <none>           <none>
        kube-system   coredns-668d6bf9bc-nl4mk          1/1     Running                 3 (6h27m ago)   7d      10.0.0.209       cp        <none>           <none>
        kube-system   etcd-cp                           1/1     Running                 4 (6h27m ago)   7d      192.168.100.12   cp        <none>           <none>
        kube-system   kube-apiserver-cp                 1/1     Running                 4 (6h27m ago)   7d      192.168.100.12   cp        <none>           <none>
        kube-system   kube-controller-manager-cp        1/1     Running                 4 (6h27m ago)   7d      192.168.100.12   cp        <none>           <none>
        kube-system   kube-proxy-dtvrn                  1/1     Running                 1 (6h34m ago)   27h     192.168.100.21   worker1   <none>           <none>
        kube-system   kube-proxy-vr2js                  1/1     Running                 4 (6h27m ago)   7d      192.168.100.12   cp        <none>           <none>
        kube-system   kube-scheduler-cp                 1/1     Running                 4 (6h27m ago)   7d      192.168.100.12   cp        <none>           <none>
    
  • fasmy
    fasmy Posts: 26
    kubectl get svc,ep -A -o wide
        NAMESPACE     NAME                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE     SELECTOR
        default       service/kubernetes     ClusterIP   10.96.0.1       <none>        443/TCP                  7d      <none>
        kube-system   service/cilium-envoy   ClusterIP   None            <none>        9964/TCP                 6d23h   k8s-app=cilium-envoy
        kube-system   service/hubble-peer    ClusterIP   10.107.23.234   <none>        443/TCP                  6d23h   k8s-app=cilium
        kube-system   service/kube-dns       ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP,9153/TCP   7d      k8s-app=kube-dns
    
        NAMESPACE     NAME                     ENDPOINTS                                               AGE
        default       endpoints/kubernetes     192.168.100.12:6443                                     7d
        kube-system   endpoints/cilium-envoy   192.168.100.12:9964,192.168.100.21:9964                 6d23h
        kube-system   endpoints/hubble-peer    192.168.100.12:4244                                     6d23h
        kube-system   endpoints/kube-dns       10.0.0.194:53,10.0.0.209:53,10.0.0.194:53 + 3 more...   7d
    
  • fasmy
    fasmy Posts: 26

    There is no podSubnet in my config file. I initialized with kubeadm init --config ~/config/init-defaults.yaml.
    Find above: init-defaults.txt

    grep podSubnet ~/config/init-defaults.yaml
    # Nothing
    

    I did configure controlPlaneEndpoint: k8scp:6443,
    and the InitConfiguration's localAPIEndpoint {advertiseAddress: 192.168.100.12, bindPort: 6443}.
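
    So the relevant part of init-defaults.yaml boils down to something like this (a sketch reconstructed from the values above; the complete file is in the attached init-defaults.txt):

    apiVersion: kubeadm.k8s.io/v1beta4
    kind: InitConfiguration
    localAPIEndpoint:
      advertiseAddress: 192.168.100.12
      bindPort: 6443
    ---
    apiVersion: kubeadm.k8s.io/v1beta4
    kind: ClusterConfiguration
    kubernetesVersion: v1.32.2          # assumed to match the installed packages
    controlPlaneEndpoint: k8scp:6443
    # no networking.podSubnet is set, hence the empty grep above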

  • fasmy
    fasmy Posts: 26
    edited March 8

    I followed Cilium's default installation (Generic):
    https://docs.cilium.io/en/stable/gettingstarted/k8s-install-default/

    Sorry for the piecemeal answers; the forum's spam protection kept going off on me and preventing me from posting. It was triggered by the Cilium command lines.

  • fasmy
    fasmy Posts: 26

    As far as I understand:

    My pods communicate IP to IP: 192.168.100.12 (cp), 192.168.100.21 (worker1).

    Within this network, the network plugin, Cilium, uses a virtual (service) network where the API server is assigned the IP
    10.96.0.1 and is reached on port 443 (instead of 6443!).
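
    I can confirm that mapping from the cp (port 443 on the ClusterIP forwards to 6443 on the real endpoint):

    kubectl get svc kubernetes -o wide
    kubectl get endpoints kubernetes     # should show 192.168.100.12:6443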

    But for some reason, Cilium cannot connect to the API server from the worker node. The Cilium pod on the cp works:

    kubectl get pods -n kube-system -o wide
    
    NAME                              READY   STATUS                  RESTARTS       AGE     IP               NODE      NOMINATED NODE   READINESS GATES
    ... #  ↓↓↓  cp cilium pod  ↓↓↓
    cilium-f559g                      1/1     Running                 6 (14m ago)    7d21h   192.168.100.12   cp        <none>           <none>
    ... #   ↓↓↓  worker1 cilium pod  ↓↓↓
    cilium-x6v75                      0/1     Init:CrashLoopBackOff   9 (113s ago)   33m     192.168.100.21   
    ...
    
    kubectl describe pod -n kube-system cilium-x6v75
    
    ...
    Node:                 worker1/192.168.100.21
    ...
    Status:               Pending
    IP:                   192.168.100.21
    IPs:
      IP:           192.168.100.21
    Controlled By:  DaemonSet/cilium
    Init Containers:
      config:
        Container ID:  containerd://4e3a2d50a0a18b79f60d4730ab4572bd23390407f4c9d28055ada37d15047ae6
        Image:         quay.io/cilium/cilium:v1.17.1@sha256:8969bfd9c87cbea91e40665f8ebe327268c99d844ca26d7d12165de07f702866
        Image ID:      quay.io/cilium/cilium@sha256:8969bfd9c87cbea91e40665f8ebe327268c99d844ca26d7d12165de07f702866
        Port:          <none>
        Host Port:     <none>
        Command:
          cilium-dbg
          build-config
        State:       Waiting
          Reason:    CrashLoopBackOff
        Last State:  Terminated
          Reason:    Error
          Message:   Running
    2025/03/09 16:13:35 INFO Starting hive
    time="2025-03-09T16:13:35.42543063Z" level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
    time="2025-03-09T16:14:10.447745411Z" level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
    time="2025-03-09T16:14:40.46315002Z" level=error msg="Unable to contact k8s api-server" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" ipAddr="https://10.96.0.1:443" subsys=k8s-client
    2025/03/09 16:14:40 ERROR Start hook failed function="client.(*compositeClientset).onStart (k8s-client)" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout"
    2025/03/09 16:14:40 ERROR Start failed error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" duration=1m5.037880326s
    2025/03/09 16:14:40 INFO Stopping
    Error: Build config failed: failed to start: Get "https://10.96.0.1:443/api/v1/namespaces/kube-system": dial tcp 10.96.0.1:443: i/o timeout
    
    ...
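
    Given that i/o timeout, the next thing to check from worker1 is whether the service VIP and the real API endpoint are reachable at all (assuming curl is available on the node):

    # run on worker1
    curl -k https://192.168.100.12:6443/version   # direct to the API server
    curl -k https://10.96.0.1:443/version         # via the service VIP that Cilium times out on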
    
  • chrispokorni
    chrispokorni Posts: 2,449

    Hi @fasmy,

    The describe of the cilium pod shows several timeouts. However, from your cluster's details it seems you have underprovisioned your VirtualBox VMs. For an operational cluster on VBox, each VM needs 2 CPU cores, 8 GB RAM, and a 20+ GB vdisk, fully pre-allocated. A dynamically provisioned disk causes your kubelets to panic, and as a result they do not deploy critical workloads on the nodes.
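
    You can verify what each VM actually has with standard tools on each node:

    nproc      # CPU cores
    free -h    # RAM
    df -h /    # disk size and free space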

    However, I cannot comment on the cluster bootstrapping and the Cilium CNI installation. From your detailed notes it seems that you deviated from the lab guide and disregarded the installation, configuration, and bootstrapping steps provided to you. A custom install is outside the course's scope, and troubleshooting such a custom installation is out of scope for the course's support staff. This forum is dedicated to the content of the course, both lectures and lab exercises. I highly recommend following the lab guide in order to maximize your chances of successfully completing the lab exercises.

    Regards,
    -Chris

  • fasmy
    fasmy Posts: 26

    Understood! I'll do the course install.

    I am glad I did a full install based on the official, up-to-date documentation, as I learned a lot troubleshooting it on my own (Linux, Kubernetes). However, I am indeed getting a bit tired.

    I know this very last hurdle has everything to do with Cilium. I used another CNI and it works like a charm.

    I wanted to ensure I could do it on my own, with whatever setup I might need. Tutorials designed to work on the first try often leave you stuck in practice. Now I know... and I did it! :)
