
Cilium Crash-Loop on worker nodes

Posts: 22
edited March 7 in LFS258 Class Forum

Hi,

I recently set up my control-plane node. It works as intended. Everything OK!

I added cilium as my networking pod. OK!

I now join my worker nodes: worker1 and worker2 OK!

Cilium initializes, and then... Init:CrashLoopBackOff!

If Cilium on my control-plane works, why would my worker nodes fail?
Both workers are carbon copies of the cp node, taken before I initialized the network.

Feel free to ask questions, or offer machine-breaking solutions. I use snapshots to reset my VMs if anything goes wrong. I am quite stuck and unsure how to troubleshoot this situation. I believe there might be a simple connectivity issue between my workers and the cp.


More information:

Using VirtualBox, I created my own control plane with two worker nodes.
They use NAT for a direct internet connection, and a host-only adapter for inter-node communication. Host-only adapter interfaces:

  • cp: 192.168.100.12
  • worker1: 192.168.100.21
  • worker2: 192.168.100.22

ping cp ↔ worker1 ↔ worker2 OK
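
Ping only proves ICMP works, so as an extra sanity check I could also probe the API server port from each worker (a generic check, assuming the kube-apiserver listens on the default 6443 on the cp):

  nc -zvw3 192.168.100.12 6443        # plain TCP handshake to the kube-apiserver port
  curl -k https://k8scp:6443/version  # unauthenticated endpoint; on default kubeadm settings it returns the server version JSON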

All nodes use:

  • kubectl, kubeadm, kubelet v1.32.2
  • containerd for the runtime:

    • configured to use the systemd cgroup driver: OK
    • pause image updated from pause:3.8 to pause:3.10: OK
      (a sketch of these containerd changes follows after this list)
  • hostname set

    • cp cat /etc/hosts
      127.0.0.1 localhost
      192.168.100.12 k8scp cp
      cat /etc/hostname
      cp

    • worker1 cat /etc/hosts
      127.0.0.1 localhost
      192.168.100.12 k8scp
      192.168.100.21 worker1
      cat /etc/hostname
      worker1

    • worker2 cat /etc/hosts
      127.0.0.1 localhost
      192.168.100.12 k8scp
      192.168.100.22 worker2
      cat /etc/hostname
      worker2
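
For reference, the containerd changes mentioned above amount to roughly this (a sketch; it assumes the default layout of /etc/containerd/config.toml as generated by containerd config default):

  sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
  sudo sed -i 's|pause:3.8|pause:3.10|' /etc/containerd/config.toml   # the sandbox_image line
  sudo systemctl restart containerd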

I used config files to initialize my cluster and join my nodes:

  • cp: init-defaults.yaml (I had to save it as txt to upload it to the forum)
  • worker1: init-defaults.yaml
    Notice: "unsafeSkipCAVerification: true". The host-only network cannot be reached from the internet, so it's safe! (See the note just below.)


Comments

  • Posts: 22
    edited March 7

    Here is my status before joining worker nodes:
    kubectl get pods -A

    NAMESPACE NAME READY STATUS RESTARTS AGE
    kube-system cilium-envoy-dsjhg 1/1 Running 2 (13m ago) 5d19h
    kube-system cilium-f559g 1/1 Running 2 (13m ago) 5d19h
    kube-system cilium-operator-f4589588d-w9jpf 1/1 Running 2 (13m ago) 5d19h
    kube-system coredns-668d6bf9bc-kkpgn 1/1 Running 2 (13m ago) 5d21h
    kube-system coredns-668d6bf9bc-nl4mk 1/1 Running 2 (13m ago) 5d21h
    kube-system etcd-cp 1/1 Running 3 (13m ago) 5d21h
    kube-system kube-apiserver-cp 1/1 Running 3 (13m ago) 5d21h
    kube-system kube-controller-manager-cp 1/1 Running 3 (13m ago) 5d21h
    kube-system kube-proxy-vr2js 1/1 Running 3 (13m ago) 5d21h
    kube-system kube-scheduler-cp 1/1 Running 3 (13m ago) 5d21h

    Here is my status after joining worker node 1 (notice the Init:0/6, it will crash-loop or Init:Error):
    kubectl get pods -A

    NAMESPACE NAME READY STATUS RESTARTS AGE
    kube-system cilium-envoy-dsjhg 1/1 Running 2 (18m ago) 5d19h
    kube-system cilium-envoy-qt4j2 1/1 Running 0 3m53s
    kube-system cilium-f559g 1/1 Running 2 (18m ago) 5d19h
    kube-system cilium-operator-f4589588d-w9jpf 1/1 Running 2 (18m ago) 5d19h
    kube-system cilium-vlpq5 0/1 Init:0/6 2 (60s ago) 3m53s
    kube-system coredns-668d6bf9bc-kkpgn 1/1 Running 2 (18m ago) 5d21h
    kube-system coredns-668d6bf9bc-nl4mk 1/1 Running 2 (18m ago) 5d21h
    kube-system etcd-cp 1/1 Running 3 (18m ago) 5d21h
    kube-system kube-apiserver-cp 1/1 Running 3 (18m ago) 5d21h
    kube-system kube-controller-manager-cp 1/1 Running 3 (18m ago) 5d21h
    kube-system kube-proxy-dtvrn 1/1 Running 0 3m53s
    kube-system kube-proxy-vr2js 1/1 Running 3 (18m ago) 5d21h
    kube-system kube-scheduler-cp 1/1 Running 3 (18m ago) 5d21h

    UPDATE:
    | kube-system | cilium-vlpq5 | 0/1 | Init:Error | 2 (60s ago) | 3m53s |

  • Posts: 22

    Here is the description of the failing pod!
    Notice the event:

    Warning BackOff 57s (x14 over 6m58s) kubelet Back-off restarting failed container config in pod cilium-vlpq5_kube-system(29d95fb8-38eb-4d33-9091-287827c2e73b)
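
    If it helps, the failing container is the config init container, so its logs from the last failed attempt can be pulled directly (standard kubectl, nothing course-specific):

    kubectl -n kube-system logs cilium-vlpq5 -c config --previous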

  • Posts: 22

    I did not set up IPv6, so I wonder why Cilium created IPv6 interfaces...
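
    If anyone wants to double-check, Cilium's IPv6 behaviour should be visible in its ConfigMap (assuming the default cilium-config name):

    kubectl -n kube-system get configmap cilium-config -o yaml | grep -i ipv6
    # enable-ipv6 is normally "false" by default; link-local fe80:: addresses on the cilium_* interfaces can appear even then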

  • Posts: 2,434

    Hi @fasmy,

    Is this content related in any way to the LFS253 - Containers Fundamentals course? This is the forum for LFS253...

    Let me know if this was not your intent, and if so, which course it relates to, so I can move this entire thread to the forum where it belongs.

    Regards,
    -Chris

  • Posts: 22

    I am very sorry about that. It is related to LFS258, I think I'll update my shortcuts now!

  • Posts: 2,434

    Hi @fasmy,

    Thank you for confirming the forum. I moved this thread to the LFS258 forum.

    Taking a quick look at the describe details, I see a few timeout errors.
    It is not yet clear what causes them, so we need to evaluate the overall state of your cluster.

    Please provide the following:

    • VM OS distribution and release,
    • CPU count,
    • RAM,
    • disk size (whether dynamically allocated or fully pre-allocated),
    • how many network interfaces and types (NAT, bridged, host-only, ...),
    • IP addresses assigned to each interface,
    • type of VM - local hypervisor or cloud,
    • how firewalls are set up by the hypervisor, or by the cloud VPC with firewalls/SGs, etc.

    In addition, please provide outputs of:

    1. kubectl get nodes -o wide
    2. kubectl get pods -A -o wide
    3. kubectl get svc,ep -A -o wide

    Assuming you followed the instructions as provided, that the kubeadm-config.yaml manifest was copied into /root/, and that it was used to initialize the cluster according to the lab guide, what is the value of the podSubnet parameter:

    grep podSubnet /root/kubeadm-config.yaml

    Update the path to the cilium-cni.yaml manifest if needed, and provide the value of the cluster-pool-ipv4-cidr parameter, again assuming that the manifest was used to launch the Cilium CNI plugin according to the lab guide:

    grep cluster-pool-ipv4-cidr /home/student/LFS258/SOLUTIONS/s_03/cilium-cni.yaml

    Regards,
    -Chris

  • Posts: 22
    edited March 8
    • VM OS distribution and release: Oracle VirtualBox → ubuntu-24.04.1-live-server-amd64.iso
    • CPU count: 2
    • RAM: 4096 MB
    • disk size: 12 GB, Dynamically allocated differencing storage
    • how many network interfaces and types:
      VirtualBox config:

      • Adapter1: Default NAT (port forwarding: kubeAPI 6443→6443, ssh 222X→22, replace X with {cp=2, worker1=3, worker2=4})
      • Adapter2: Host-only Adapter (created in VirtualBox Networks) on 192.168.100.1/24, DHCP Disabled

      cp, worker1, and worker2 interfaces (before cilium):

      • lo (loopback) 127.0.0.1/8
      • enp0s3 (NAT) 10.0.2.15/24 Note: cp, worker1, and worker2 use the same address.
      • enp0s8 (Host-Only) 192.168.100.XX
        • cp → 11 (old), 12 (new cp)
        • worker1 → 21
        • worker2 → 22

      cp interfaces (after cilium):

      • Please find the ip a dump in the above-provided file cp-ip-tables.txt.
      • Note: only the cp has new interfaces. No changes on the worker nodes.
      • Cilium uses IPv6! I didn't configure my network to use IPv6. The issue most likely lies around here. I might have to rebuild my VMs (from snapshots). But before starting over, I would appreciate confirmation that I am not misguided. ;)
    • IP addresses assigned to each interface
      I had some trouble understanding the intricacies of Kubernetes networking. As I understand it, there are three address ranges that we must set up and keep from colliding, but I couldn't name them. I know I used 192.168.100.X to connect my nodes to the kube-apiserver, as seen in init-defaults.txt (file above), but I do wonder where the other two ranges are configured. In the networking pod, I suppose. (See the sketch after this list.)

    • type of VM: local hypervisor (VirtualBox).

    • how are firewalls set up?: None! I have no firewalls; I double-checked.
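
    About the three address ranges mentioned in the IP-addresses item above: as far as I understand, they are the node network (192.168.100.0/24 here), the pod network (podSubnet in kubeadm; in this setup the pod IPs actually come from Cilium's cluster-pool-ipv4-cidr), and the service network (serviceSubnet, 10.96.0.0/12 by default). A rough way to read the last two back from a running cluster (generic commands, assuming kubeadm's and Cilium's default ConfigMap names):

    kubectl -n kube-system get configmap kubeadm-config -o yaml | grep -E 'podSubnet|serviceSubnet'   # cluster-level view of the pod/service ranges
    kubectl -n kube-system get configmap cilium-config -o yaml | grep cluster-pool-ipv4-cidr          # the pool Cilium allocates pod IPs from
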
  • Posts: 22
    kubectl get nodes -o wide

    NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
    cp Ready control-plane 7d v1.32.2 192.168.100.12 <none> Ubuntu 24.04.1 LTS 6.8.0-54-generic containerd://1.7.24
    worker1 NotReady <none> 27h v1.32.2 192.168.100.21 <none> Ubuntu 24.04.1 LTS 6.8.0-54-generic containerd://1.7.24
  • Posts: 22
    kubectl get pods -A -o wide

    NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    kube-system cilium-envoy-dsjhg 1/1 Running 3 (6h27m ago) 6d23h 192.168.100.12 cp <none> <none>
    kube-system cilium-envoy-qt4j2 1/1 Running 1 (6h34m ago) 27h 192.168.100.21 worker1 <none> <none>
    kube-system cilium-f559g 1/1 Running 3 (6h27m ago) 6d23h 192.168.100.12 cp <none> <none>
    kube-system cilium-operator-f4589588d-w9jpf 1/1 Running 3 (6h27m ago) 6d23h 192.168.100.12 cp <none> <none>
    kube-system cilium-vlpq5 0/1 Init:CrashLoopBackOff 14 (6h1m ago) 27h 192.168.100.21 worker1 <none> <none>
    kube-system coredns-668d6bf9bc-kkpgn 1/1 Running 3 (6h27m ago) 7d 10.0.0.194 cp <none> <none>
    kube-system coredns-668d6bf9bc-nl4mk 1/1 Running 3 (6h27m ago) 7d 10.0.0.209 cp <none> <none>
    kube-system etcd-cp 1/1 Running 4 (6h27m ago) 7d 192.168.100.12 cp <none> <none>
    kube-system kube-apiserver-cp 1/1 Running 4 (6h27m ago) 7d 192.168.100.12 cp <none> <none>
    kube-system kube-controller-manager-cp 1/1 Running 4 (6h27m ago) 7d 192.168.100.12 cp <none> <none>
    kube-system kube-proxy-dtvrn 1/1 Running 1 (6h34m ago) 27h 192.168.100.21 worker1 <none> <none>
    kube-system kube-proxy-vr2js 1/1 Running 4 (6h27m ago) 7d 192.168.100.12 cp <none> <none>
    kube-system kube-scheduler-cp 1/1 Running 4 (6h27m ago) 7d 192.168.100.12 cp <none> <none>
  • Posts: 22
    kubectl get svc,ep -A -o wide

    NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
    default service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 7d <none>
    kube-system service/cilium-envoy ClusterIP None <none> 9964/TCP 6d23h k8s-app=cilium-envoy
    kube-system service/hubble-peer ClusterIP 10.107.23.234 <none> 443/TCP 6d23h k8s-app=cilium
    kube-system service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 7d k8s-app=kube-dns

    NAMESPACE NAME ENDPOINTS AGE
    default endpoints/kubernetes 192.168.100.12:6443 7d
    kube-system endpoints/cilium-envoy 192.168.100.12:9964,192.168.100.21:9964 6d23h
    kube-system endpoints/hubble-peer 192.168.100.12:4244 6d23h
    kube-system endpoints/kube-dns 10.0.0.194:53,10.0.0.209:53,10.0.0.194:53 + 3 more... 7d
  • Posts: 22

    There is no podSubnet in my config file. I used the config with kubeadm init --config ~/config/init-defaults.yaml.
    Find the file above: init-defaults.txt

    grep podSubnet ~/config/init-defaults.yaml
    # Nothing

    I did configure controlPlaneEndpoint: k8scp:6443,
    and the InitConfiguration's localAPIEndpoint {advertiseAddress: 192.168.100.12, bindPort: 6443}.
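
    For reference, if a pod subnet were to be set, it would live under the ClusterConfiguration's networking section of the init config, roughly like this (a sketch only; the CIDR is an example and must not overlap the node or service networks):

    # fragment of a kubeadm ClusterConfiguration (example values, not from the lab guide):
    #   networking:
    #     serviceSubnet: 10.96.0.0/12
    #     podSubnet: 10.200.0.0/16
    grep -A3 'networking:' ~/config/init-defaults.yaml   # should show the section, with no podSubnet line in my file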

  • Posts: 22
    edited March 8

    I followed Cilium's default installation (Generic):
    https://docs.cilium.io/en/stable/gettingstarted/k8s-install-default/

    Sorry for the piecemeal answers; the forum's protection kept going off on me, preventing me from posting. It was triggered by Cilium's command lines.

  • Posts: 22

    As far as I understand:

    My nodes communicate IP to IP: 192.168.100.12 (cp), 192.168.100.21 (worker1).

    On top of this network, the networking pod, Cilium, relies on a virtual network where the API server is assigned the IP 10.96.0.1 and uses port 443 (instead of 6443!).

    But for some reason, Cilium cannot connect to the API server from a worker node. The Cilium pod on the cp works:

    kubectl get pods -n kube-system -o wide

    NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    ... # ↓↓↓ cp cilium pod ↓↓↓
    cilium-f559g 1/1 Running 6 (14m ago) 7d21h 192.168.100.12 cp <none> <none>
    ... # ↓↓↓ worker1 cilium pod ↓↓↓
    cilium-x6v75 0/1 Init:CrashLoopBackOff 9 (113s ago) 33m 192.168.100.21
    ...
    kubectl describe pod -n kube-system cilium-x6v75

    ...
    Node: worker1/192.168.100.21
    ...
    Status: Pending
    IP: 192.168.100.21
    IPs:
      IP: 192.168.100.21
    Controlled By: DaemonSet/cilium
    Init Containers:
      config:
        Container ID: containerd://4e3a2d50a0a18b79f60d4730ab4572bd23390407f4c9d28055ada37d15047ae6
        Image: quay.io/cilium/cilium:v1.17.1@sha256:8969bfd9c87cbea91e40665f8ebe327268c99d844ca26d7d12165de07f702866
        Image ID: quay.io/cilium/cilium@sha256:8969bfd9c87cbea91e40665f8ebe327268c99d844ca26d7d12165de07f702866
        Port: <none>
        Host Port: <none>
        Command:
          cilium-dbg
          build-config
        State: Waiting
          Reason: CrashLoopBackOff
        Last State: Terminated
          Reason: Error
          Message: Running
            2025/03/09 16:13:35 INFO Starting hive
            time="2025-03-09T16:13:35.42543063Z" level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
            time="2025-03-09T16:14:10.447745411Z" level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
            time="2025-03-09T16:14:40.46315002Z" level=error msg="Unable to contact k8s api-server" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" ipAddr="https://10.96.0.1:443" subsys=k8s-client
            2025/03/09 16:14:40 ERROR Start hook failed function="client.(*compositeClientset).onStart (k8s-client)" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout"
            2025/03/09 16:14:40 ERROR Start failed error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" duration=1m5.037880326s
            2025/03/09 16:14:40 INFO Stopping
            Error: Build config failed: failed to start: Get "https://10.96.0.1:443/api/v1/namespaces/kube-system": dial tcp 10.96.0.1:443: i/o timeout

    ...
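
    So the config init container times out reaching 10.96.0.1:443, which (if I understand correctly) kube-proxy on the worker should translate to the real endpoint 192.168.100.12:6443. A few checks I plan to run on worker1 (plain curl and iptables, nothing lab-specific):

    curl -k --connect-timeout 5 https://192.168.100.12:6443/version   # API server reached directly
    curl -k --connect-timeout 5 https://10.96.0.1:443/version         # via the Service ClusterIP: this is the path that times out
    sudo iptables-save | grep 10.96.0.1                               # kube-proxy's NAT rules for the kubernetes Service
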
  • Posts: 2,434

    Hi @fasmy,

    The describe output of the cilium pod shows several timeouts. In addition, from your cluster's details it seems you have underprovisioned your VirtualBox VMs. For an operational cluster on VBox, each VM needs 2 CPU cores, 8 GB RAM, and a 20+ GB vdisk, fully provisioned. A dynamically provisioned disk can cause your kubelets to panic, and as a result they do not deploy critical workloads on the nodes.

    However, I cannot make any comments on the cluster bootstrapping and the Cilium CNI installation. From your detailed notes, it seems that you deviated from the lab guide and disregarded the installation, configuration, and bootstrapping steps provided to you. A custom install is outside of the course's scope, and troubleshooting such a custom installation is out of scope for the course's support staff. This forum is dedicated to the content of the course, both lectures and lab exercises. I highly recommend following the lab guide in order to maximize your chances of successfully completing the lab exercises.
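
    A quick, generic way to check whether the worker's kubelet is reporting resource pressure (plain kubectl, not lab-specific):

    kubectl describe node worker1 | grep -A 8 'Conditions:'   # look for MemoryPressure / DiskPressure / PIDPressure set to True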

    Regards,
    -Chris

  • Posts: 22

    Understood! I'll do the course install.

    I am glad I did a full install based on the official, up-to-date documentation, as I learned a lot about Linux and Kubernetes while troubleshooting it on my own. However, I am indeed getting a bit tired.

    I know this very last hurdle has everything to do with Cilium. I used another CNI and it works like a charm.

    I wanted to ensure I could do it on my own, with whatever setup I might need. Tutorials made to work the first time often leave you stuck in practice. Now I know... and I did it! :)
