Cilium Crash-Loop on worker nodes

fasmy
fasmy Posts: 26
edited March 7 in LFS258 Class Forum

Hi,

I recently set up my control-plane node. It works as intended. Everything OK!

I added Cilium as my CNI network plugin. OK!

I then joined my worker nodes, worker1 and worker2. OK!

Cilium initializes on them, and then... Init:CrashLoopBackOff!

If Cilium works on my control plane, why would my worker nodes fail?
Both workers are carbon copies of the cp node, taken before I initialized the network.

Feel free to ask questions or offer machine-breaking solutions; I use snapshots to reset my VMs if anything goes wrong. I am quite stuck and unsure how to troubleshoot this situation. I believe there might be a simple connectivity issue between my workers and the cp.


More information:

Using VirtualBox, I created my own control plane with two worker nodes.
They use NAT for direct internet access and a host-only adapter for inter-node communication. Host-only adapter interfaces:

  • cp: 192.168.100.12
  • worker1: 192.168.100.21
  • worker2: 192.168.100.22

ping cp ↔ worker1 ↔ worker2 OK
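
With two adapters per VM, it matters which address the kubelet advertises as the node IP (it can otherwise pick the NAT one). A sketch of pinning it to the host-only address, assuming the standard Ubuntu/kubeadm environment file (adjust the IP per node):

    # /etc/default/kubelet
    KUBELET_EXTRA_ARGS=--node-ip=192.168.100.12   # .21 on worker1, .22 on worker2
    # then: sudo systemctl daemon-reload && sudo systemctl restart kubelet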

All nodes use:

  • kubectl, kubeadm, kubelet v1.32.2
  • containerd for the runtime.

    • systemd cgroup driver enabled: OK
    • pause image updated from pause:3.8 to pause:3.10: OK (see the containerd snippet after this list)
  • hostname set

    • cp cat /etc/hosts
      127.0.0.1 localhost
      192.168.100.12 k8scp cp
      cat /etc/hostname
      cp

    • worker1 cat /etc/hosts
      127.0.0.1 localhost
      192.168.100.12 k8scp
      192.168.100.21 worker1
      cat /etc/hostname
      worker1

    • worker2 cat /etc/hosts
      127.0.0.1 localhost
      192.168.100.12 k8scp
      192.168.100.22 worker2
      cat /etc/hostname
      worker2
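
For reference, here is the containerd side of the two changes noted above, as a sketch of the relevant lines in /etc/containerd/config.toml (section names per containerd 1.7's default CRI config; my actual file may differ slightly):

    [plugins."io.containerd.grpc.v1.cri"]
      sandbox_image = "registry.k8s.io/pause:3.10"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
      SystemdCgroup = true
    # then: sudo systemctl restart containerd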

I used config files to initialize my cluster and join my nodes:

  • cp: init-defaults.yaml (I had to save it as .txt to upload it to the forum)
  • worker1: init-defaults.yaml
    Notice: "unsafeSkipCAVerification: true". The host-only network cannot be reached from the internet, so it's safe!
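
The join side boils down to roughly this, a sketch reconstructed from the notes above (kubeadm v1beta4 API; the real token and CA hash are elided):

    apiVersion: kubeadm.k8s.io/v1beta4
    kind: JoinConfiguration
    discovery:
      bootstrapToken:
        apiServerEndpoint: k8scp:6443
        token: <bootstrap-token>           # elided
        unsafeSkipCAVerification: true     # host-only network is not internet-reachable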

Comments

  • fasmy
    fasmy Posts: 26
    edited March 7

    Here is my status before joining worker nodes:
    kubectl get pods -A

    NAMESPACE     NAME                              READY   STATUS    RESTARTS      AGE
    kube-system   cilium-envoy-dsjhg                1/1     Running   2 (13m ago)   5d19h
    kube-system   cilium-f559g                      1/1     Running   2 (13m ago)   5d19h
    kube-system   cilium-operator-f4589588d-w9jpf   1/1     Running   2 (13m ago)   5d19h
    kube-system   coredns-668d6bf9bc-kkpgn          1/1     Running   2 (13m ago)   5d21h
    kube-system   coredns-668d6bf9bc-nl4mk          1/1     Running   2 (13m ago)   5d21h
    kube-system   etcd-cp                           1/1     Running   3 (13m ago)   5d21h
    kube-system   kube-apiserver-cp                 1/1     Running   3 (13m ago)   5d21h
    kube-system   kube-controller-manager-cp        1/1     Running   3 (13m ago)   5d21h
    kube-system   kube-proxy-vr2js                  1/1     Running   3 (13m ago)   5d21h
    kube-system   kube-scheduler-cp                 1/1     Running   3 (13m ago)   5d21h

    Here is my status after joining worker node 1 (notice the Init:0/6, it will crash-loop or Init:Error):
    kubectl get pods -A

    NAMESPACE     NAME                              READY   STATUS     RESTARTS      AGE
    kube-system   cilium-envoy-dsjhg                1/1     Running    2 (18m ago)   5d19h
    kube-system   cilium-envoy-qt4j2                1/1     Running    0             3m53s
    kube-system   cilium-f559g                      1/1     Running    2 (18m ago)   5d19h
    kube-system   cilium-operator-f4589588d-w9jpf   1/1     Running    2 (18m ago)   5d19h
    kube-system   cilium-vlpq5                      0/1     Init:0/6   2 (60s ago)   3m53s
    kube-system   coredns-668d6bf9bc-kkpgn          1/1     Running    2 (18m ago)   5d21h
    kube-system   coredns-668d6bf9bc-nl4mk          1/1     Running    2 (18m ago)   5d21h
    kube-system   etcd-cp                           1/1     Running    3 (18m ago)   5d21h
    kube-system   kube-apiserver-cp                 1/1     Running    3 (18m ago)   5d21h
    kube-system   kube-controller-manager-cp        1/1     Running    3 (18m ago)   5d21h
    kube-system   kube-proxy-dtvrn                  1/1     Running    0             3m53s
    kube-system   kube-proxy-vr2js                  1/1     Running    3 (18m ago)   5d21h
    kube-system   kube-scheduler-cp                 1/1     Running    3 (18m ago)   5d21h

    UPDATE:
    kube-system   cilium-vlpq5   0/1   Init:Error   2 (60s ago)   3m53s

  • fasmy
    fasmy Posts: 26

    Here is the description of the failing pod!
    Notice the event:

    Warning BackOff 57s (x14 over 6m58s) kubelet Back-off restarting failed container config in pod cilium-vlpq5_kube-system(29d95fb8-38eb-4d33-9091-287827c2e73b)
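
    To dig further, the failing init container's logs can be pulled directly (it is named config per the event above):

    kubectl -n kube-system logs cilium-vlpq5 -c config
    kubectl -n kube-system describe pod cilium-vlpq5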

  • fasmy
    fasmy Posts: 26

    I did not set up IPv6, so I wonder why Cilium created IPv6 interfaces...
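
    One way to confirm what Cilium thinks about IPv6 is its ConfigMap (assuming the default name cilium-config used by the standard install):

    kubectl -n kube-system get configmap cilium-config -o yaml | grep -i ipv6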

  • chrispokorni
    chrispokorni Posts: 2,449

    Hi @fasmy,

    Is this content related in any way to the LFS253 - Containers Fundamentals course? This is the forum for LFS253...

    Let me know if this was not your intent, and if so, which course it relates to, so I can move this entire thread to the forum it belongs to.

    Regards,
    -Chris

  • fasmy
    fasmy Posts: 26

    I am very sorry about that. It is related to LFS258; I think I'll update my shortcuts now!

  • chrispokorni
    chrispokorni Posts: 2,449

    Hi @fasmy,

    Thank you for confirming the forum. I moved this thread to the LFS258 forum.

    By taking a quick look at the describe details, I see a few timeout errors.
    It is not yet clear what causes them, so we'd need to evaluate the state of your cluster overall.

    Please provide the following:

    • VM OS distribution and release,
    • CPU count,
    • RAM,
    • disk size (whether dynamically allocated or fully pre-allocated),
    • how many network interfaces and types (nat, bridge, host, ...),
    • IP addresses assigned to each interface,
    • type of VM - local hypervisor or cloud,
    • how firewalls are set up by the hypervisor, or by the cloud VPC with firewalls/SGs, etc.

    In addition, please provide outputs of:

    kubectl get nodes -o wide
    kubectl get pods -A -o wide
    kubectl get svc,ep -A -o wide
    

    Assuming you followed the instructions as provided, that the kubeadm-config.yaml manifest was copied into /root/, and that it was used to initialize the cluster according to the lab guide, what is the value of the podSubnet parameter:

    grep podSubnet /root/kubeadm-config.yaml
    

    Update the path to the cilium-cni.yaml manifest if needed, and provide the value of the cluster-pool-ipv4-cidr parameter, again assuming that the manifest was used to launch the Cilium CNI plugin according to the lab guide:

    grep cluster-pool-ipv4-cidr /home/student/LFS258/SOLUTIONS/s_03/cilium-cni.yaml
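
    These two ranges are expected to line up with each other and must not overlap your node or service networks. For illustration only (use whatever your files actually contain), the pair would look something like:

    # kubeadm-config.yaml (illustrative value)
    networking:
      podSubnet: 10.1.0.0/16
    # cilium-cni.yaml, in the cilium-config ConfigMap data (illustrative value, kept in sync)
    cluster-pool-ipv4-cidr: "10.1.0.0/16"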
    

    Regards,
    -Chris

  • fasmy
    fasmy Posts: 26
    edited March 8
    • VM OS distribution and release: Oracle VirtualBox → ubuntu-24.04.1-live-server-amd64.iso
    • CPU count: 2
    • RAM: 4096 MB
    • disk size: 12 GB, Dynamically allocated differencing storage
    • how many network interfaces and types:
      VirtualBox config:

      • Adapter 1: default NAT (port forwarding: kubeAPI 6443→6443, ssh 222X→22, replace X with {cp=2, worker1=3, worker2=4})
      • Adapter 2: host-only adapter (created in VirtualBox Networks) on 192.168.100.1/24, DHCP disabled (see the VBoxManage sketch at the end of this post)

      cp, worker1, and worker2 interfaces (before cilium):

      • lo (loopback) 127.0.0.1/8
      • enp0s3 (NAT) 10.0.2.15/24 Note: cp, worker1, and worker2 use the same address.
      • enp0s8 (Host-Only) 192.168.100.XX
        • cp → 12 (the new cp; previously 11)
        • worker1 → 21
        • worker2 → 22

      cp interfaces (after cilium):

      • Please find the ip a dump in the above-provided file cp-ip-tables.txt
      • Note: only cp has new interfaces. No changes on the worker nodes.
      • Cilium uses IPv6! I didn't configure my network to use IPv6. The issue most likely lies around here. I might have to rebuild my VMs (from snapshot). But before starting over, I would appreciate confirmation that I am not misguided. ;)
    • IP addresses assigned to each interface
      I believe I had trouble understanding the intricacies of Kubernetes networking. As I understand it, there are three address ranges that must be set up without collisions, but I couldn't name them. I know I used 192.168.100.X to connect my nodes to the kube-apiserver, as seen in init-defaults.txt (file above), but I wonder where I could configure the other two ranges. In the network plugin, I suppose.

    • type of VM: local hypervisor (VirtualBox)

    • how are firewalls set up?: None! I have no firewalls; I double-checked.
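
    For completeness, I created the adapters in the VirtualBox GUI; the command-line equivalent would be roughly this sketch (VM name and the ssh host port are illustrative, per the table above):

    VBoxManage hostonlyif create                           # creates vboxnet0
    VBoxManage hostonlyif ipconfig vboxnet0 --ip 192.168.100.1 --netmask 255.255.255.0
    VBoxManage modifyvm "cp" --nic2 hostonly --hostonlyadapter2 vboxnet0
    VBoxManage modifyvm "cp" --natpf1 "kubeapi,tcp,,6443,,6443"
    VBoxManage modifyvm "cp" --natpf1 "ssh,tcp,,2222,,22"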
  • fasmy
    fasmy Posts: 26
    kubectl get nodes -o wide
        NAME      STATUS     ROLES           AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
        cp        Ready      control-plane   7d    v1.32.2   192.168.100.12   <none>        Ubuntu 24.04.1 LTS   6.8.0-54-generic   containerd://1.7.24
        worker1   NotReady   <none>          27h   v1.32.2   192.168.100.21   <none>        Ubuntu 24.04.1 LTS   6.8.0-54-generic   containerd://1.7.24
    
  • fasmy
    fasmy Posts: 26
    kubectl get pods -A -o wide
        NAMESPACE     NAME                              READY   STATUS                  RESTARTS        AGE     IP               NODE      NOMINATED NODE   READINESS GATES
        kube-system   cilium-envoy-dsjhg                1/1     Running                 3 (6h27m ago)   6d23h   192.168.100.12   cp        <none>           <none>
        kube-system   cilium-envoy-qt4j2                1/1     Running                 1 (6h34m ago)   27h     192.168.100.21   worker1   <none>           <none>
        kube-system   cilium-f559g                      1/1     Running                 3 (6h27m ago)   6d23h   192.168.100.12   cp        <none>           <none>
        kube-system   cilium-operator-f4589588d-w9jpf   1/1     Running                 3 (6h27m ago)   6d23h   192.168.100.12   cp        <none>           <none>
        kube-system   cilium-vlpq5                      0/1     Init:CrashLoopBackOff   14 (6h1m ago)   27h     192.168.100.21   worker1   <none>           <none>
        kube-system   coredns-668d6bf9bc-kkpgn          1/1     Running                 3 (6h27m ago)   7d      10.0.0.194       cp        <none>           <none>
        kube-system   coredns-668d6bf9bc-nl4mk          1/1     Running                 3 (6h27m ago)   7d      10.0.0.209       cp        <none>           <none>
        kube-system   etcd-cp                           1/1     Running                 4 (6h27m ago)   7d      192.168.100.12   cp        <none>           <none>
        kube-system   kube-apiserver-cp                 1/1     Running                 4 (6h27m ago)   7d      192.168.100.12   cp        <none>           <none>
        kube-system   kube-controller-manager-cp        1/1     Running                 4 (6h27m ago)   7d      192.168.100.12   cp        <none>           <none>
        kube-system   kube-proxy-dtvrn                  1/1     Running                 1 (6h34m ago)   27h     192.168.100.21   worker1   <none>           <none>
        kube-system   kube-proxy-vr2js                  1/1     Running                 4 (6h27m ago)   7d      192.168.100.12   cp        <none>           <none>
        kube-system   kube-scheduler-cp                 1/1     Running                 4 (6h27m ago)   7d      192.168.100.12   cp        <none>           <none>
    
  • fasmy
    fasmy Posts: 26
    kubectl get svc,ep -A -o wide
        NAMESPACE     NAME                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE     SELECTOR
        default       service/kubernetes     ClusterIP   10.96.0.1       <none>        443/TCP                  7d      <none>
        kube-system   service/cilium-envoy   ClusterIP   None            <none>        9964/TCP                 6d23h   k8s-app=cilium-envoy
        kube-system   service/hubble-peer    ClusterIP   10.107.23.234   <none>        443/TCP                  6d23h   k8s-app=cilium
        kube-system   service/kube-dns       ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP,9153/TCP   7d      k8s-app=kube-dns
    
        NAMESPACE     NAME                     ENDPOINTS                                               AGE
        default       endpoints/kubernetes     192.168.100.12:6443                                     7d
        kube-system   endpoints/cilium-envoy   192.168.100.12:9964,192.168.100.21:9964                 6d23h
        kube-system   endpoints/hubble-peer    192.168.100.12:4244                                     6d23h
        kube-system   endpoints/kube-dns       10.0.0.194:53,10.0.0.209:53,10.0.0.194:53 + 3 more...   7d
    
  • fasmy
    fasmy Posts: 26

    There is no podSubnet in my config file. I initialized with kubeadm init --config ~/config/init-defaults.yaml.
    Find above: init-defaults.txt

    grep podSubnet ~/config/init-defaults.yaml
    # Nothing
    

    I did configure controlPlaneEndpoint: k8scp:6443,
    and the InitConfiguration's localAPIEndpoint {advertiseAddress: 192.168.100.12, bindPort: 6443}.
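
    So the relevant part of init-defaults.yaml boils down to something like this (a sketch reconstructed from the values above; the complete file is in the attached init-defaults.txt):

    apiVersion: kubeadm.k8s.io/v1beta4
    kind: InitConfiguration
    localAPIEndpoint:
      advertiseAddress: 192.168.100.12
      bindPort: 6443
    ---
    apiVersion: kubeadm.k8s.io/v1beta4
    kind: ClusterConfiguration
    kubernetesVersion: v1.32.2          # assumed to match the installed packages
    controlPlaneEndpoint: k8scp:6443
    # no networking.podSubnet is set, hence the empty grep above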

  • fasmy
    fasmy Posts: 26
    edited March 8

    I followed Cilium's default installation (Generic):
    https://docs.cilium.io/en/stable/gettingstarted/k8s-install-default/

    Sorry for the piecemeal answers; the forum's spam protection kept going off on me and preventing me from posting. It was triggered by the Cilium command lines.

  • fasmy
    fasmy Posts: 26

    As far as I understand:

    My pods communicate IP to IP: 192.168.100.12 (cp), 192.168.100.21 (worker1).

    Within this network, the network plugin, Cilium, uses a virtual (service) network where the API server is assigned the IP
    10.96.0.1 and is reached on port 443 (instead of 6443!).
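
    I can confirm that mapping from the cp (port 443 on the ClusterIP forwards to 6443 on the real endpoint):

    kubectl get svc kubernetes -o wide
    kubectl get endpoints kubernetes     # should show 192.168.100.12:6443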

    But for some reason, Cilium cannot connect to the API server from the worker node. The Cilium pod on the cp works:

    kubectl get pods -n kube-system -o wide
    
    NAME                              READY   STATUS                  RESTARTS       AGE     IP               NODE      NOMINATED NODE   READINESS GATES
    ... #  ↓↓↓  cp cilium pod  ↓↓↓
    cilium-f559g                      1/1     Running                 6 (14m ago)    7d21h   192.168.100.12   cp        <none>           <none>
    ... #   ↓↓↓  worker1 cilium pod  ↓↓↓
    cilium-x6v75                      0/1     Init:CrashLoopBackOff   9 (113s ago)   33m     192.168.100.21   
    ...
    
    kubectl describe pod -n kube-system cilium-x6v75
    
    ...
    Node:                 worker1/192.168.100.21
    ...
    Status:               Pending
    IP:                   192.168.100.21
    IPs:
      IP:           192.168.100.21
    Controlled By:  DaemonSet/cilium
    Init Containers:
      config:
        Container ID:  containerd://4e3a2d50a0a18b79f60d4730ab4572bd23390407f4c9d28055ada37d15047ae6
        Image:         quay.io/cilium/cilium:v1.17.1@sha256:8969bfd9c87cbea91e40665f8ebe327268c99d844ca26d7d12165de07f702866
        Image ID:      quay.io/cilium/cilium@sha256:8969bfd9c87cbea91e40665f8ebe327268c99d844ca26d7d12165de07f702866
        Port:          <none>
        Host Port:     <none>
        Command:
          cilium-dbg
          build-config
        State:       Waiting
          Reason:    CrashLoopBackOff
        Last State:  Terminated
          Reason:    Error
          Message:   Running
    2025/03/09 16:13:35 INFO Starting hive
    time="2025-03-09T16:13:35.42543063Z" level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
    time="2025-03-09T16:14:10.447745411Z" level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
    time="2025-03-09T16:14:40.46315002Z" level=error msg="Unable to contact k8s api-server" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" ipAddr="https://10.96.0.1:443" subsys=k8s-client
    2025/03/09 16:14:40 ERROR Start hook failed function="client.(*compositeClientset).onStart (k8s-client)" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout"
    2025/03/09 16:14:40 ERROR Start failed error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" duration=1m5.037880326s
    2025/03/09 16:14:40 INFO Stopping
    Error: Build config failed: failed to start: Get "https://10.96.0.1:443/api/v1/namespaces/kube-system": dial tcp 10.96.0.1:443: i/o timeout
    
    ...
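
    Given that i/o timeout, the next thing to check from worker1 is whether the service VIP and the real API endpoint are reachable at all (assuming curl is available on the node):

    # run on worker1
    curl -k https://192.168.100.12:6443/version   # direct to the API server
    curl -k https://10.96.0.1:443/version         # via the service VIP that Cilium times out on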
    
  • chrispokorni
    chrispokorni Posts: 2,449

    Hi @fasmy,

    The describe of the cilium pod shows several timeouts. However, from your cluster's details it seems you have underprovisioned your VirtualBox VMs. For an operational cluster on VBox, each VM needs 2 CPU cores, 8 GB RAM, and a 20+ GB vdisk, fully pre-allocated. A dynamically provisioned disk causes your kubelets to panic, and as a result they do not deploy critical workloads on the nodes.
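
    You can verify what each VM actually has with standard tools on each node:

    nproc      # CPU cores
    free -h    # RAM
    df -h /    # disk size and free space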

    However, I cannot comment on the cluster bootstrapping and the Cilium CNI installation. From your detailed notes it seems that you deviated from the lab guide and disregarded the installation, configuration, and bootstrapping steps provided to you. A custom install is outside the course's scope, and troubleshooting such a custom installation is out of scope for the course's support staff. This forum is dedicated to the content of the course, both lectures and lab exercises. I highly recommend following the lab guide in order to maximize your chances of successfully completing the lab exercises.

    Regards,
    -Chris

  • fasmy
    fasmy Posts: 26

    Understood! I'll do the course install.

    I am glad I did a full install based on the official, up-to-date documentation, as I learned a lot troubleshooting it on my own (Linux, Kubernetes). However, I am indeed getting a bit tired.

    I know this very last hurdle has everything to do with Cilium. I used another CNI and it works like a charm.

    I wanted to ensure I could do it on my own, with whatever setup I might need. Tutorials designed to work on the first try often leave you stuck in practice. Now I know... and I did it! :)
