
Lab 3.1 POD cilium-XXXXX always with STATUS: CrashLoopBackOff (Ubuntu 22.04)

Hi there,
I am having difficulties with the installation of Kubernetes on Ubuntu 22.04 LTS.
Everything went fine until step 23. I followed all the previous steps and used the configuration files provided in the course tarball.

After step 23:

root@ubuntu:~# kubeadm init --config=kubeadm-config.yaml --upload-certs | tee kubeadm-init.out

the pods look like this:

ubuntu@ubuntu:~$ kubectl -n kube-system get pods
NAME                               READY   STATUS             RESTARTS        AGE
cilium-6d8fr                       0/1     CrashLoopBackOff   9 (4m40s ago)   26m
cilium-operator-788c7d7585-n9f26   1/1     Running            0               26m
cilium-operator-788c7d7585-rhqtx   0/1     Pending            0               26m
coredns-5d78c9869d-7sj44           0/1     Pending            0               29m
coredns-5d78c9869d-mbq28           0/1     Pending            0               29m
etcd-ubuntu                        1/1     Running            0               29m
kube-apiserver-ubuntu              1/1     Running            0               29m
kube-controller-manager-ubuntu     1/1     Running            0               29m
kube-proxy-rwgr4                   1/1     Running            0               29m
kube-scheduler-ubuntu              1/1     Running            0               29m

In the description of cilium-6d8fr I am getting the following events:

ubuntu@ubuntu:~$ kubectl -n kube-system describe pod cilium-6d8fr
...
  Warning  Unhealthy  35m (x3 over 36m)    kubelet            Startup probe failed: Get "http://127.0.0.1:9879/healthz": dial tcp 127.0.0.1:9879: connect: connection refused
  Warning  BackOff    77s (x178 over 35m)  kubelet            Back-off restarting failed container cilium-agent in pod cilium-6d8fr_kube-system(33fd142a-b736-4a08-b8c2-4d1bc18f9f0a)

Can somebody explain or give me directions on what to analyse next? Currently I am out of ideas and cannot find anything useful on the internet about this issue.

Thanks

Comments

  • Hi @okanovic.eldin,

    Ubuntu 22.04 LTS is known to interfere with the networking configuration of the Kubernetes components.
    Please use the recommended OS - Ubuntu 20.04 LTS.

    Regards,
    -Chris

  • Hi @chrispokorni

    Thank you for your comment.

    I forgot to mention in my question that I am testing on Ubuntu 22.04 LTS with a minimal kernel configuration, and that was the problem all along. It does not meet the Cilium requirements for the Linux kernel. After analysing the cilium pod logs I found out that most of the kernel modules Cilium needs are missing.

    So I had to install the generic kernel packages, which bring in the missing kernel modules:

    sudo apt-get update
    sudo apt install linux-generic
    sudo apt install linux-generic-hwe-22.04
    

    And TADA:

    ubuntu@ubuntu:~$ kubectl -n kube-system get pods
    NAME                               READY   STATUS    RESTARTS      AGE
    cilium-operator-788c7d7585-k7lbd   0/1     Pending   0             16m
    cilium-operator-788c7d7585-xdkvq   1/1     Running   0             16m
    cilium-vdblt                       1/1     Running   0             16m
    coredns-5d78c9869d-kdnsm           1/1     Running   0             24m
    coredns-5d78c9869d-vntqd           1/1     Running   0             24m
    etcd-ubuntu                        1/1     Running   1 (18m ago)   24m
    kube-apiserver-ubuntu              1/1     Running   1 (18m ago)   24m
    kube-controller-manager-ubuntu     1/1     Running   1 (18m ago)   24m
    kube-proxy-dgkh9                   1/1     Running   1 (18m ago)   24m
    kube-scheduler-ubuntu              1/1     Running   1 (18m ago)   24m
    

    Just to avoid confusion: I reinstalled Kubernetes from scratch, so the pod names above are different from those in my first post.
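
    For anyone hitting the same symptom: a rough way to check the running kernel before reinstalling (the authoritative list of required options is in the Cilium system requirements documentation; the ones below are just a representative subset):

    # Check a few of the BPF-related kernel options Cilium expects;
    # anything reported as "missing" points to a too-minimal kernel build.
    for opt in CONFIG_BPF CONFIG_BPF_SYSCALL CONFIG_NET_CLS_BPF \
               CONFIG_BPF_JIT CONFIG_NET_CLS_ACT CONFIG_NET_SCH_INGRESS; do
      grep -q "^${opt}=[ym]" "/boot/config-$(uname -r)" || echo "${opt} missing"
    done

    # After installing linux-generic and rebooting, confirm the new kernel is active:
    uname -r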

    Regards,
    Eldin

  • yorgji

    Hello,
    I got a similar problem even though I'm running Ubuntu 20.04.6 LTS.
    Any help would be really appreciated.

    cilium-ml8lj 0/1 Init:0/6 6 (52s ago) 9m

  • chrispokorni

    Hi @yorgji,

    It would be helpful if you also shared details of your lab environment. What cloud or hypervisor is hosting your VMs, size of the VMs (CPU, RAM, disk - fully allocated or dynamically allocated), how many network interfaces per VM (what type), how is the firewall filtering inbound traffic to the VMs? What are the private IP addresses of the VMs?

    Please run the following commands and supply their outputs:

    kubectl get nodes -o wide
    kubectl get pods -A -o wide
    kubectl -n kube-system describe pod cilium-ml8lj
    

    Regards,
    -Chris

  • yorgji

    Hello @chrispokorni,
    Thanks for replying so quickly.
    Some of the info that you requested is in the body of this message; the rest can be found in the attached file.

    [PDF attachment removed for cybersecurity concerns]

    This is the vagrant file that I'm using.

    Vagrant.configure("2") do |config|
      # Define common settings for both VMs
      config.vm.box = "ubuntu/focal64"  # Ubuntu 20.04
      config.vm.provider "virtualbox" do |vb|
        vb.memory = "8192"
        vb.cpus = 2
      end
    
      # Control Plane Node
      config.vm.define "cp-node" do |cp|
        cp.vm.hostname = "cp-node"
        cp.vm.network "private_network", type: "static", ip: "192.168.56.10"
        cp.vm.provider "virtualbox" do |vb|
        end
        cp.vm.provision "shell", inline: <<-SHELL
          sudo swapoff -a
          sudo sed -i '/ swap / s/^/#/' /etc/fstab
          sudo systemctl disable --now ufw
          sudo systemctl stop apparmor
          sudo systemctl disable --now apparmor
        SHELL
      end
    
      # Worker Node
      config.vm.define "worker-node" do |worker|
        worker.vm.hostname = "worker-node"
        worker.vm.network "private_network", type: "static", ip: "192.168.56.11"
        worker.vm.provision "shell", inline: <<-SHELL
          sudo swapoff -a
          sudo sed -i '/ swap / s/^/#/' /etc/fstab
          sudo systemctl disable --now ufw
          sudo systemctl stop apparmor
          sudo systemctl disable --now apparmor
        SHELL
      end
    end
    

    Hypervisor: VirtualBox
    2 CPUs
    8 GB RAM
    40 GB is the default storage of the focal64 image (probably static)
    192.168.56.10 is the IP of the control plane node
    192.168.56.11 is the IP of the worker node

    !!! I had to edit the k8scp.sh. I added "--apiserver-advertise-address=192.168.56.10" to the kubeadm init command, otherwise the worker node could not join. And then I manually copied the .kube/config from the cp-node to the worker in order to be able to use the API.

    Cilium doesn't need to be installed on every node?

    Let me know if you need more info.

  • yorgji

    Thanks in advance. I posted it twice by mistake.

  • chrispokorni

    Hi @yorgji,

    I was unable to open the attached pdf document; please provide a txt file instead - it is a safer attachment format. Since vagrant is a personal choice, feel free to make adjustments as necessary to implement the following recommendations.

    On the VirtualBox hypervisor, please ensure that:

    • each VM is provisioned with only one bridged network adapter, with promiscuous mode enabled to allow all incoming traffic
    • the 40GB vdisk is fully allocated, and not dynamic. All this needs to be explicitly declared. Assumptions with "probably" do not lead to successful outcomes

    Editing the kubeadm-config.yaml manifest:

    • the controlPlaneEndpoint entry is recommended over the --apiserver-advertise-address option due to its flexibility when it comes to HA cluster configuration in a later chapter. Ensure the .../hosts file is correctly populated with control plane VM IP address and the k8scp alias (same entry on both VMs)
    • in order to avoid IP range overlapping, change the subnet value as such podSubnet: 10.200.0.0/16

    Editing the cilium-cni.yaml manifest before launching the Cilium CNI:

    • around line 222 update the cidr value as such cluster-pool-ipv4-cidr: "10.200.0.0/16"
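
    Putting the above together, a rough sketch of these edits (file names as referenced in the lab guide; substitute your own control plane VM's private IP for 192.168.56.10, and adjust if your copies of the manifests are formatted differently):

    # /etc/hosts on BOTH VMs - control plane private IP plus the k8scp alias:
    echo "192.168.56.10 k8scp" | sudo tee -a /etc/hosts

    # kubeadm-config.yaml - control plane endpoint and pod subnet:
    #   controlPlaneEndpoint: "k8scp:6443"
    #   networking:
    #     podSubnet: 10.200.0.0/16

    # cilium-cni.yaml - pod IP pool (the cluster-pool-ipv4-cidr entry around line 222):
    sed -i 's|cluster-pool-ipv4-cidr:.*|cluster-pool-ipv4-cidr: "10.200.0.0/16"|' cilium-cni.yaml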

    Cilium doesn't need to be installed on every node?

    No.

    And then I manually copied the .kube/config from the cp-node to the worker in order to be able to use the API.

    Not a recommended step, but not a showstopper either.

    Regards,
    -Chris

  • yorgji

    Hello @chrispokorni,
    Thanks for the help, but it seems way too complicated.
    Is there a tutorial for setting up the environment locally using VMs instead of using a cloud provider?

    1. kubectl get nodes -o wide
      NAME          STATUS     ROLES           AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
      cp-node       Ready      control-plane   18m    v1.31.1   10.0.2.15     <none>        Ubuntu 20.04.6 LTS   5.4.0-205-generic   containerd://1.7.25
      worker-node   NotReady   <none>          9m9s   v1.31.1   10.0.2.15     <none>        Ubuntu 20.04.6 LTS   5.4.0-205-generic   containerd://1.7.25

    2. kubectl get pods -A -o wide
      NAMESPACE     NAME                               READY   STATUS     RESTARTS        AGE     IP           NODE          NOMINATED NODE   READINESS GATES
      kube-system   cilium-bpdkv                       0/1     Init:0/6   6 (46s ago)     9m44s   10.0.2.15    worker-node   <none>           <none>
      kube-system   cilium-envoy-2nnb5                 1/1     Running    1 (4m51s ago)   9m44s   10.0.2.15    worker-node   <none>           <none>
      kube-system   cilium-envoy-frtsz                 1/1     Running    0               19m     10.0.2.15    cp-node       <none>           <none>
      kube-system   cilium-kkjtm                       1/1     Running    0               19m     10.0.2.15    cp-node       <none>           <none>
      kube-system   cilium-operator-5c7867ccd5-7zvvd   1/1     Running    0               19m     10.0.2.15    cp-node       <none>           <none>
      kube-system   coredns-7c65d6cfc9-d99tl           1/1     Running    0               19m     10.0.0.30    cp-node       <none>           <none>
      kube-system   coredns-7c65d6cfc9-tpvd5           1/1     Running    0               19m     10.0.0.123   cp-node       <none>           <none>
      kube-system   etcd-cp-node                       1/1     Running    0               19m     10.0.2.15    cp-node       <none>           <none>
      kube-system   kube-apiserver-cp-node             1/1     Running    0               19m     10.0.2.15    cp-node       <none>           <none>
      kube-system   kube-controller-manager-cp-node    1/1     Running    0               19m     10.0.2.15    cp-node       <none>           <none>
      kube-system   kube-proxy-lvkpc                   1/1     Running    0               19m     10.0.2.15    cp-node       <none>           <none>
      kube-system   kube-proxy-m5ntc                   1/1     Running    1 (4m51s ago)   9m44s   10.0.2.15    worker-node   <none>           <none>
      kube-system   kube-scheduler-cp-node             1/1     Running    0               19m     10.0.2.15    cp-node       <none>           <none>

    3. kubectl -n kube-system describe pod cilium-bpdkv
      Name:                 cilium-bpdkv
      Namespace:            kube-system
      Priority:             2000001000
      Priority Class Name:  system-node-critical
      ...
      Normal   Scheduled       10m                    default-scheduler  Successfully assigned kube-system/cilium-bpdkv to worker-node
      Normal   Pulling         10m                    kubelet            Pulling image "quay.io/cilium/cilium:v1.16.1@sha256:0b4a3ab41a4760d86b7fc945b8783747ba27f29dac30dd434d94f2c9e3679f39"
      Normal   Pulled          9m18s                  kubelet            Successfully pulled image "quay.io/cilium/cilium:v1.16.1@sha256:0b4a3ab41a4760d86b7fc945b8783747ba27f29dac30dd434d94f2c9e3679f39" in 33.006s (48.034s including waiting). Image size: 223002645 bytes.
      Warning  BackOff         7m6s                   kubelet            Back-off restarting failed container config in pod cilium-bpdkv_kube-system(aa383527-9bec-4324-9f80-b8b676126a5a)
      Normal   Created         6m51s (x3 over 9m18s)  kubelet            Created container config
      Normal   Started         6m51s (x3 over 9m18s)  kubelet            Started container config
      Normal   Pulled          6m51s (x2 over 8m12s)  kubelet            Container image "quay.io/cilium/cilium:v1.16.1@sha256:0b4a3ab41a4760d86b7fc945b8783747ba27f29dac30dd434d94f2c9e3679f39" already present on machine
      Normal   SandboxChanged  5m17s                  kubelet            Pod sandbox changed, it will be killed and re-created.
      Warning  BackOff         47s (x6 over 4m10s)    kubelet            Back-off restarting failed container config in pod cilium-bpdkv_kube-system(aa383527-9bec-4324-9f80-b8b676126a5a)
      Normal   Pulled          33s (x4 over 5m17s)    kubelet            Container image "quay.io/cilium/cilium:v1.16.1@sha256:0b4a3ab41a4760d86b7fc945b8783747ba27f29dac30dd434d94f2c9e3679f39" already present on machine
      Normal   Created         33s (x4 over 5m16s)    kubelet            Created container config
      Normal   Started         33s (x4 over 5m16s)    kubelet            Started container config

  • chrispokorni

    Hi @yorgji,

    Video guides have only been generated for the AWS and GCP cloud providers.
    Based on the output provided above, your worker node is NotReady, a state which prevents your cluster from fully initializing and deploying any workload. Unfortunately there is no band-aid solution here; a correctly configured cluster is required to move forward and successfully complete this training.

    As far as VirtualBox goes, its easy-to-use interface allows me to quickly provision the two required VMs to the specs described above. Following all the installation and config steps from the lab guide, with the few custom edits mentioned above, should yield a fully functional cluster. I have done this countless times. You only have to build the cluster once, then stop the VMs when not actively working on lab exercises.

    Regards,
    -Chris

  • yorgji

    Hello @chrispokorni,
    So, after I run the k8scp.sh without any modifications:
    1. should I set the controlPlaneEndpoint to 192.168.56.10 (the control plane IP)?
    2. and should the /etc/hosts file on each node contain

    192.168.56.10 k8scp

    192.168.56.10 cp-node

    ?
    3. And then should I update the cluster-pool-ipv4-cidr to cluster-pool-ipv4-cidr: "10.200.0.0/16"? And if yes, in which file?

  • chrispokorni

    Hi @yorgji,

    Here are the details from my earlier response to you, with a minor edit for clarification:

    On the VirtualBox hypervisor, please ensure that:

    • each VM is provisioned with only one bridged network adapter, with promiscuous mode enabled to allow all incoming traffic
    • the 40GB vdisk is fully allocated, and not dynamic. All this needs to be explicitly declared. Assumptions with "probably" do not lead to successful outcomes

    Editing the kubeadm-config.yaml manifest:

    • the controlPlaneEndpoint: "k8scp:6443" entry is recommended over the --apiserver-advertise-address option due to its flexibility when it comes to HA cluster configuration in a later chapter. Ensure the .../hosts file is correctly populated with control plane VM IP address and the k8scp alias (same entry on both VMs)
    • in order to avoid IP range overlapping, change the subnet value as such podSubnet: 10.200.0.0/16

    Editing the cilium-cni.yaml manifest before launching the Cilium CNI:

    • around line 222 update the cidr value as such cluster-pool-ipv4-cidr: "10.200.0.0/16"

    Regards,
    -Chris

  • yorgji

    Hello @chrispokorni ,

    I really appreciate your help.

    I updated the kubeadm init command in the k8scp.sh file ->

    sudo kubeadm init --control-plane-endpoint "k8scp:6443" --pod-network-cidr=10.200.0.0/16
    

    I also updated the cilium install command in the k8scp.sh file ->

    cilium install --version 1.16.1 \
      --set k8sServiceHost=k8scp \
      --set k8sServicePort=6443 \
      --set ipam.operator.clusterPoolIPv4PodCIDRList=10.200.0.0/16
    

    and I updated the /etc/hosts as per your instruction.

    And there is an improvement: both nodes are now Ready, but cilium still reports errors, like this one:

    cilium             cilium-zpzqf    unable to retrieve cilium endpoint information: command failed (pod=kube-system/cilium-zpzqf, container=cilium-agent): unable to upgrade connection: pod does not exist
    

    but it is running:

    cilium-zpzqf   1/1     Running   0               13m
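
    I guess the next things to check are the pod's events, the cilium-agent log, and the overall CNI status. Something like (standard kubectl / cilium-cli commands, not steps from the lab guide):

    # Pod events for the affected agent pod:
    kubectl -n kube-system describe pod cilium-zpzqf

    # Tail of the cilium-agent container log:
    kubectl -n kube-system logs cilium-zpzqf -c cilium-agent --tail=50

    # Overall Cilium health as reported by the CLI:
    cilium status --wait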
    
  • yorgji

    This is the result of kubectl describe pod cilium-zpzqf:

     Warning  Unhealthy       24m (x3 over 24m)      kubelet            Startup probe failed: Get "http://127.0.0.1:9879/healthz": dial tcp 127.0.0.1:9879: connect: connection refused
    
  • yorgji

    It turns out that one of the coredns pods is failing too: Readiness probe failed: HTTP probe failed with statuscode: 503

  • chrispokorni

    Hi @yorgji,

    There are a few things that puzzle me from your most recent comments.
    Please clarify for me what course you are following - LFS258 Kubernetes Fundamentals or LFD259 Kubernetes for Developers?

    This forum is for the LFS258 course, yet some of your remarks hint that you are enrolled in a different course, LFD259. The cluster bootstrapping is different between the two courses, and the troubleshooting help we provide here is very targeted.

    Regards,
    -Chris

  • yorgji

    Hello @chrispokorni,
    You are right, I'm enrolled in LFD259 Kubernetes for Developers.
    Sorry about that, I didn't notice it.

  • chrispokorni

    Hi @yorgji,

    In that case you do not need the k8scp alias entry in any of the hosts files, and you should remove it from any command where you supplied it as an argument. Populating the hosts files on both machines with the cp and worker hostnames and their respective private IP addresses may also help.
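
    A minimal sketch of the hosts entries I mean, using the hostnames and private IPs from the Vagrantfile posted earlier in this thread (adjust to your own VMs). The same two lines go on BOTH machines:

    # Append the cp and worker entries to /etc/hosts on each VM:
    echo "192.168.56.10 cp-node"     | sudo tee -a /etc/hosts
    echo "192.168.56.11 worker-node" | sudo tee -a /etc/hosts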

    Regards,
    -Chris
