
Lab 3.1: Cilium helm install crashloop and no interfaces

Hi there, I'm using Kubernetes 1.28.1. Cilium appears to have some pods working, but one of the operator pods is crash-looping. The Cilium network interfaces aren't visible, which prevents me from progressing to the next step.

Do you have any suggestions for troubleshooting? So far I've tried a Helm install and the provided YAML. The current pod status is below (a few diagnostic commands are sketched after the output):

  kubectl get pods --all-namespaces
  NAMESPACE     NAME                               READY   STATUS     RESTARTS      AGE
  kube-system   cilium-6prb4                       1/1     Running    0             3m46s
  kube-system   cilium-operator-788c4f69bc-52vpc   0/1     Running    2 (84s ago)   3m46s
  kube-system   cilium-operator-788c4f69bc-7m68z   1/1     Running    0             3m46s
  kube-system   cilium-pbl26                       0/1     Init:0/6   0             25s
  kube-system   coredns-5dd5756b68-d5j9z           1/1     Running    0             17m
  kube-system   coredns-5dd5756b68-pv9tb           1/1     Running    0             17m
  kube-system   etcd-k8scp                         1/1     Running    0             17m
  kube-system   kube-apiserver-k8scp               1/1     Running    0             17m
  kube-system   kube-controller-manager-k8scp      1/1     Running    0             17m
  kube-system   kube-proxy-n6nfj                   1/1     Running    0             17m
  kube-system   kube-proxy-trn2d                   1/1     Running    0             4m27s
  kube-system   kube-scheduler-k8scp               1/1     Running    0             17m
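
Commands along these lines should surface the failing init container and the operator error (pod names are the ones from the output above; adjust to whatever kubectl reports in your cluster):

  # Show why the agent pod is stuck in Init (events and init container states)
  kubectl -n kube-system describe pod cilium-pbl26

  # Logs from the crash-looping operator, including its previous restart
  kubectl -n kube-system logs cilium-operator-788c4f69bc-52vpc --previous

  # Logs from a specific init container of the stuck agent pod
  # (<init-container> is a placeholder; the names are listed in the describe output)
  kubectl -n kube-system logs cilium-pbl26 -c <init-container>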


Best Answer

  • Posts: 2,443
    edited May 2024 Answer ✓

    Hi @izzl,

    The lab guide follows a hardware-and-software recipe that allows learners to successfully complete the lab exercises on various infrastructures, whether a major cloud provider (AWS, Azure, GCP, Digital Ocean, IBM Cloud, etc.) or a local hypervisor (VirtualBox, KVM, etc.). Deviating from the recommended ingredients will require the learner to put in additional work to bring the environment to its desired state.

    In your case, the VM sizes do not seem to cause any issues, as they supply more than the CPU, memory, and disk needed for the lab exercises. However, Ubuntu 22.04 LTS is known to cause some networking issues, which is why we still recommend 20.04 LTS - also in sync with the CKA certification exam environment.

    Provided that the recipe is followed from the very beginning and the desired infrastructure is provisioned and configured as recommended by the lab guide, the learner should not have to perform installation or configuration steps that are not covered in the lab guide and that may be outside the training's scope.

    Regards,
    -Chris

Answers

  • Posts: 7
    edited April 2024

    I'm almost 100% sure it's a network issue: the worker node can't talk to the control plane (a quick connectivity check is sketched right after the failing logs). Here are the logs from cilium-operator-788c4f69bc-52vpc:

    level=info msg=Starting subsys=hive
    level=info msg="Started gops server" address="127.0.0.1:9891" subsys=gops
    level=info msg="Start hook executed" duration="289.791µs" function="gops.registerGopsHooks.func1 (cell.go:44)" subsys=hive
    level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
    level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
    level=error msg="Unable to contact k8s api-server" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" ipAddr="https://10.96.0.1:443" subsys=k8s-client
    level=error msg="Start hook failed" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" function="client.(*compositeClientset).onStart" subsys=hive
    level=info msg=Stopping subsys=hive
    level=info msg="Stopped gops server" address="127.0.0.1:9891" subsys=gops
    level=info msg="Stop hook executed" duration="175.964µs" function="gops.registerGopsHooks.func2 (cell.go:51)" subsys=hive
    level=fatal msg="failed to start: Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" subsys=cilium-operator-generic
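
    To confirm it really is the node network rather than Cilium itself, reachability of the API server can be tested from the worker node directly. This is only a sketch: 10.96.0.1:443 is the in-cluster Service address from the logs above, and <control-plane-ip> is a placeholder for the control plane node's address.

    # From the worker node: can the in-cluster API server Service be reached?
    curl -k --connect-timeout 5 https://10.96.0.1:443/version

    # Same check against the API server directly, bypassing the Service IP
    # (replace <control-plane-ip> with the control plane node's address):
    curl -k --connect-timeout 5 https://<control-plane-ip>:6443/version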

    But the other operator, cilium-operator-788c4f69bc-7m68z, is running fine and set up successfully, so I'm unsure why the network issue only affects some replicas:

    # excluded previous successful logs
    level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
    level=info msg="Connected to apiserver" subsys=k8s-client
    level=info msg="Start hook executed" duration=11.366823ms function="client.(*compositeClientset).onStart" subsys=hive
    level=info msg="Start hook executed" duration="10.36µs" function="cmd.registerOperatorHooks.func1 (root.go:159)" subsys=hive
    level=info msg="Waiting for leader election" subsys=cilium-operator-generic
    level=info msg="attempting to acquire leader lease kube-system/cilium-operator-resource-lock..." subsys=klog
    level=info msg="successfully acquired lease kube-system/cilium-operator-resource-lock" subsys=klog
    level=info msg="Leading the operator HA deployment" subsys=cilium-operator-generic
    level=info msg="Start hook executed" duration=11.785021ms function="*api.server.Start" subsys=hive
    # excluded following successful logs
  • Posts: 2,443

    Hi @izzl,

    I'm almost 100% sure it's a network issue

    You are correct. When network-related components fail to start, it is typically because the infrastructure network is improperly configured.

    Have you tried rebooting the VMs? At times a reboot helps to reset or apply the changes needed to enable traffic between components.

    What type of infrastructure is hosting your cluster? Which cloud or local hypervisor? What are the VM sizes (CPU, memory, disk), the guest OS, and the type of network interfaces (host, bridge, NAT, ...)? How is the network set up? What about firewalls or proxies?

    The introductory chapter presents two video guides for configuring infrastructure on GCP and AWS, focusing on the VPC network configuration, firewall rules, and security groups needed to enable communication between all Kubernetes cluster components. Similar configuration options should be applied on other cloud providers and on-premises hypervisors.
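
    As an illustration only (the network name and source range below are placeholders, not values from the lab guide), a GCP firewall rule that allows all traffic between instances on the same VPC would look something like this:

    # Example only: open all protocols between nodes on the lab VPC.
    # "k8s-vpc" and the source range are placeholders for your own network.
    gcloud compute firewall-rules create k8s-allow-internal \
      --network=k8s-vpc \
      --direction=INGRESS \
      --allow=all \
      --source-ranges=10.2.0.0/16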

    Regards,
    -Chris

  • Posts: 7
    edited April 2024

    I'm hosting on Ubuntu 22.04 LTS in GCP on e2-standard-8 instances, each with 8 CPUs, 32 GB of memory, and a 100 GB disk. I have one control plane node and one worker node.

    I've provisioned them on the same VPC and got it working once by installing Helm and updating Cilium. What might I be doing wrong? I've also tried the provided cilium-cni.yaml, the helm template command for Cilium, and the latest Cilium. I'm happy to share my scripts if that helps.

  • Posts: 7
    edited April 2024

    The only way I was able to get this to work was to install with Helm and pin the API server host and port (a quick verification is sketched after the commands):

    helm repo add cilium https://helm.cilium.io/
    helm repo update
    helm upgrade --install cilium cilium/cilium --version 1.14.1 \
      --namespace kube-system \
      --set kubeProxyReplacement=strict \
      --set k8sServiceHost=$internal_ip \
      --set k8sServicePort=6443
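
    To verify the rollout afterwards, something along these lines should show the agent DaemonSet and operator pods becoming Ready (standard kubectl commands; the DaemonSet name cilium is the chart default):

    # Wait for the Cilium agent DaemonSet to finish rolling out
    kubectl -n kube-system rollout status daemonset/cilium

    # Sanity-check that the agent and operator pods are Running and Ready
    kubectl -n kube-system get pods | grep cilium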
  • Posts: 7

    You were right. I went back to the start and made sure all steps were followed in order. I then automated everything in a bash script, and it works fine now when I run it.

  • I had somewhat similar network issues (on Digital Ocean). They were resolved once I changed the OS from Ubuntu 22.04 to 20.04... thank you @chrispokorni

    PS: It would probably be beneficial to add a note to the lab PDF about what you mentioned above.

  • Posts: 2,443

    Hi @ezequiel.rodrigue,

    The Installation and Configuration page of Ch 3 in the lab guide does mention that the labs were compiled on Ubuntu 20.04. Sometimes, though not always, deviating from an explicitly stated version results in discrepancies or misbehaviors. In these scenarios especially, compatibility between the components of the software stack should be carefully checked to ensure the environment is properly installed and configured.

    Regards,
    -Chris

  • Hi @chrispokorni,

    Yes, I noticed that the OS version was specified in the guide. My only suggestion was to add a note explaining what you said above:

    [...] However, Ubuntu 22.04 LTS is known to cause some networking issues, which is why we still recommend 20.04 LTS
