Lab 3.1: Cilium helm install crashloop and no interfaces

Hi there, I'm using Kubernetes 1.28.1 with Cilium. Some pods appear to be working, but one is in a crash loop. The network interfaces aren't visible, which prevents me from progressing to the next step.

Do you have any suggestions for troubleshooting? So far I've tried helm install and the provided YAML:

kubectl get pods --all-namespaces
NAMESPACE     NAME                               READY   STATUS     RESTARTS      AGE
kube-system   cilium-6prb4                       1/1     Running    0             3m46s
kube-system   cilium-operator-788c4f69bc-52vpc   0/1     Running    2 (84s ago)   3m46s
kube-system   cilium-operator-788c4f69bc-7m68z   1/1     Running    0             3m46s
kube-system   cilium-pbl26                       0/1     Init:0/6   0             25s
kube-system   coredns-5dd5756b68-d5j9z           1/1     Running    0             17m
kube-system   coredns-5dd5756b68-pv9tb           1/1     Running    0             17m
kube-system   etcd-k8scp                         1/1     Running    0             17m
kube-system   kube-apiserver-k8scp               1/1     Running    0             17m
kube-system   kube-controller-manager-k8scp      1/1     Running    0             17m
kube-system   kube-proxy-n6nfj                   1/1     Running    0             17m
kube-system   kube-proxy-trn2d                   1/1     Running    0             4m27s
kube-system   kube-scheduler-k8scp               1/1     Running    0             17m

Best Answer

  • chrispokorni
    chrispokorni Posts: 2,274
    edited May 9 Answer ✓

    Hi @izzl,

    The lab guide follows a hardware+software recipe that allows learners to successfully complete the lab exercises on various infrastructures, that is either a major cloud provider (AWS, Azure, GCP, Digital Ocean, IBM Cloud, etc.), or a local hypervisor (VirtualBox, KVM, etc). Deviating from the recommended ingredients will require the learner to put in additional work to bring the environment to its desired state.

    In your case, the VM sizes do not seem to cause any issues, as they supply more than the necessary CPU, mem and disk that would be needed for the lab exercises. However, Ubuntu 22.04 LTS is known to cause some networking issues, that is why we still recommend 20.04 LTS - which is also in sync with the CKA certification exam environment.
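A quick way to confirm which release a node is running before starting the labs is to read /etc/os-release. The sketch below is illustrative only: the function name check_lab_os is hypothetical, and the value 20.04 is taken from the recommendation above.

```shell
# Hypothetical helper: warn when a node is not running the recommended release.
check_lab_os() {
    # Optional argument: path to an os-release file (defaults to the system one).
    local os_release_file="${1:-/etc/os-release}"
    local recommended="20.04"
    local id version
    # Source the file in a subshell so its variables do not leak into the caller.
    id=$( . "$os_release_file" && echo "$ID" )
    version=$( . "$os_release_file" && echo "$VERSION_ID" )
    if [ "$id" = "ubuntu" ] && [ "$version" = "$recommended" ]; then
        echo "OK: $id $version"
    else
        echo "WARN: $id $version (lab guide recommends ubuntu $recommended)"
        return 1
    fi
}
```

Running check_lab_os on each VM (control plane and worker) shows whether the environment matches the one the labs were compiled on.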

    Provided that the recipe is followed from the very beginning and the desired infrastructure is provisioned and configured as recommended by the lab guide, the learner should not have to perform installation or configuration that is not covered in the lab guide, which may be outside of the training's scope.

    Regards,
    -Chris

Answers

  • izzl
    izzl Posts: 7
    edited April 17

    I'm almost 100% sure it's a network issue: the worker pod is unable to talk to the control plane. Here are the logs from cilium-operator-788c4f69bc-52vpc:

    level=info msg=Starting subsys=hive
    level=info msg="Started gops server" address="127.0.0.1:9891" subsys=gops
    level=info msg="Start hook executed" duration="289.791µs" function="gops.registerGopsHooks.func1 (cell.go:44)" subsys=hive
    level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
    level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
    level=error msg="Unable to contact k8s api-server" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" ipAddr="https://10.96.0.1:443" subsys=k8s-client
    level=error msg="Start hook failed" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" function="client.(*compositeClientset).onStart" subsys=hive
    level=info msg=Stopping subsys=hive
    level=info msg="Stopped gops server" address="127.0.0.1:9891" subsys=gops
    level=info msg="Stop hook executed" duration="175.964µs" function="gops.registerGopsHooks.func2 (cell.go:51)" subsys=hive
    level=fatal msg="failed to start: Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" subsys=cilium-operator-generic
    

    But the other operator, cilium-operator-788c4f69bc-7m68z, is running fine and was set up successfully, so I'm unsure why the network issue only exists on the other replica:

    # excluded previous successful logs
    level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
    level=info msg="Connected to apiserver" subsys=k8s-client
    level=info msg="Start hook executed" duration=11.366823ms function="client.(*compositeClientset).onStart" subsys=hive
    level=info msg="Start hook executed" duration="10.36µs" function="cmd.registerOperatorHooks.func1 (root.go:159)" subsys=hive
    level=info msg="Waiting for leader election" subsys=cilium-operator-generic
    level=info msg="attempting to acquire leader lease kube-system/cilium-operator-resource-lock..." subsys=klog
    level=info msg="successfully acquired lease kube-system/cilium-operator-resource-lock" subsys=klog
    level=info msg="Leading the operator HA deployment" subsys=cilium-operator-generic
    level=info msg="Start hook executed" duration=11.785021ms function="*api.server.Start" subsys=hive
    # excluded following successful logs
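
One plausible explanation for a single failing replica is that the two operator pods run on different nodes and only one node can reach the service IP 10.96.0.1. The sketch below is a hedged way to check that, not a definitive diagnosis: the function names are hypothetical, and the default address is the one shown in the logs above.

```shell
# List kube-system pods together with the node each one is scheduled on.
pods_by_node() {
    kubectl get pods --namespace kube-system \
        --output custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,PHASE:.status.phase'
}

# Probe the in-cluster API server address; run this on each node in turn.
# Optional argument: the service IP (defaults to the address from the logs).
probe_service_ip() {
    local service_ip="${1:-10.96.0.1}"
    curl --insecure --silent --show-error --max-time 5 \
        "https://${service_ip}:443/version"
}
```

Any HTTP response from probe_service_ip, even a 401 or 403, suggests the node can route to the API server. A timeout on one node would reproduce the "i/o timeout" error from the failing operator and point at that node's networking.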
    
  • chrispokorni
    chrispokorni Posts: 2,274

    Hi @izzl,

    I'm almost 100% its a network issue

    You are correct. When network related components fail to start, typically the infrastructure network is improperly configured.

    Have you tried rebooting the VMs? At times the reboot helps to reset or apply necessary changes that enable traffic between components.

    What type of infrastructure is hosting your cluster? What cloud or local hypervisor? What are the VM sizes (cpu, mem, disk), the guest OS, type of network interfaces (host, bridge, nat, ...)? How is the network set up? What about firewalls or proxy?

    The introductory chapter presents two video guides to configure infrastructure on GCP and AWS, focusing on VPC network configuration and necessary firewalls and security groups respectively that would enable communication between all Kubernetes cluster components. Similar configuration options should be applied to other cloud providers and on-premises hypervisors.

    Regards,
    -Chris

  • izzl
    izzl Posts: 7
    edited April 19

    I'm hosting on Ubuntu 22.04 LTS in GCP with e2-standard-8 instances, each with 8 vCPUs, 32 GB of memory, and a 100 GB disk. I have one control plane node and one worker node.

    I've provisioned them in the same VPC and got it working once by installing Helm and updating Cilium. What might I be doing wrong? I've also tried the provided cilium-cni.yaml, the helm template command for Cilium, and the latest Cilium. I'm happy to share my scripts if that helps.

  • izzl
    izzl Posts: 7
    edited April 19

    The only way I was able to get this to work was to install with helm and pin the service host:

    helm repo add cilium https://helm.cilium.io/
    helm repo update
    helm upgrade --install cilium cilium/cilium --version 1.14.1 \
        --namespace kube-system \
        --set kubeProxyReplacement=strict \
        --set k8sServiceHost=$internal_ip \
        --set k8sServicePort=6443
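
The command above does not show how $internal_ip was set. One possible way to populate it, assuming a kubeadm cluster where the control plane node carries the default node-role.kubernetes.io/control-plane label, is sketched below; the function name is hypothetical.

```shell
# Look up the InternalIP of the first control plane node.
# Assumes the kubeadm default label node-role.kubernetes.io/control-plane.
control_plane_internal_ip() {
    kubectl get nodes \
        --selector node-role.kubernetes.io/control-plane \
        --output jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}'
}
```

Usage would be internal_ip=$(control_plane_internal_ip) before running helm. Setting k8sServiceHost and k8sServicePort makes Cilium contact the API server at the node address directly instead of going through the 10.96.0.1 service IP, which is likely why this worked when the service IP timed out.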
    
  • izzl
    izzl Posts: 7

    You are right. I went back to the start and ensured all steps were followed in order. I then automated everything in a bash script, and it now works fine.

  • I had somewhat similar network issues (on Digital Ocean). They were resolved when I changed the OS from Ubuntu 22.04 to 20.04... thank you @chrispokorni

    PS: It would probably be beneficial to add a note on the lab pdf of what you mentioned above.

  • chrispokorni
    chrispokorni Posts: 2,274

    Hi @ezequiel.rodrigue,

    The Installation and Configuration page of Ch3 in the lab guide does mention that the labs have been compiled on Ubuntu 20.04. Sometimes, not always though, deviating from an explicit version will result in discrepancies or misbehaviors. Especially in these scenarios compatibility should be carefully checked between components of the software stack to ensure the environment is properly installed and configured.

    Regards,
    -Chris

  • Hi @chrispokorni,

    Yes, I noticed that the OS version was specified in the guide. My only suggestion was to add a note explaining what you said above:

    [...] However, Ubuntu 22.04 LTS is known to cause some networking issues, that is why we still recommend 20.04 LTS
