Lab 3.1: Cilium helm install crashloop and no interfaces

Hi there, I'm using Kubernetes 1.28.1 with Cilium. Some pods are working, but others are crashlooping. The network interfaces aren't visible, which prevents me from progressing to the next step.

Do you have any suggestions for troubleshooting? So far I've tried both a helm install and the provided YAML:

kubectl get pods --all-namespaces
NAMESPACE     NAME                               READY   STATUS     RESTARTS      AGE
kube-system   cilium-6prb4                       1/1     Running    0             3m46s
kube-system   cilium-operator-788c4f69bc-52vpc   0/1     Running    2 (84s ago)   3m46s
kube-system   cilium-operator-788c4f69bc-7m68z   1/1     Running    0             3m46s
kube-system   cilium-pbl26                       0/1     Init:0/6   0             25s
kube-system   coredns-5dd5756b68-d5j9z           1/1     Running    0             17m
kube-system   coredns-5dd5756b68-pv9tb           1/1     Running    0             17m
kube-system   etcd-k8scp                         1/1     Running    0             17m
kube-system   kube-apiserver-k8scp               1/1     Running    0             17m
kube-system   kube-controller-manager-k8scp      1/1     Running    0             17m
kube-system   kube-proxy-n6nfj                   1/1     Running    0             17m
kube-system   kube-proxy-trn2d                   1/1     Running    0             4m27s
kube-system   kube-scheduler-k8scp               1/1     Running    0             17m
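
For reference, this is roughly how I've been inspecting the failing pods (pod names are the ones from the listing above):

# Events for the crashlooping operator replica and the stuck agent pod
kubectl -n kube-system describe pod cilium-operator-788c4f69bc-52vpc
kubectl -n kube-system describe pod cilium-pbl26

# Logs from the crashlooping operator, including the previous (crashed) container
kubectl -n kube-system logs cilium-operator-788c4f69bc-52vpc
kubectl -n kube-system logs cilium-operator-788c4f69bc-52vpc --previous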

Best Answer

  • chrispokorni
    chrispokorni Posts: 2,356
    edited May 9 Answer ✓

    Hi @izzl,

    The lab guide follows a hardware and software recipe that allows learners to successfully complete the lab exercises on various infrastructures: either a major cloud provider (AWS, Azure, GCP, Digital Ocean, IBM Cloud, etc.) or a local hypervisor (VirtualBox, KVM, etc.). Deviating from the recommended ingredients will require the learner to put in additional work to bring the environment to its desired state.

    In your case, the VM sizes do not seem to cause any issues, as they supply more than the CPU, memory, and disk needed for the lab exercises. However, Ubuntu 22.04 LTS is known to cause some networking issues, which is why we still recommend 20.04 LTS - it is also in sync with the CKA certification exam environment.

    Provided that the recipe is followed from the very beginning and the desired infrastructure is provisioned and configured as recommended by the lab guide, the learner should not have to perform any installation or configuration that is not covered in the lab guide, which may be outside the training's scope.

    Regards,
    -Chris

Answers

  • izzl
    izzl Posts: 7
    edited April 17

    I'm almost 100% sure it's a network issue: the pod on the worker is unable to talk to the control plane. Here are the logs from cilium-operator-788c4f69bc-52vpc:

    level=info msg=Starting subsys=hive
    level=info msg="Started gops server" address="127.0.0.1:9891" subsys=gops
    level=info msg="Start hook executed" duration="289.791µs" function="gops.registerGopsHooks.func1 (cell.go:44)" subsys=hive
    level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
    level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
    level=error msg="Unable to contact k8s api-server" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" ipAddr="https://10.96.0.1:443" subsys=k8s-client
    level=error msg="Start hook failed" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" function="client.(*compositeClientset).onStart" subsys=hive
    level=info msg=Stopping subsys=hive
    level=info msg="Stopped gops server" address="127.0.0.1:9891" subsys=gops
    level=info msg="Stop hook executed" duration="175.964µs" function="gops.registerGopsHooks.func2 (cell.go:51)" subsys=hive
    level=fatal msg="failed to start: Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" subsys=cilium-operator-generic
    

    But the other operator, cilium-operator-788c4f69bc-7m68z, is running fine and is successfully set up, so I'm unsure why the network issue only exists on the other replica:

    # excluded previous successful logs
    level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
    level=info msg="Connected to apiserver" subsys=k8s-client
    level=info msg="Start hook executed" duration=11.366823ms function="client.(*compositeClientset).onStart" subsys=hive
    level=info msg="Start hook executed" duration="10.36µs" function="cmd.registerOperatorHooks.func1 (root.go:159)" subsys=hive
    level=info msg="Waiting for leader election" subsys=cilium-operator-generic
    level=info msg="attempting to acquire leader lease kube-system/cilium-operator-resource-lock..." subsys=klog
    level=info msg="successfully acquired lease kube-system/cilium-operator-resource-lock" subsys=klog
    level=info msg="Leading the operator HA deployment" subsys=cilium-operator-generic
    level=info msg="Start hook executed" duration=11.785021ms function="*api.server.Start" subsys=hive
    # excluded following successful logs
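
    One way to narrow this down (a rough check; 10.96.0.1 is the kubernetes service IP from the logs above) is to see which node each operator replica landed on, and whether that service IP answers from the worker node at all:

    # Which node is each cilium-operator replica scheduled on?
    kubectl -n kube-system get pods -o wide | grep cilium-operator

    # Run from the node hosting the failing replica: any HTTP response means the
    # route to the API server service IP works; a timeout mirrors the error above.
    curl -k --max-time 5 https://10.96.0.1:443/version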
    
  • chrispokorni
    chrispokorni Posts: 2,356

    Hi @izzl,

    I'm almost 100% sure it's a network issue

    You are correct. When network-related components fail to start, it is typically because the infrastructure network is improperly configured.

    Have you tried rebooting the VMs? At times the reboot helps to reset or apply necessary changes that enable traffic between components.

    What type of infrastructure is hosting your cluster? What cloud or local hypervisor? What are the VM sizes (CPU, memory, disk), the guest OS, and the type of network interfaces (host, bridge, NAT, ...)? How is the network set up? What about firewalls or proxies?

    The introductory chapter presents two video guides for configuring infrastructure on GCP and AWS, focusing on VPC network configuration with the necessary firewall rules and on security groups, respectively, so that all Kubernetes cluster components can communicate. Similar configuration should be applied on other cloud providers and on-premises hypervisors.
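
    On GCP, for example, a single permissive firewall rule on the lab VPC is typically enough for the exercises. A rough sketch (the network name and source range below are placeholders for your own VPC and its subnet CIDR):

    gcloud compute firewall-rules create k8s-allow-internal \
        --network NETWORK_NAME \
        --allow all \
        --source-ranges 10.2.0.0/16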

    Regards,
    -Chris

  • izzl
    izzl Posts: 7
    edited April 19

    I'm hosting on Ubuntu 22.04 LTS in GCP on e2-standard-8 instances, which have 8 vCPUs, 32 GB of memory, and a 100 GB disk each. I've got one control plane and one worker node.

    I've got them provisioned on the same VPC and got it working once by installing helm and updating Cilium. What might I be doing wrong? I've also tried using the provided cilium-cni.yaml, using the helm template command for Cilium, and using the latest Cilium. I'm happy to share my scripts if that helps.

  • izzl
    izzl Posts: 7
    edited April 19

    The only way I was able to get this to work was to install with helm and pin the service host:

    helm repo add cilium https://helm.cilium.io/
    helm repo update
    helm upgrade --install cilium cilium/cilium --version 1.14.1 \
        --namespace kube-system \
        --set kubeProxyReplacement=strict \
        --set k8sServiceHost=$internal_ip \
        --set k8sServicePort=6443
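
    Here internal_ip is the control plane node's private address. Something along these lines populates it when run on the control plane itself (assuming the first address reported is the VPC-internal one):

    internal_ip=$(hostname -I | awk '{print $1}')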
    
  • izzl
    izzl Posts: 7

    You are right. I went back to the start and ensured all steps were followed in order. I then automated everything in a bash script, and it works fine now when run.

  • I had somewhat similar network issues (on Digital Ocean). They were resolved once I changed the OS from Ubuntu 22.04 to 20.04... thank you @chrispokorni

    PS: It would probably be beneficial to add a note to the lab PDF about what you mentioned above.

  • chrispokorni
    chrispokorni Posts: 2,356

    Hi @ezequiel.rodrigue,

    The Installation and Configuration page of Ch3 in the lab guide does mention that the labs have been compiled on Ubuntu 20.04. Sometimes, though not always, deviating from an explicit version will result in discrepancies or misbehaviors. In these scenarios especially, compatibility should be carefully checked between the components of the software stack to ensure the environment is properly installed and configured.

    Regards,
    -Chris

  • Hi @chrispokorni,

    Yes, I noticed that the OS version was specified in the guide. My only suggestion was that a note be added explaining what you said above:

    [...] However, Ubuntu 22.04 LTS is known to cause some networking issues, which is why we still recommend 20.04 LTS
