Lab 3.1: Cilium helm install crashloop and no interfaces
Hi there, I'm using Kubernetes 1.28.1 with Cilium. Some of the Cilium pods appear to be working, but others are crash looping. The network interfaces aren't visible, which prevents me from progressing to the next step.
Do you have any suggestions for troubleshooting? So far I've tried both helm install and the provided YAML:
kubectl get pods --all-namespaces
NAMESPACE     NAME                               READY   STATUS     RESTARTS      AGE
kube-system   cilium-6prb4                       1/1     Running    0             3m46s
kube-system   cilium-operator-788c4f69bc-52vpc   0/1     Running    2 (84s ago)   3m46s
kube-system   cilium-operator-788c4f69bc-7m68z   1/1     Running    0             3m46s
kube-system   cilium-pbl26                       0/1     Init:0/6   0             25s
kube-system   coredns-5dd5756b68-d5j9z           1/1     Running    0             17m
kube-system   coredns-5dd5756b68-pv9tb           1/1     Running    0             17m
kube-system   etcd-k8scp                         1/1     Running    0             17m
kube-system   kube-apiserver-k8scp               1/1     Running    0             17m
kube-system   kube-controller-manager-k8scp      1/1     Running    0             17m
kube-system   kube-proxy-n6nfj                   1/1     Running    0             17m
kube-system   kube-proxy-trn2d                   1/1     Running    0             4m27s
kube-system   kube-scheduler-k8scp               1/1     Running    0             17m
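With a listing like the one above, it can help to narrow the output down to the pods whose READY count is short of the desired count. This is only a minimal sketch, assuming the default column layout of `kubectl get pods --all-namespaces --no-headers`; `not_ready_pods` is a hypothetical helper name, not part of kubectl.

```shell
# Hypothetical helper: read `kubectl get pods --all-namespaces --no-headers`
# output on stdin and print only pods whose READY column (ready/total)
# shows fewer ready containers than the total.
not_ready_pods() {
  awk '{
    # Column 3 is READY, e.g. "0/1"; split it into ready and total counts.
    split($3, r, "/")
    if (r[1] != r[2]) print $1 "/" $2, $3, $4
  }'
}
```

Possible usage: `kubectl get pods --all-namespaces --no-headers | not_ready_pods`. Adding `-o wide` to a plain `kubectl get pods` also shows which node each pod runs on, which may help tell whether the failures are confined to one node.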
Best Answer
-
Hi @izzl,
The lab guide follows a hardware+software recipe that allows learners to successfully complete the lab exercises on various infrastructures, whether a major cloud provider (AWS, Azure, GCP, Digital Ocean, IBM Cloud, etc.) or a local hypervisor (VirtualBox, KVM, etc.). Deviating from the recommended ingredients will require the learner to put in additional work to bring the environment to its desired state.
In your case, the VM sizes do not seem to cause any issues, as they supply more than the CPU, memory and disk needed for the lab exercises. However, Ubuntu 22.04 LTS is known to cause some networking issues. That is why we still recommend 20.04 LTS, which is also in sync with the CKA certification exam environment.
Provided that the recipe is followed from the very beginning and the desired infrastructure is provisioned and configured as recommended by the lab guide, the learner should not have to perform installation or configuration that is not covered in the lab guide, which may be outside of the training's scope.
Regards,
-Chris
Answers
-
I'm almost 100% sure it's a network issue, with a pod on the worker node unable to talk to the control plane. Here are the logs from cilium-operator-788c4f69bc-52vpc:
level=info msg=Starting subsys=hive
level=info msg="Started gops server" address="127.0.0.1:9891" subsys=gops
level=info msg="Start hook executed" duration="289.791µs" function="gops.registerGopsHooks.func1 (cell.go:44)" subsys=hive
level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
level=error msg="Unable to contact k8s api-server" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" ipAddr="https://10.96.0.1:443" subsys=k8s-client
level=error msg="Start hook failed" error="Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" function="client.(*compositeClientset).onStart" subsys=hive
level=info msg=Stopping subsys=hive
level=info msg="Stopped gops server" address="127.0.0.1:9891" subsys=gops
level=info msg="Stop hook executed" duration="175.964µs" function="gops.registerGopsHooks.func2 (cell.go:51)" subsys=hive
level=fatal msg="failed to start: Get \"https://10.96.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.96.0.1:443: i/o timeout" subsys=cilium-operator-generic
But the other operator, cilium-operator-788c4f69bc-7m68z, is running fine and was set up successfully, so I'm unsure why the network issue only affects the other replica:
# excluded previous successful logs
level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
level=info msg="Connected to apiserver" subsys=k8s-client
level=info msg="Start hook executed" duration=11.366823ms function="client.(*compositeClientset).onStart" subsys=hive
level=info msg="Start hook executed" duration="10.36µs" function="cmd.registerOperatorHooks.func1 (root.go:159)" subsys=hive
level=info msg="Waiting for leader election" subsys=cilium-operator-generic
level=info msg="attempting to acquire leader lease kube-system/cilium-operator-resource-lock..." subsys=klog
level=info msg="successfully acquired lease kube-system/cilium-operator-resource-lock" subsys=klog
level=info msg="Leading the operator HA deployment" subsys=cilium-operator-generic
level=info msg="Start hook executed" duration=11.785021ms function="*api.server.Start" subsys=hive
# excluded following successful logs
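When comparing a failing replica against a healthy one, it may be quicker to pull out just the endpoints that timed out rather than reading the full log. The following is a rough sketch that assumes the log wording shown above (`dial tcp HOST:PORT: i/o timeout`); `timed_out_dials` is a hypothetical helper name.

```shell
# Hypothetical helper: read Cilium log text on stdin and print each unique
# HOST:PORT that appears in a "dial tcp HOST:PORT: i/o timeout" error.
timed_out_dials() {
  grep -oE 'dial tcp [0-9.]+:[0-9]+: i/o timeout' |
    sed -E 's/^dial tcp ([0-9.]+:[0-9]+):.*/\1/' |
    sort -u
}
```

Possible usage: `kubectl -n kube-system logs cilium-operator-788c4f69bc-52vpc | timed_out_dials`. In the failing log above the only such endpoint is 10.96.0.1:443, which is the ClusterIP of the default `kubernetes` Service on a default kubeadm cluster, so the timeout suggests that Service traffic from that pod's node is not reaching the API server.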
0 -
Hi @izzl,
I'm almost 100% its a network issue
You are correct. When network-related components fail to start, typically the infrastructure network is improperly configured.
Have you tried rebooting the VMs? At times the reboot helps to reset or apply necessary changes that enable traffic between components.
What type of infrastructure is hosting your cluster? What cloud or local hypervisor? What are the VM sizes (cpu, mem, disk), the guest OS, type of network interfaces (host, bridge, nat, ...)? How is the network set up? What about firewalls or proxy?
The introductory chapter presents two video guides to configure infrastructure on GCP and AWS, focusing on VPC network configuration and necessary firewalls and security groups respectively that would enable communication between all Kubernetes cluster components. Similar configuration options should be applied to other cloud providers and on-premises hypervisors.
Regards,
-Chris
0 -
I'm hosting on Ubuntu 22.04 LTS in GCP on e2-standard-8 instances, each with 8 CPUs, 32 GB of memory and a 100 GB disk. I've got one control plane node and one worker node.
I've got them provisioned on the same VPC and got it working once by installing helm and updating Cilium. What might I be doing wrong? I've also tried using the provided cilium-cni.yaml, using the helm template command for Cilium, and using the latest Cilium. I'm happy to share my scripts if that helps.
0 -
The only way I was able to get this to work was to install with helm and pin the service host:
helm repo add cilium https://helm.cilium.io/
helm repo update
helm upgrade --install cilium cilium/cilium --version 1.14.1 \
  --namespace kube-system \
  --set kubeProxyReplacement=strict \
  --set k8sServiceHost=$internal_ip \
  --set k8sServicePort=6443
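One thing to watch with the command above: if `$internal_ip` is unset or empty, helm would receive `k8sServiceHost=` with no value. A minimal guard might look like the sketch below. `is_ipv4` and `install_cilium` are hypothetical helper names, and the check only looks at the shape of the value (four dot-separated groups of digits), not at whether the address is actually the control plane's.

```shell
# Hypothetical helper: succeed only if the argument looks like a dotted-quad
# IPv4 address. This is a shape check; it does not range-check the octets.
is_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Hypothetical wrapper around the helm command from the post above.
# Argument: the control plane's internal IP address.
install_cilium() {
  if ! is_ipv4 "$1"; then
    echo "refusing to install: '$1' is not an IPv4 address" >&2
    return 1
  fi
  helm upgrade --install cilium cilium/cilium --version 1.14.1 \
    --namespace kube-system \
    --set kubeProxyReplacement=strict \
    --set k8sServiceHost="$1" \
    --set k8sServicePort=6443
}
```

Possible usage, after the `helm repo add` and `helm repo update` steps: `install_cilium "$internal_ip"`.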
0 -
You are right. I went back to the start and ensured all steps were followed in order. I then automated everything in a bash script, and it now works fine when I run it.
0 -
I had somewhat similar network issues (on Digital Ocean). Resolved now when I changed the OS from Ubuntu 22.04 to 20.04... thank you @chrispokorni
PS: It would probably be beneficial to add a note on the lab pdf of what you mentioned above.
0 -
The Installation and Configuration page of Ch3 in the lab guide does mention that the labs have been compiled on Ubuntu 20.04. Sometimes, not always though, deviating from an explicit version will result in discrepancies or misbehaviors. Especially in these scenarios compatibility should be carefully checked between components of the software stack to ensure the environment is properly installed and configured.
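As a rough illustration of such a check, a node's OS release could be compared against the one the labs were compiled on before starting. This is only a sketch: `check_lab_os` is a hypothetical helper name, and the expected values (ubuntu, 20.04) are taken from this thread rather than from any tool.

```shell
# Hypothetical helper: report whether a node matches the OS release the lab
# guide was compiled on (Ubuntu 20.04, per this thread).
# Optional argument: path to an os-release file (defaults to /etc/os-release).
check_lab_os() {
  file="${1:-/etc/os-release}"
  (
    # os-release files are shell-compatible KEY=value assignments,
    # so they can be sourced; the subshell keeps the variables contained.
    . "$file" || exit 2
    if [ "$ID" = "ubuntu" ] && [ "$VERSION_ID" = "20.04" ]; then
      echo "OK: $ID $VERSION_ID matches the lab guide"
    else
      echo "WARN: $ID $VERSION_ID differs from the lab guide (ubuntu 20.04)"
      exit 1
    fi
  )
}
```

Running `check_lab_os` on each node before the installation steps would flag a 22.04 image early, although a matching release does not by itself guarantee a working network setup.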
Regards,
-Chris
0 -
Hi @chrispokorni,
Yes, I noticed that the OS version was specified in the guide. My only suggestion was that a note be added explaining what you said above:
[...] However, Ubuntu 22.04 LTS is known to cause some networking issues, that is why we still recommend 20.04 LTS