gatekeeper CrashLoopBackOff
Hi,
I set up two Ubuntu 18.04 servers, one as master and one as worker, as described in the tutorial.
When I run
kubectl create -f gatekeeper.yaml
the pods scheduled on the master node reach the Running state, but the pods scheduled on the worker node do not.
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       Error
  Exit Code:    2
  Started:      Wed, 04 Aug 2021 14:26:11 +0000
  Finished:     Wed, 04 Aug 2021 14:26:39 +0000
Ready:          False
Restart Count:  9
Limits:
  cpu:     1
  memory:  512Mi
Requests:
  cpu:     100m
  memory:  256Mi
Liveness:   http-get http://:9090/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness:  http-get http://:9090/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
I installed nginx to see if there is a problem on the worker node, but the nginx pods on the worker are in the Running state, so it seems there is no problem with Calico.
Regards
Comments
Hello,
We would need more information in order to troubleshoot the problem. Are all of your pods running? Any errors?
Did you use the same lab setup when you were studying for the CKA? Any problems then? What is your VM provider setup: AWS, GCE, VirtualBox, VMware? Are you sure there is no firewall between the nodes?
I'd suggest using the skills you gained from your CKA to narrow down the error, which would let me help find the fix.
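For example, the output of commands along these lines would give us a starting point (assuming the standard gatekeeper.yaml, which creates the gatekeeper-system namespace):

kubectl get nodes -o wide
kubectl get pods -A -o wide
kubectl -n gatekeeper-system describe pods
kubectl -n gatekeeper-system logs deploy/gatekeeper-controller-manager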
Regards,
Thanks for fast reply.
Here is the pod status:

argela@argela:~/gk$ kubectl get pods -A -o wide
NAMESPACE           NAME                                            READY   STATUS             RESTARTS   AGE   IP               NODE     NOMINATED NODE   READINESS GATES
gatekeeper-system   gatekeeper-audit-54b5f86d57-nkz6q               0/1     Running            16         37m   192.168.171.72   worker   <none>           <none>
gatekeeper-system   gatekeeper-controller-manager-5b96bd668-88scv   1/1     Running            0          37m   192.168.132.69   argela   <none>           <none>
gatekeeper-system   gatekeeper-controller-manager-5b96bd668-f8z6j   0/1     CrashLoopBackOff   15         37m   192.168.171.73   worker   <none>           <none>
gatekeeper-system   gatekeeper-controller-manager-5b96bd668-sdv2n   0/1     CrashLoopBackOff   15         37m   192.168.171.74   worker   <none>           <none>
kube-system         calico-kube-controllers-5f6cfd688c-sl5d5        1/1     Running            0          9h    192.168.132.65   argela   <none>           <none>
kube-system         calico-node-hsfbl                               1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
kube-system         calico-node-m6qw2                               1/1     Running            0          8h    192.168.20.232   worker   <none>           <none>
kube-system         coredns-74ff55c5b-dbtct                         1/1     Running            0          9h    192.168.132.66   argela   <none>           <none>
kube-system         coredns-74ff55c5b-wkgdx                         1/1     Running            0          9h    192.168.132.67   argela   <none>           <none>
kube-system         etcd-argela                                     1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
kube-system         kube-apiserver-argela                           1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
kube-system         kube-controller-manager-argela                  1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
kube-system         kube-proxy-p5nlm                                1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
kube-system         kube-proxy-pbp6r                                1/1     Running            0          8h    192.168.20.232   worker   <none>           <none>
kube-system         kube-scheduler-argela                           1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
It seems there is a problem with the readiness and liveness URL checks, but there are no logs for these pods.
For example, the gatekeeper-audit-54b5f86d57-nkz6q pod is in "0/1 Running" status but has no log for troubleshooting:

argela@argela:~/gk$ kubectl -n gatekeeper-system logs gatekeeper-audit-54b5f86d57-nkz6q
argela@argela:~/gk$

I've disabled the firewall, and to be sure I installed nginx; it installed successfully on both nodes. There is no ERROR line for Calico.
Last State: Terminated Reason: Error Exit Code: 2
I searched for exit code 2; one of the articles I found says it indicates an error from the application itself.
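(Two commands that might pull more detail out of a crash-looping container: the --previous flag reads the logs of the last terminated instance, and the namespace events show what the kubelet reported. The pod name is the one from the listing above.)

kubectl -n gatekeeper-system logs gatekeeper-audit-54b5f86d57-nkz6q --previous
kubectl -n gatekeeper-system get events --sort-by=.lastTimestamp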
If you need any other details, I can provide them.
This is a new setup for CKS, on VMware. Thanks.
Hello,
From what I can see you have one pod running (gatekeeper-controller-manager-5b96bd668-88scv) and the other two in an error state.
What is the size of your VMs, in CPU and memory? When you ran kubectl create -f gatekeeper.yaml, did you see any errors?
Are you sure you have no firewall blocking traffic between your VMs? It seems the failed pods are all on your worker; what is the condition of that node? Does htop show it has enough resources?
Use the same troubleshooting skills from your CKA to find the error so we can work out the fix.
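For example, something along these lines, run on the worker itself and from wherever kubectl is configured (kubectl top needs metrics-server, so skip it if that is not installed):

# on the worker node
free -m
nproc

# from the machine running kubectl
kubectl describe node worker | grep -A 8 Allocatable
kubectl top node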
Regards,
Hi, OK, I'll try some more, thank you.
Yes, if a gatekeeper pod is scheduled on the master node it reaches the Running state with no problem; if it is scheduled on the worker node, CrashLoopBackOff occurs.
Here is the resource status:
master:
argela@argela:~/gk$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7976        1285        1679           2        5011        6742
Swap:             0           0           0
argela@argela:~/gk$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              3
On-line CPU(s) list: 0-2
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           3
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Stepping:            4
CPU MHz:             2294.609
BogoMIPS:            4589.21
Hypervisor vendor:   VMware
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0-2
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
argela@argela:~/gk$
1  [##**                                             2.7%]   Tasks: 92, 401 thr; 2 running
2  [###**                                            4.6%]   Load average: 1.04 0.81 0.48
3  [###**                                            5.3%]   Uptime: 2 days, 07:48:25
Mem[||||||||||||||####************************************************** 1.26G/7.79G]
Swp[                                                                          0K/0K]
worker:
1  [##                                               2.0%]   Tasks: 60, 196 thr; 1 running
2  [##**                                             2.7%]   Load average: 0.03 0.06 0.07
3  [##*                                              2.0%]   Uptime: 09:33:19
Mem[|||||##***************                                                446M/7.79G]
Swp[                                                                          0K/0K]
argela@worker:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7976         468        5979           1        1528        7281
Swap:             0           0           0
argela@worker:~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              3
On-line CPU(s) list: 0-2
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           3
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Stepping:            4
CPU MHz:             2294.609
BogoMIPS:            4589.21
Hypervisor vendor:   VMware
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0-2
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
argela@worker:~$
Hello,
Check your nodes and see if there is some sort of taint causing a failure on the worker. I'd also double-check you don't have a firewall in the way between the nodes, or running on one of your nodes. When you create a deployment running nginx and scale it up, do the pods run when they land on the worker?
What other logs, events, and describe output do you find for pods running on your worker?
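A sketch of the kind of checks I mean (the deployment name here is just an example):

kubectl describe nodes | grep -i -A 2 taint
kubectl create deployment nginx-test --image=nginx
kubectl scale deployment nginx-test --replicas=6
kubectl get pods -o wide -l app=nginx-test
kubectl get events -A --field-selector type=Warning
kubectl -n gatekeeper-system describe pod <one-of-the-failing-pods>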
Regards,
Hi again. There is a problem with the liveness and readiness checks, but I couldn't find any cause, as there is no application log.
Here is the information from the system:

argela@argela:~/gk$ kubectl logs -n gatekeeper-system gatekeeper-audit-54b5f86d57-c4fpm
argela@argela:~/gk$
Events:
gatekeeper-system   0s   Normal    Pulled      pod/gatekeeper-audit-54b5f86d57-c4fpm               Successfully pulled image "openpolicyagent/gatekeeper:v3.3.0" in 1.680295184s
gatekeeper-system   0s   Normal    Created     pod/gatekeeper-audit-54b5f86d57-c4fpm               Created container manager
gatekeeper-system   0s   Normal    Started     pod/gatekeeper-audit-54b5f86d57-c4fpm               Started container manager
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-controller-manager-5b96bd668-c8vfp   Liveness probe failed: Get "http://192.168.171.83:9090/healthz": dial tcp 192.168.171.83:9090: connect: connection refused
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-controller-manager-5b96bd668-c8vfp   Readiness probe failed: Get "http://192.168.171.83:9090/readyz": dial tcp 192.168.171.83:9090: connect: connection refused
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-audit-54b5f86d57-c4fpm               Liveness probe failed: Get "http://192.168.171.84:9090/healthz": dial tcp 192.168.171.84:9090: connect: connection refused
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-controller-manager-5b96bd668-5sd92   Liveness probe failed: Get "http://192.168.171.82:9090/healthz": dial tcp 192.168.171.82:9090: connect: connection refused
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-controller-manager-5b96bd668-5sd92   Readiness probe failed: Get "http://192.168.171.82:9090/readyz": dial tcp 192.168.171.82:9090: connect: connection refused
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-audit-54b5f86d57-c4fpm               Readiness probe failed: Get "http://192.168.171.84:9090/readyz": dial tcp 192.168.171.84:9090: connect: connection refused
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-controller-manager-5b96bd668-c8vfp   Liveness probe failed: Get "http://192.168.171.83:9090/healthz": dial tcp 192.168.171.83:9090: connect: connection refused
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-controller-manager-5b96bd668-c8vfp   Readiness probe failed: Get "http://192.168.171.83:9090/readyz": dial tcp 192.168.171.83:9090: connect: connection refused
Other information
argela@argela:~$ kubectl describe nodes | grep -i Taint
Taints:
Taints:

master:
argela@argela:~$ systemctl status ufw
ufw.service - Uncomplicated firewall
Loaded: loaded (/lib/systemd/system/ufw.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:ufw(8)
argela@argela:~$ systemctl status apparmor.service
● apparmor.service - AppArmor initialization
Loaded: loaded (/lib/systemd/system/apparmor.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:apparmor(7)
http://wiki.apparmor.net/
argela@argela:~$

worker:
argela@worker:~$ systemctl status ufw
● ufw.service - Uncomplicated firewall
Loaded: loaded (/lib/systemd/system/ufw.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:ufw(8)
argela@worker:~$ systemctl status apparmor.service
● apparmor.service - AppArmor initialization
Loaded: loaded (/lib/systemd/system/apparmor.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:apparmor(7)
http://wiki.apparmor.net/

nginx deployment (scheduled on both nodes):
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default nginx-deployment-66b6c48dd5-266gb 1/1 Running 0 5s 192.168.132.75 argela
default nginx-deployment-66b6c48dd5-724gh 1/1 Running 0 25s 192.168.171.76 worker
default nginx-deployment-66b6c48dd5-8fp49 1/1 Running 0 25s 192.168.132.73 argela
default nginx-deployment-66b6c48dd5-8hfwn 1/1 Running 0 35s 192.168.171.73 worker
default nginx-deployment-66b6c48dd5-cdgtf 1/1 Running 0 104s 192.168.171.69 worker
default nginx-deployment-66b6c48dd5-csfcw 1/1 Running 0 35s 192.168.171.74 worker
default nginx-deployment-66b6c48dd5-d6rml 1/1 Running 0 5s 192.168.132.74 argela
default nginx-deployment-66b6c48dd5-f8gvk 1/1 Running 0 55s 192.168.171.71 worker
default nginx-deployment-66b6c48dd5-lbjqc 1/1 Running 0 55s 192.168.171.72 worker
default nginx-deployment-66b6c48dd5-mlgj4 1/1 Running 0 104s 192.168.171.70 worker
default nginx-deployment-66b6c48dd5-pw87h 1/1 Running 0 5s 192.168.171.79 worker
default nginx-deployment-66b6c48dd5-t2hw9 1/1 Running 0 25s 192.168.171.77 worker
default nginx-deployment-66b6c48dd5-tz95g 1/1 Running 0 104s 192.168.171.68 worker
default nginx-deployment-66b6c48dd5-xw6rd 1/1 Running 0 25s 192.168.171.75 worker
default nginx-deployment-66b6c48dd5-zxmkm 1/1 Running 0 25s 192.168.171.78 worker

Thanks for your help
Regards,
Yavuz
Hello,
I saw that you checked the firewall on the VMs themselves; is there one on the host that is limiting traffic between the VMs?
The gatekeeper-controller-manager running on the control plane is working, but the two on the worker node are not. This leads me to think the traffic they use to talk to each other is being blocked in some way. Have you worked with NetworkPolicies yet on this cluster? Those don't show up when you look at the OS firewall. It really seems like something between the two nodes is causing issues.
Does wireshark show gatekeeper traffic leaving the worker and also being accepted on the control plane? Is there any chance of an overlap between the 192.168 pod network range and the VMs' own addresses?
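If wireshark is awkward on a headless VM, tcpdump will show the same traffic, and a couple of quick checks for policies and for an address overlap might look like this (a sketch):

kubectl get networkpolicy -A
# pod CIDR the cluster was initialised with (taken from the controller-manager flags)
kubectl cluster-info dump | grep -m 1 cluster-cidr
# on the worker: look for pod-network routes sitting inside the same range as the node addresses
ip route
# watch the probe traffic on port 9090
sudo tcpdump -i any -nn port 9090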
Regards,
Hi, I've changed "--pod-network-cidr" to 10.0.0.33/16 and it is OK now.
I'm not sure: is it a problem if the IP addresses of the nodes fall inside Calico's network CIDR range?
I mean, my node IP addresses are 192.168.20.225 and .232, which are in the 192.168.0.0/16 range.
Anyway, it seems the problem is resolved.
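In case it helps anyone else who hits this, the change amounts to re-initialising the cluster with a pod CIDR that does not overlap the node network and keeping Calico's IP pool in sync with it. A rough sketch, assuming kubeadm and the stock calico.yaml manifest (the CIDR below is just an example of a non-overlapping range):

# after kubeadm reset on both nodes, on the control plane:
sudo kubeadm init --pod-network-cidr=10.0.0.0/16

# in calico.yaml, set the IP pool to the same range before applying it:
#   - name: CALICO_IPV4POOL_CIDR
#     value: "10.0.0.0/16"
kubectl apply -f calico.yaml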
Thanks for your guidance.

Every 2.0s: kubectl get pods -A -o wide                                argela: Sat Aug 7 10:31:41 2021

NAMESPACE           NAME                                            READY   STATUS    RESTARTS   AGE     IP               NODE     NOMINATED NODE   READINESS GATES
gatekeeper-system   gatekeeper-audit-54b5f86d57-pjcfd               1/1     Running   0          87s     10.0.171.65      worker   <none>           <none>
gatekeeper-system   gatekeeper-controller-manager-5b96bd668-dd2cq   1/1     Running   0          87s     10.0.132.68      argela   <none>           <none>
gatekeeper-system   gatekeeper-controller-manager-5b96bd668-pftv8   1/1     Running   0          87s     10.0.171.67      worker   <none>           <none>
gatekeeper-system   gatekeeper-controller-manager-5b96bd668-vnrj4   1/1     Running   0          87s     10.0.171.66      worker   <none>           <none>
kube-system         calico-kube-controllers-5f6cfd688c-6r48g        1/1     Running   0          5m7s    10.0.132.66      argela   <none>           <none>
kube-system         calico-node-4hr2r                               1/1     Running   0          3m8s    192.168.20.232   worker   <none>           <none>
kube-system         calico-node-lmkq6                               1/1     Running   0          5m7s    192.168.20.225   argela   <none>           <none>
kube-system         coredns-74ff55c5b-f6nl5                         1/1     Running   0          5m7s    10.0.132.67      argela   <none>           <none>
kube-system         coredns-74ff55c5b-hzc8f                         1/1     Running   0          5m7s    10.0.132.65      argela   <none>           <none>
kube-system         etcd-argela                                     1/1     Running   0          5m15s   192.168.20.225   argela   <none>           <none>
kube-system         kube-apiserver-argela                           1/1     Running   0          5m15s   192.168.20.225   argela   <none>           <none>
kube-system         kube-controller-manager-argela                  1/1     Running   0          5m15s   192.168.20.225   argela   <none>           <none>
kube-system         kube-proxy-h6wt2                                1/1     Running   0          3m8s    192.168.20.232   worker   <none>           <none>
kube-system         kube-proxy-z8nk9                                1/1     Running   0          5m7s    192.168.20.225   argela   <none>           <none>
kube-system         kube-scheduler-argela                           1/1     Running   0          5m15s   192.168.20.225   argela   <none>           <none>
Regards,
Hello,
Yes, your routing table was sending the egress worker traffic to the host network rather than to the control plane.
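If anyone wants to see the collision on their own setup, comparing the kernel routes against the node subnet makes it visible (a rough check, run on the worker):

# with the old 192.168.0.0/16 pod CIDR, Calico's per-block pod routes overlapped the
# 192.168.20.x node network, so worker egress could follow the wrong route; with a
# 10.x pod CIDR the two ranges no longer collide
ip route | grep -E '192\.168|10\.0'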
Regards,