gatekeeper CrashLoopBackOff

Hi,
I set up two Ubuntu 18.04 servers, one as master and one as worker, as described in the tutorial.
When I run
kubectl create -f gatekeeper.yaml
the pods scheduled on the master node reach the Running state, but the ones scheduled on the worker node do not.
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       Error
  Exit Code:    2
  Started:      Wed, 04 Aug 2021 14:26:11 +0000
  Finished:     Wed, 04 Aug 2021 14:26:39 +0000
Ready:          False
Restart Count:  9
Limits:
  cpu:     1
  memory:  512Mi
Requests:
  cpu:     100m
  memory:  256Mi
Liveness:   http-get http://:9090/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness:  http-get http://:9090/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
I tried installing nginx to check whether there is a problem on the worker node, but the nginx pods on the worker node are running, so it does not look like a Calico problem.
Regards
Comments
-
Hello,
We would need more information in order to troubleshoot the problem. Are all of your pods running? Any errors?
Did you use the same lab setup when you were studying for the CKA? Any problems then? What is your VM provider setup: AWS, GCE, VirtualBox, VMware? Are you sure there is no firewall between the nodes?
I'd use the skills you gained from your CKA to track down the error, which would help me find the fix.
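For example, the usual starting points would be something like (substitute one of the failing pod names):
kubectl get pods -A -o wide
kubectl -n gatekeeper-system describe pod <failing-pod-name>
kubectl -n gatekeeper-system logs <failing-pod-name>
kubectl get events -A --sort-by=.metadata.creationTimestamp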
Regards,
-
Thanks for the fast reply.
Here is the pod status:
argela@argela:~/gk$ kubectl get pods -A -o wide
NAMESPACE           NAME                                            READY  STATUS            RESTARTS  AGE  IP              NODE    NOMINATED NODE  READINESS GATES
gatekeeper-system   gatekeeper-audit-54b5f86d57-nkz6q               0/1    Running           16        37m  192.168.171.72  worker  <none>          <none>
gatekeeper-system   gatekeeper-controller-manager-5b96bd668-88scv   1/1    Running           0         37m  192.168.132.69  argela  <none>          <none>
gatekeeper-system   gatekeeper-controller-manager-5b96bd668-f8z6j   0/1    CrashLoopBackOff  15        37m  192.168.171.73  worker  <none>          <none>
gatekeeper-system   gatekeeper-controller-manager-5b96bd668-sdv2n   0/1    CrashLoopBackOff  15        37m  192.168.171.74  worker  <none>          <none>
kube-system         calico-kube-controllers-5f6cfd688c-sl5d5        1/1    Running           0         9h   192.168.132.65  argela  <none>          <none>
kube-system         calico-node-hsfbl                               1/1    Running           0         9h   192.168.20.225  argela  <none>          <none>
kube-system         calico-node-m6qw2                               1/1    Running           0         8h   192.168.20.232  worker  <none>          <none>
kube-system         coredns-74ff55c5b-dbtct                         1/1    Running           0         9h   192.168.132.66  argela  <none>          <none>
kube-system         coredns-74ff55c5b-wkgdx                         1/1    Running           0         9h   192.168.132.67  argela  <none>          <none>
kube-system         etcd-argela                                     1/1    Running           0         9h   192.168.20.225  argela  <none>          <none>
kube-system         kube-apiserver-argela                           1/1    Running           0         9h   192.168.20.225  argela  <none>          <none>
kube-system         kube-controller-manager-argela                  1/1    Running           0         9h   192.168.20.225  argela  <none>          <none>
kube-system         kube-proxy-p5nlm                                1/1    Running           0         9h   192.168.20.225  argela  <none>          <none>
kube-system         kube-proxy-pbp6r                                1/1    Running           0         8h   192.168.20.232  worker  <none>          <none>
kube-system         kube-scheduler-argela                           1/1    Running           0         9h   192.168.20.225  argela  <none>          <none>
It seems there is a problem with the readiness and liveness URL checks, but there are no logs for these pods.
For example, the gatekeeper-audit-54b5f86d57-nkz6q pod is in "0/1 Running" status but produces no log output for troubleshooting:
argela@argela:~/gk$ kubectl -n gatekeeper-system logs gatekeeper-audit-54b5f86d57-nkz6q
argela@argela:~/gk$
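For a container that keeps restarting, the --previous flag can show the output of the last crashed instance, if it produced any; for example:
kubectl -n gatekeeper-system logs gatekeeper-audit-54b5f86d57-nkz6q --previous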
I've disabled the firewall, and to be sure I installed nginx; it deployed successfully on both nodes. There is no ERROR line for Calico.
Last State: Terminated  Reason: Error  Exit Code: 2
I searched for exit code 2; one article says it is an error coming from the application itself.
If you want other details, I can provide them.
This is a new setup for the CKS, on VMware. Thanks
-
Hello,
From what I can see you have one running (gatekeeper-controller-manager-5b96bd668-88scv), and the other two are in an error state.
What is the size of your VMs, in CPU and memory? When you ran kubectl create -f gatekeeper.yaml, did you see any errors?
Are you sure you have no firewall blocking traffic between your VMs? The failed pods are all on your worker; what is the condition of that node? Does htop show it has enough resources?
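For example, something along these lines would show the worker node's conditions, capacity, and allocated resources:
kubectl get nodes -o wide
kubectl describe node worker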
Use the same troubleshooting skills from your CKA to find the error so we can work out the fix.
Regards,
-
Hi, OK, I'll try more, thank you.
Yes, if a gatekeeper pod is scheduled on the master node it reaches the Running state with no problem; if it is scheduled on the worker node, CrashLoopBackOff occurs.
Here is the resource status:
master:
argela@argela:~/gk$ free -m
              total  used  free  shared  buff/cache  available
Mem:           7976  1285  1679       2        5011       6742
Swap:             0     0     0
argela@argela:~/gk$ lscpu
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
Byte Order:           Little Endian
CPU(s):               3
On-line CPU(s) list:  0-2
Thread(s) per core:   1
Core(s) per socket:   1
Socket(s):            3
NUMA node(s):         1
Vendor ID:            GenuineIntel
CPU family:           6
Model:                85
Model name:           Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Stepping:             4
CPU MHz:              2294.609
BogoMIPS:             4589.21
Hypervisor vendor:    VMware
Virtualization type:  full
L1d cache:            32K
L1i cache:            32K
L2 cache:             1024K
L3 cache:             25344K
NUMA node0 CPU(s):    0-2
Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
argela@argela:~/gk$
htop: CPU1 2.7%, CPU2 4.6%, CPU3 5.3%; Tasks: 92 (401 thr, 2 running); Load average: 1.04 0.81 0.48; Uptime: 2 days, 07:48:25; Mem: 1.26G/7.79G; Swp: 0K/0K
worker:
htop: CPU1 2.0%, CPU2 2.7%, CPU3 2.0%; Tasks: 60 (196 thr, 1 running); Load average: 0.03 0.06 0.07; Uptime: 09:33:19; Mem: 446M/7.79G; Swp: 0K/0K
argela@worker:~$ free -m
              total  used  free  shared  buff/cache  available
Mem:           7976   468  5979       1        1528       7281
Swap:             0     0     0
argela@worker:~$ lscpu
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
Byte Order:           Little Endian
CPU(s):               3
On-line CPU(s) list:  0-2
Thread(s) per core:   1
Core(s) per socket:   1
Socket(s):            3
NUMA node(s):         1
Vendor ID:            GenuineIntel
CPU family:           6
Model:                85
Model name:           Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Stepping:             4
CPU MHz:              2294.609
BogoMIPS:             4589.21
Hypervisor vendor:    VMware
Virtualization type:  full
L1d cache:            32K
L1i cache:            32K
L2 cache:             1024K
L3 cache:             25344K
NUMA node0 CPU(s):    0-2
Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
argela@worker:~$
-
Hello,
Check your nodes and see if there is some sort of taint causing a failure on the worker. I'd also double-check that you don't have a firewall in the way between the nodes, or running on one of them. When you create a deployment running nginx and scale it up, do the pods run when scheduled on the worker?
What other logs, events, and describe output do you find for pods running on your worker?
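As a rough sketch (the deployment name here is only an example):
kubectl describe nodes | grep -i taint
kubectl create deployment web-test --image=nginx
kubectl scale deployment web-test --replicas=10
kubectl get pods -o wide
kubectl -n gatekeeper-system describe pod <one-of-the-failing-pods>
kubectl get events -n gatekeeper-system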
Regards,
-
Hi again, there is a problem with the liveness and readiness checks, but I couldn't find the cause because there is no application log.
Here is the information from the system:
argela@argela:~/gk$ kubectl logs -n gatekeeper-system gatekeeper-audit-54b5f86d57-c4fpm
argela@argela:~/gk$
Events:
gatekeeper-system   0s   Normal    Pulled      pod/gatekeeper-audit-54b5f86d57-c4fpm               Successfully pulled image "openpolicyagent/gatekeeper:v3.3.0" in 1.680295184s
gatekeeper-system   0s   Normal    Created     pod/gatekeeper-audit-54b5f86d57-c4fpm               Created container manager
gatekeeper-system   0s   Normal    Started     pod/gatekeeper-audit-54b5f86d57-c4fpm               Started container manager
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-controller-manager-5b96bd668-c8vfp   Liveness probe failed: Get "http://192.168.171.83:9090/healthz": dial tcp 192.168.171.83:9090: connect: connection refused
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-controller-manager-5b96bd668-c8vfp   Readiness probe failed: Get "http://192.168.171.83:9090/readyz": dial tcp 192.168.171.83:9090: connect: connection refused
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-audit-54b5f86d57-c4fpm               Liveness probe failed: Get "http://192.168.171.84:9090/healthz": dial tcp 192.168.171.84:9090: connect: connection refused
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-controller-manager-5b96bd668-5sd92   Liveness probe failed: Get "http://192.168.171.82:9090/healthz": dial tcp 192.168.171.82:9090: connect: connection refused
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-controller-manager-5b96bd668-5sd92   Readiness probe failed: Get "http://192.168.171.82:9090/readyz": dial tcp 192.168.171.82:9090: connect: connection refused
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-audit-54b5f86d57-c4fpm               Readiness probe failed: Get "http://192.168.171.84:9090/readyz": dial tcp 192.168.171.84:9090: connect: connection refused
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-controller-manager-5b96bd668-c8vfp   Liveness probe failed: Get "http://192.168.171.83:9090/healthz": dial tcp 192.168.171.83:9090: connect: connection refused
gatekeeper-system   0s   Warning   Unhealthy   pod/gatekeeper-controller-manager-5b96bd668-c8vfp   Readiness probe failed: Get "http://192.168.171.83:9090/readyz": dial tcp 192.168.171.83:9090: connect: connection refused
Other information
argela@argela:~$ kubectl describe nodes | grep -i Taint
Taints:
Taints:

master:
argela@argela:~$ systemctl status ufw
ufw.service - Uncomplicated firewall
Loaded: loaded (/lib/systemd/system/ufw.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:ufw(8)
argela@argela:~$ systemctl status apparmor.service
● apparmor.service - AppArmor initialization
Loaded: loaded (/lib/systemd/system/apparmor.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:apparmor(7)
http://wiki.apparmor.net/
argela@argela:~$

worker:
argela@worker:~$ systemctl status ufw
● ufw.service - Uncomplicated firewall
Loaded: loaded (/lib/systemd/system/ufw.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:ufw(8)
argela@worker:~$ systemctl status apparmor.service
● apparmor.service - AppArmor initialization
Loaded: loaded (/lib/systemd/system/apparmor.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:apparmor(7)
http://wiki.apparmor.net/

nginx deployment (scheduled on both nodes):
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default nginx-deployment-66b6c48dd5-266gb 1/1 Running 0 5s 192.168.132.75 argela
default nginx-deployment-66b6c48dd5-724gh 1/1 Running 0 25s 192.168.171.76 worker
default nginx-deployment-66b6c48dd5-8fp49 1/1 Running 0 25s 192.168.132.73 argela
default nginx-deployment-66b6c48dd5-8hfwn 1/1 Running 0 35s 192.168.171.73 worker
default nginx-deployment-66b6c48dd5-cdgtf 1/1 Running 0 104s 192.168.171.69 worker
default nginx-deployment-66b6c48dd5-csfcw 1/1 Running 0 35s 192.168.171.74 worker
default nginx-deployment-66b6c48dd5-d6rml 1/1 Running 0 5s 192.168.132.74 argela
default nginx-deployment-66b6c48dd5-f8gvk 1/1 Running 0 55s 192.168.171.71 worker
default nginx-deployment-66b6c48dd5-lbjqc 1/1 Running 0 55s 192.168.171.72 worker
default nginx-deployment-66b6c48dd5-mlgj4 1/1 Running 0 104s 192.168.171.70 worker
default nginx-deployment-66b6c48dd5-pw87h 1/1 Running 0 5s 192.168.171.79 worker
default nginx-deployment-66b6c48dd5-t2hw9 1/1 Running 0 25s 192.168.171.77 worker
default nginx-deployment-66b6c48dd5-tz95g 1/1 Running 0 104s 192.168.171.68 worker
default nginx-deployment-66b6c48dd5-xw6rd 1/1 Running 0 25s 192.168.171.75 worker
default nginx-deployment-66b6c48dd5-zxmkm 1/1 Running 0 25s 192.168.171.78 worker

Thanks for your help.
Regards,
Yavuz
-
Hello,
I saw that you checked the firewall on the VMs themselves; is there one on the host that is limiting traffic between the VMs?
The gatekeeper-controller-manager running on the control plane is working, but the two on the worker node are not. This leads me to think the traffic they use to talk to each other is being blocked in some way. Have you worked with NetworkPolicies on this cluster yet? Those don't show up when you look at the OS firewall. It really seems like something between the two nodes is causing issues.
Does Wireshark show gatekeeper traffic leaving the worker and also being accepted on the control plane? Is there any chance there is an overlap between the 192.168 range and the VMs' addresses?
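For example, something like the following would confirm whether any NetworkPolicies exist and whether the probe traffic on port 9090 ever arrives at the pods (tcpdump on the worker, as a command-line alternative to Wireshark):
kubectl get networkpolicies -A
sudo tcpdump -ni any tcp port 9090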
Regards,
-
Hi, I've changed "--pod-network-cidr" to 10.0.0.33/16 and it is OK now.
I'm not sure whether it is a problem if the IP addresses of the nodes fall inside Calico's network CIDR range.
I mean, my node IP addresses are 192.168.20.225 and 192.168.20.232, which are in the 192.168.0.0/16 range.
Anyway, it seems the problem is resolved.
Thanks for your guidance.

Every 2.0s: kubectl get pods -A -o wide    argela: Sat Aug 7 10:31:41 2021

NAMESPACE           NAME                                            READY  STATUS   RESTARTS  AGE    IP              NODE    NOMINATED NODE  READINESS GATES
gatekeeper-system   gatekeeper-audit-54b5f86d57-pjcfd               1/1    Running  0         87s    10.0.171.65     worker  <none>          <none>
gatekeeper-system   gatekeeper-controller-manager-5b96bd668-dd2cq   1/1    Running  0         87s    10.0.132.68     argela  <none>          <none>
gatekeeper-system   gatekeeper-controller-manager-5b96bd668-pftv8   1/1    Running  0         87s    10.0.171.67     worker  <none>          <none>
gatekeeper-system   gatekeeper-controller-manager-5b96bd668-vnrj4   1/1    Running  0         87s    10.0.171.66     worker  <none>          <none>
kube-system         calico-kube-controllers-5f6cfd688c-6r48g        1/1    Running  0         5m7s   10.0.132.66     argela  <none>          <none>
kube-system         calico-node-4hr2r                               1/1    Running  0         3m8s   192.168.20.232  worker  <none>          <none>
kube-system         calico-node-lmkq6                               1/1    Running  0         5m7s   192.168.20.225  argela  <none>          <none>
kube-system         coredns-74ff55c5b-f6nl5                         1/1    Running  0         5m7s   10.0.132.67     argela  <none>          <none>
kube-system         coredns-74ff55c5b-hzc8f                         1/1    Running  0         5m7s   10.0.132.65     argela  <none>          <none>
kube-system         etcd-argela                                     1/1    Running  0         5m15s  192.168.20.225  argela  <none>          <none>
kube-system         kube-apiserver-argela                           1/1    Running  0         5m15s  192.168.20.225  argela  <none>          <none>
kube-system         kube-controller-manager-argela                  1/1    Running  0         5m15s  192.168.20.225  argela  <none>          <none>
kube-system         kube-proxy-h6wt2                                1/1    Running  0         3m8s   192.168.20.232  worker  <none>          <none>
kube-system         kube-proxy-z8nk9                                1/1    Running  0         5m7s   192.168.20.225  argela  <none>          <none>
kube-system         kube-scheduler-argela                           1/1    Running  0         5m15s  192.168.20.225  argela  <none>          <none>
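For anyone hitting the same overlap, the change boils down to rebuilding the cluster with a pod network CIDR that does not collide with the node/host network and keeping Calico's pool in sync with it; roughly like this (the range shown is only an illustration):
sudo kubeadm reset
sudo kubeadm init --pod-network-cidr=10.0.0.0/16
# edit calico.yaml so CALICO_IPV4POOL_CIDR matches the same range:
#   - name: CALICO_IPV4POOL_CIDR
#     value: "10.0.0.0/16"
kubectl apply -f calico.yaml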
Regards,
-
Hello,
Yes, your routing table was sending the egress worker traffic to the host rather than to the control plane.
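If you ever want to confirm that from the worker, checking which route is chosen for one of the control-plane pod addresses shows where the traffic is actually sent, for example (using one of the old 192.168.132.x pod IPs):
ip route
ip route get 192.168.132.69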
Regards,