
gatekeeper CrashLoopBackOff

Hi,
I set up two Ubuntu 18.04 servers, one as the master and one as the worker, as described in the tutorial.
When I run
kubectl create -f gatekeeper.yaml
the pods scheduled on the master node reach the Running state, but the ones scheduled on the worker node do not. Describe output for one of the failing pods shows:

State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       Error
  Exit Code:    2
  Started:      Wed, 04 Aug 2021 14:26:11 +0000
  Finished:     Wed, 04 Aug 2021 14:26:39 +0000
Ready:          False
Restart Count:  9
Limits:
  cpu:     1
  memory:  512Mi
Requests:
  cpu:      100m
  memory:   256Mi
Liveness:   http-get http://:9090/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness:  http-get http://:9090/readyz delay=0s timeout=1s period=10s #success=1 #failure=3

I deployed nginx to check whether there is a problem on the worker node, but the nginx pods on the worker are Running, so Calico itself does not seem to be the problem.
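
For the nginx check I used a plain deployment scaled across both nodes, roughly like this (the deployment name and replica count were arbitrary):

kubectl create deployment nginx-deployment --image=nginx
kubectl scale deployment nginx-deployment --replicas=5
kubectl get pods -o wide

The nginx pods that land on the worker go Running without restarts.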

Regards

Comments

  • serewicz Posts: 947

    Hello,

    We would need more information in order to troubleshoot the problem. Are all of your pods running? Are there any errors?

    Did you use the same lab setup when you were studying for the CKA? Any problems then? What is your VM provider setup: AWS, GCE, VirtualBox, VMware? Are you sure there is no firewall between the nodes?

    I'd use the troubleshooting skills from your CKA to narrow down the error; that would let me help find the fix.
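
    For example, the usual starting points (substitute the actual pod name) should narrow it down:

    kubectl get pods -A -o wide
    kubectl -n gatekeeper-system describe pod <failing-pod>
    kubectl -n gatekeeper-system logs <failing-pod>
    kubectl get events -A --sort-by=.metadata.creationTimestamp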

    Regards,

  • yavuz Posts: 6
    edited August 4

    Thanks for the fast reply.
    Here is the pod status:

    [email protected]:~/gk$ kubectl get pods -A -o wide
    NAMESPACE           NAME                                            READY   STATUS             RESTARTS   AGE   IP               NODE     NOMINATED NODE   READINESS GATES
    gatekeeper-system   gatekeeper-audit-54b5f86d57-nkz6q               0/1     Running            16         37m   192.168.171.72   worker   <none>           <none>
    gatekeeper-system   gatekeeper-controller-manager-5b96bd668-88scv   1/1     Running            0          37m   192.168.132.69   argela   <none>           <none>
    gatekeeper-system   gatekeeper-controller-manager-5b96bd668-f8z6j   0/1     CrashLoopBackOff   15         37m   192.168.171.73   worker   <none>           <none>
    gatekeeper-system   gatekeeper-controller-manager-5b96bd668-sdv2n   0/1     CrashLoopBackOff   15         37m   192.168.171.74   worker   <none>           <none>
    kube-system         calico-kube-controllers-5f6cfd688c-sl5d5        1/1     Running            0          9h    192.168.132.65   argela   <none>           <none>
    kube-system         calico-node-hsfbl                               1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
    kube-system         calico-node-m6qw2                               1/1     Running            0          8h    192.168.20.232   worker   <none>           <none>
    kube-system         coredns-74ff55c5b-dbtct                         1/1     Running            0          9h    192.168.132.66   argela   <none>           <none>
    kube-system         coredns-74ff55c5b-wkgdx                         1/1     Running            0          9h    192.168.132.67   argela   <none>           <none>
    kube-system         etcd-argela                                     1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
    kube-system         kube-apiserver-argela                           1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
    kube-system         kube-controller-manager-argela                  1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
    kube-system         kube-proxy-p5nlm                                1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
    kube-system         kube-proxy-pbp6r                                1/1     Running            0          8h    192.168.20.232   worker   <none>           <none>
    kube-system         kube-scheduler-argela                           1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
    
    

    It seems there is a problem with the readiness and liveness URL checks, but there are no logs from these pods.
    For example, the pod gatekeeper-audit-54b5f86d57-nkz6q is in "0/1 Running" status, but there is no log output to troubleshoot with:

    [email protected]:~/gk$ kubectl -n gatekeeper-system  logs  gatekeeper-audit-54b5f86d57-nkz6q
    [email protected]:~/gk$
    

    I've disabled the firewall. To be sure, I deployed nginx, and it ran successfully on both nodes. There is no ERROR line from Calico.

    Last State:     Terminated
          Reason:       Error
          Exit Code:    2
    

    I searched for exit code 2; one article says it indicates an error from the application itself.

    If you need any other details, I can provide them.
    This is a new setup for the CKS, on VMware.

    Thanks

  • serewicz Posts: 947

    Hello,

    From what I can see you have one running (gatekeeper-controller-manager-5b96bd668-88scv) and the other two in an error state.

    What is the size of your VM, in CPU and memory? When you ran kubectl create -f gatekeeper.yaml, did you see any errors?

    Are you sure there is no firewall blocking traffic between your VMs? The failed pods are all on your worker; what is the condition of that node? Does htop show it has enough resources?

    Use the same troubleshooting skills from your CKA to find the error so we can work out the fix.
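
    Something along these lines should show whether the worker itself is struggling and what the container reported before it died (the pod name is just an example):

    kubectl describe node worker                                 # Conditions and Allocated resources
    kubectl -n gatekeeper-system logs <failing-pod> --previous   # logs from the crashed container
    kubectl -n gatekeeper-system describe pod <failing-pod>      # Events section at the bottom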

    Regards,

  • yavuz Posts: 6

    Hi, OK, I'll dig into it some more, thank you.

    Yes, if a gatekeeper pod is scheduled on the master node it reaches the Running state with no problem; if it is scheduled on the worker node, CrashLoopBackOff occurs.

    Here is the resource status:

    master:

    [email protected]:~/gk$ free -m
                  total        used        free      shared  buff/cache   available
    Mem:           7976        1285        1679           2        5011        6742
    Swap:             0           0           0
    [email protected]:~/gk$ lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              3
    On-line CPU(s) list: 0-2
    Thread(s) per core:  1
    Core(s) per socket:  1
    Socket(s):           3
    NUMA node(s):        1
    Vendor ID:           GenuineIntel
    CPU family:          6
    Model:               85
    Model name:          Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
    Stepping:            4
    CPU MHz:             2294.609
    BogoMIPS:            4589.21
    Hypervisor vendor:   VMware
    Virtualization type: full
    L1d cache:           32K
    L1i cache:           32K
    L2 cache:            1024K
    L3 cache:            25344K
    NUMA node0 CPU(s):   0-2
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
    [email protected]:~/gk$
    
      1  [##**                                                                            2.7%]   Tasks: 92, 401 thr; 2 running
      2  [###**                                                                           4.6%]   Load average: 1.04 0.81 0.48 
      3  [###**                                                                           5.3%]   Uptime: 2 days, 07:48:25
      Mem[||||||||||||||####**************************************************     1.26G/7.79G]
      Swp[                                                                               0K/0K]
    

    worker:

      1  [##                                                                              2.0%]   Tasks: 60, 196 thr; 1 running
      2  [##**                                                                            2.7%]   Load average: 0.03 0.06 0.07 
      3  [##*                                                                             2.0%]   Uptime: 09:33:19
      Mem[|||||##***************                                                    446M/7.79G]
      Swp[                                                                               0K/0K]
    
    [email protected]:~$ free -m
                  total        used        free      shared  buff/cache   available
    Mem:           7976         468        5979           1        1528        7281
    Swap:             0           0           0
    [email protected]:~$ lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              3
    On-line CPU(s) list: 0-2
    Thread(s) per core:  1
    Core(s) per socket:  1
    Socket(s):           3
    NUMA node(s):        1
    Vendor ID:           GenuineIntel
    CPU family:          6
    Model:               85
    Model name:          Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
    Stepping:            4
    CPU MHz:             2294.609
    BogoMIPS:            4589.21
    Hypervisor vendor:   VMware
    Virtualization type: full
    L1d cache:           32K
    L1i cache:           32K
    L2 cache:            1024K
    L3 cache:            25344K
    NUMA node0 CPU(s):   0-2
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
    [email protected]:~$ 
    
  • serewicz Posts: 947

    Hello,

    Check your nodes and see if there is some sort of taint causing a failure on the worker. I'd also double-check that you don't have a firewall in the way between the nodes, or running on one of them. When you create a deployment running nginx and scale it up, do the pods run when scheduled on the worker?

    What other logs, events, and describe output do you find for pods running on your worker?
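
    For instance (the deployment name here is only an example):

    kubectl describe nodes | grep -i taint
    kubectl create deployment web-test --image=nginx && kubectl scale deployment web-test --replicas=6
    kubectl get pods -o wide                                         # do the replicas on the worker go Running?
    kubectl get events -n gatekeeper-system --sort-by=.lastTimestamp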

    Regards,

  • yavuz Posts: 6

    Hi again. There is a problem with the liveness and readiness checks, but I couldn't find the cause, as there is no application log.
    Here is the information from the system:

    [email protected]:~/gk$ kubectl logs  -n gatekeeper-system  gatekeeper-audit-54b5f86d57-c4fpm  
    [email protected]:~/gk$ 
    

    Events:

    gatekeeper-system   0s          Normal    Pulled                    pod/gatekeeper-audit-54b5f86d57-c4fpm                Successfully pulled image "openpolicyagent/gatekeeper:v3.3.0" in 1.680295184s
    gatekeeper-system   0s          Normal    Created                   pod/gatekeeper-audit-54b5f86d57-c4fpm                Created container manager
    gatekeeper-system   0s          Normal    Started                   pod/gatekeeper-audit-54b5f86d57-c4fpm                Started container manager
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-controller-manager-5b96bd668-c8vfp    Liveness probe failed: Get "http://192.168.171.83:9090/healthz": dial tcp 192.168.171.83:9090: connect: connection refused
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-controller-manager-5b96bd668-c8vfp    Readiness probe failed: Get "http://192.168.171.83:9090/readyz": dial tcp 192.168.171.83:9090: connect: connection refused
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-audit-54b5f86d57-c4fpm                Liveness probe failed: Get "http://192.168.171.84:9090/healthz": dial tcp 192.168.171.84:9090: connect: connection refused
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-controller-manager-5b96bd668-5sd92    Liveness probe failed: Get "http://192.168.171.82:9090/healthz": dial tcp 192.168.171.82:9090: connect: connection refused
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-controller-manager-5b96bd668-5sd92    Readiness probe failed: Get "http://192.168.171.82:9090/readyz": dial tcp 192.168.171.82:9090: connect: connection refused
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-audit-54b5f86d57-c4fpm                Readiness probe failed: Get "http://192.168.171.84:9090/readyz": dial tcp 192.168.171.84:9090: connect: connection refused
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-controller-manager-5b96bd668-c8vfp    Liveness probe failed: Get "http://192.168.171.83:9090/healthz": dial tcp 192.168.171.83:9090: connect: connection refused
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-controller-manager-5b96bd668-c8vfp    Readiness probe failed: Get "http://192.168.171.83:9090/readyz": dial tcp 192.168.171.83:9090: connect: connection refused
    

    Other information

    [email protected]:~$ kubectl describe nodes | grep -i Taint
    Taints:
    Taints:

    master:
    [email protected]:~$ systemctl status ufw
    ufw.service - Uncomplicated firewall
    Loaded: loaded (/lib/systemd/system/ufw.service; disabled; vendor preset: enabled)
    Active: inactive (dead)
    Docs: man:ufw(8)
    [email protected]:~$ systemctl status apparmor.service
    ● apparmor.service - AppArmor initialization
    Loaded: loaded (/lib/systemd/system/apparmor.service; disabled; vendor preset: enabled)
    Active: inactive (dead)
    Docs: man:apparmor(7)
    http://wiki.apparmor.net/
    [email protected]:~$

    worker:
    [email protected]:~$ systemctl status ufw
    ● ufw.service - Uncomplicated firewall
    Loaded: loaded (/lib/systemd/system/ufw.service; disabled; vendor preset: enabled)
    Active: inactive (dead)
    Docs: man:ufw(8)
    [email protected]:~$ systemctl status apparmor.service
    ● apparmor.service - AppArmor initialization
    Loaded: loaded (/lib/systemd/system/apparmor.service; disabled; vendor preset: enabled)
    Active: inactive (dead)
    Docs: man:apparmor(7)
    http://wiki.apparmor.net/

    nginx deployment (scheduled on both nodes):

    NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES

    default nginx-deployment-66b6c48dd5-266gb 1/1 Running 0 5s 192.168.132.75 argela
    default nginx-deployment-66b6c48dd5-724gh 1/1 Running 0 25s 192.168.171.76 worker
    default nginx-deployment-66b6c48dd5-8fp49 1/1 Running 0 25s 192.168.132.73 argela
    default nginx-deployment-66b6c48dd5-8hfwn 1/1 Running 0 35s 192.168.171.73 worker
    default nginx-deployment-66b6c48dd5-cdgtf 1/1 Running 0 104s 192.168.171.69 worker
    default nginx-deployment-66b6c48dd5-csfcw 1/1 Running 0 35s 192.168.171.74 worker
    default nginx-deployment-66b6c48dd5-d6rml 1/1 Running 0 5s 192.168.132.74 argela
    default nginx-deployment-66b6c48dd5-f8gvk 1/1 Running 0 55s 192.168.171.71 worker
    default nginx-deployment-66b6c48dd5-lbjqc 1/1 Running 0 55s 192.168.171.72 worker
    default nginx-deployment-66b6c48dd5-mlgj4 1/1 Running 0 104s 192.168.171.70 worker
    default nginx-deployment-66b6c48dd5-pw87h 1/1 Running 0 5s 192.168.171.79 worker
    default nginx-deployment-66b6c48dd5-t2hw9 1/1 Running 0 25s 192.168.171.77 worker
    default nginx-deployment-66b6c48dd5-tz95g 1/1 Running 0 104s 192.168.171.68 worker
    default nginx-deployment-66b6c48dd5-xw6rd 1/1 Running 0 25s 192.168.171.75 worker
    default nginx-deployment-66b6c48dd5-zxmkm 1/1 Running 0 25s 192.168.171.78 worker

    Thanks for your help.
    Regards,
    Yavuz

  • serewicz Posts: 947

    Hello,

    I see that you checked the firewall on the VMs themselves; is there one on the host that is limiting traffic between the VMs?

    The gatekeeper-controller-manager running on the control plane is working, but the two on the worker node are not. This leads me to think the traffic they use to talk to each other is being blocked in some way. Have you worked with NetworkPolicies on this cluster yet? Those don't show up when you look at the OS firewall. It really seems like something between the two nodes is causing issues.

    Does Wireshark show gatekeeper traffic leaving the worker and being accepted on the control plane? Is there any chance of overlap between the 192.168 pod network range and the VMs' own addresses?
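
    A couple of quick checks that might help here (output will depend on your setup):

    kubectl get networkpolicies -A
    kubectl get nodes -o wide                                                    # node IP addresses
    kubectl cluster-info dump | grep -m1 -- --cluster-cidr                       # pod network range handed to kubeadm
    kubectl get ippools.crd.projectcalico.org -o yaml 2>/dev/null | grep cidr    # Calico pool, if the CRD is present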

    Regards,

  • yavuz Posts: 6

    Hi, I've changed "--pod-network-cidr" to 10.0.0.33/16 and it is OK now.
    I'm not sure: is it a problem if the IP addresses of the nodes fall inside Calico's pod network CIDR range?
    I mean, my node IP addresses are 192.168.20.225 and .232, and they are in the 192.168.0.0/16 range.
    Anyway, it seems the problem is resolved.
    Thanks for your guidance.
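
    Roughly what I did to rebuild the cluster with the new range (just a sketch; your Calico manifest and exact CIDR may differ):

    # on both nodes, tear down the old cluster first
    sudo kubeadm reset
    # on the master, re-init with a pod CIDR that does not overlap the node IPs (192.168.20.x)
    sudo kubeadm init --pod-network-cidr=10.0.0.0/16
    # re-apply Calico; if the pool does not match, set CALICO_IPV4POOL_CIDR in calico.yaml to the same range
    kubectl apply -f calico.yaml
    # then run the printed kubeadm join command on the worker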

    Every 2.0s: kubectl get pods -A -o wide                                                                                                                                                                      argela: Sat Aug  7 10:31:41 2021
    
    NAMESPACE           NAME                                            READY   STATUS    RESTARTS   AGE     IP               NODE     NOMINATED NODE   READINESS GATES          
    gatekeeper-system   gatekeeper-audit-54b5f86d57-pjcfd               1/1     Running   0          87s     10.0.171.65      worker   <none>           <none>          
    gatekeeper-system   gatekeeper-controller-manager-5b96bd668-dd2cq   1/1     Running   0          87s     10.0.132.68      argela   <none>           <none>          
    gatekeeper-system   gatekeeper-controller-manager-5b96bd668-pftv8   1/1     Running   0          87s     10.0.171.67      worker   <none>           <none>          
    gatekeeper-system   gatekeeper-controller-manager-5b96bd668-vnrj4   1/1     Running   0          87s     10.0.171.66      worker   <none>           <none>          
    kube-system         calico-kube-controllers-5f6cfd688c-6r48g        1/1     Running   0          5m7s    10.0.132.66      argela   <none>           <none>          
    kube-system         calico-node-4hr2r                               1/1     Running   0          3m8s    192.168.20.232   worker   <none>           <none>          
    kube-system         calico-node-lmkq6                               1/1     Running   0          5m7s    192.168.20.225   argela   <none>           <none>          
    kube-system         coredns-74ff55c5b-f6nl5                         1/1     Running   0          5m7s    10.0.132.67      argela   <none>           <none>          
    kube-system         coredns-74ff55c5b-hzc8f                         1/1     Running   0          5m7s    10.0.132.65      argela   <none>           <none>          
    kube-system         etcd-argela                                     1/1     Running   0          5m15s   192.168.20.225   argela   <none>           <none>          
    kube-system         kube-apiserver-argela                           1/1     Running   0          5m15s   192.168.20.225   argela   <none>           <none>          
    kube-system         kube-controller-manager-argela                  1/1     Running   0          5m15s   192.168.20.225   argela   <none>           <none>          
    kube-system         kube-proxy-h6wt2                                1/1     Running   0          3m8s    192.168.20.232   worker   <none>           <none>          
    kube-system         kube-proxy-z8nk9                                1/1     Running   0          5m7s    192.168.20.225   argela   <none>           <none>          
    kube-system         kube-scheduler-argela                           1/1     Running   0          5m15s   192.168.20.225   argela   <none>           <none>          
    
    

    regards,

  • serewicz Posts: 947

    Hello,

    Yes, your routing table was sending the worker's egress traffic to the host instead of to the control plane.
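
    A quick way to spot that kind of overlap on a node is to compare the routes Calico installs with the node's own addresses, for example:

    ip route | grep 192.168    # Calico pod routes alongside the node's own 192.168.20.x network
    ip -br addr                # node interface addresses for comparison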

    Regards,
