
gatekeeper CrashLoopBackOff

Hi,
I set up two Ubuntu 18.04 servers, one as the master and one as the worker, as described in the tutorial.
When I run
kubectl create -f gatekeeper.yaml
the pods scheduled on the master node reach the Running state, but the ones scheduled on the worker node do not. Describe output for one of the failing pods shows:

State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       Error
  Exit Code:    2
  Started:      Wed, 04 Aug 2021 14:26:11 +0000
  Finished:     Wed, 04 Aug 2021 14:26:39 +0000
Ready:          False
Restart Count:  9
Limits:
  cpu:     1
  memory:  512Mi
Requests:
  cpu:      100m
  memory:   256Mi
Liveness:   http-get http://:9090/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness:  http-get http://:9090/readyz delay=0s timeout=1s period=10s #success=1 #failure=3

I deployed nginx to check whether there is a problem on the worker node, but the nginx pods on the worker are Running, so Calico itself does not seem to be the problem.
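
For the nginx check I used a plain deployment scaled across both nodes, roughly like this (the deployment name and replica count were arbitrary):

kubectl create deployment nginx-deployment --image=nginx
kubectl scale deployment nginx-deployment --replicas=5
kubectl get pods -o wide

The nginx pods that land on the worker go Running without restarts.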

Regards

Comments

  • serewicz Posts: 947

    Hello,

    We would need more information in order to troubleshoot the problem. Are all of your pods running? Are there any errors?

    Did you use the same lab setup when you were studying for the CKA? Any problems then? What is your VM provider setup: AWS, GCE, VirtualBox, VMware? Are you sure there is no firewall between the nodes?

    I'd use the troubleshooting skills from your CKA to narrow down the error; that would let me help find the fix.
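
    For example, the usual starting points (substitute the actual pod name) should narrow it down:

    kubectl get pods -A -o wide
    kubectl -n gatekeeper-system describe pod <failing-pod>
    kubectl -n gatekeeper-system logs <failing-pod>
    kubectl get events -A --sort-by=.metadata.creationTimestamp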

    Regards,

  • yavuz Posts: 6
    edited August 4

    Thanks for the fast reply.
    Here is the pod status:

    [email protected]:~/gk$ kubectl get pods -A -o wide
    NAMESPACE           NAME                                            READY   STATUS             RESTARTS   AGE   IP               NODE     NOMINATED NODE   READINESS GATES
    gatekeeper-system   gatekeeper-audit-54b5f86d57-nkz6q               0/1     Running            16         37m   192.168.171.72   worker   <none>           <none>
    gatekeeper-system   gatekeeper-controller-manager-5b96bd668-88scv   1/1     Running            0          37m   192.168.132.69   argela   <none>           <none>
    gatekeeper-system   gatekeeper-controller-manager-5b96bd668-f8z6j   0/1     CrashLoopBackOff   15         37m   192.168.171.73   worker   <none>           <none>
    gatekeeper-system   gatekeeper-controller-manager-5b96bd668-sdv2n   0/1     CrashLoopBackOff   15         37m   192.168.171.74   worker   <none>           <none>
    kube-system         calico-kube-controllers-5f6cfd688c-sl5d5        1/1     Running            0          9h    192.168.132.65   argela   <none>           <none>
    kube-system         calico-node-hsfbl                               1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
    kube-system         calico-node-m6qw2                               1/1     Running            0          8h    192.168.20.232   worker   <none>           <none>
    kube-system         coredns-74ff55c5b-dbtct                         1/1     Running            0          9h    192.168.132.66   argela   <none>           <none>
    kube-system         coredns-74ff55c5b-wkgdx                         1/1     Running            0          9h    192.168.132.67   argela   <none>           <none>
    kube-system         etcd-argela                                     1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
    kube-system         kube-apiserver-argela                           1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
    kube-system         kube-controller-manager-argela                  1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
    kube-system         kube-proxy-p5nlm                                1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
    kube-system         kube-proxy-pbp6r                                1/1     Running            0          8h    192.168.20.232   worker   <none>           <none>
    kube-system         kube-scheduler-argela                           1/1     Running            0          9h    192.168.20.225   argela   <none>           <none>
    
    

    It seems there is a problem with the readiness and liveness URL checks, but there are no logs from these pods.
    For example, the pod gatekeeper-audit-54b5f86d57-nkz6q is in "0/1 Running" status, but there is no log output to troubleshoot with:

    [email protected]:~/gk$ kubectl -n gatekeeper-system  logs  gatekeeper-audit-54b5f86d57-nkz6q
    [email protected]:~/gk$
    

    I've disabled the firewall. To be sure, I deployed nginx, and it ran successfully on both nodes. There is no ERROR line from Calico.

    Last State:     Terminated
          Reason:       Error
          Exit Code:    2
    

    I searched for exit code 2; one article says it indicates an error from the application itself.

    If you need any other details, I can provide them.
    This is a new setup for the CKS, on VMware.

    Thanks

  • serewicz Posts: 947

    Hello,

    From what I can see you have one running (gatekeeper-controller-manager-5b96bd668-88scv) and the other two in an error state.

    What is the size of your VM, in CPU and memory? When you ran kubectl create -f gatekeeper.yaml, did you see any errors?

    Are you sure there is no firewall blocking traffic between your VMs? The failed pods are all on your worker; what is the condition of that node? Does htop show it has enough resources?

    Use the same troubleshooting skills from your CKA to find the error so we can work out the fix.
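
    Something along these lines should show whether the worker itself is struggling and what the container reported before it died (the pod name is just an example):

    kubectl describe node worker                                 # Conditions and Allocated resources
    kubectl -n gatekeeper-system logs <failing-pod> --previous   # logs from the crashed container
    kubectl -n gatekeeper-system describe pod <failing-pod>      # Events section at the bottom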

    Regards,

  • yavuz Posts: 6

    Hi, OK, I'll dig into it some more, thank you.

    Yes, if a gatekeeper pod is scheduled on the master node it reaches the Running state with no problem; if it is scheduled on the worker node, CrashLoopBackOff occurs.

    Here is the resource status:

    master:

    [email protected]:~/gk$ free -m
                  total        used        free      shared  buff/cache   available
    Mem:           7976        1285        1679           2        5011        6742
    Swap:             0           0           0
    [email protected]:~/gk$ lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              3
    On-line CPU(s) list: 0-2
    Thread(s) per core:  1
    Core(s) per socket:  1
    Socket(s):           3
    NUMA node(s):        1
    Vendor ID:           GenuineIntel
    CPU family:          6
    Model:               85
    Model name:          Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
    Stepping:            4
    CPU MHz:             2294.609
    BogoMIPS:            4589.21
    Hypervisor vendor:   VMware
    Virtualization type: full
    L1d cache:           32K
    L1i cache:           32K
    L2 cache:            1024K
    L3 cache:            25344K
    NUMA node0 CPU(s):   0-2
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
    [email protected]:~/gk$
    
      1  [##**                                                                            2.7%]   Tasks: 92, 401 thr; 2 running
      2  [###**                                                                           4.6%]   Load average: 1.04 0.81 0.48 
      3  [###**                                                                           5.3%]   Uptime: 2 days, 07:48:25
      Mem[||||||||||||||####**************************************************     1.26G/7.79G]
      Swp[                                                                               0K/0K]
    

    worker:

      1  [##                                                                              2.0%]   Tasks: 60, 196 thr; 1 running
      2  [##**                                                                            2.7%]   Load average: 0.03 0.06 0.07 
      3  [##*                                                                             2.0%]   Uptime: 09:33:19
      Mem[|||||##***************                                                    446M/7.79G]
      Swp[                                                                               0K/0K]
    
    [email protected]:~$ free -m
                  total        used        free      shared  buff/cache   available
    Mem:           7976         468        5979           1        1528        7281
    Swap:             0           0           0
    [email protected]:~$ lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              3
    On-line CPU(s) list: 0-2
    Thread(s) per core:  1
    Core(s) per socket:  1
    Socket(s):           3
    NUMA node(s):        1
    Vendor ID:           GenuineIntel
    CPU family:          6
    Model:               85
    Model name:          Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
    Stepping:            4
    CPU MHz:             2294.609
    BogoMIPS:            4589.21
    Hypervisor vendor:   VMware
    Virtualization type: full
    L1d cache:           32K
    L1i cache:           32K
    L2 cache:            1024K
    L3 cache:            25344K
    NUMA node0 CPU(s):   0-2
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
    [email protected]:~$ 
    
  • serewicz Posts: 947

    Hello,

    Check your nodes and see if there is some sort of taint causing a failure on the worker. I'd also double-check that you don't have a firewall in the way between the nodes, or running on one of them. When you create a deployment running nginx and scale it up, do the pods run when scheduled on the worker?

    What other logs, events, and describe output do you find for pods running on your worker?
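
    For instance (the deployment name here is only an example):

    kubectl describe nodes | grep -i taint
    kubectl create deployment web-test --image=nginx && kubectl scale deployment web-test --replicas=6
    kubectl get pods -o wide                                         # do the replicas on the worker go Running?
    kubectl get events -n gatekeeper-system --sort-by=.lastTimestamp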

    Regards,

  • yavuz Posts: 6

    Hi again. There is a problem with the liveness and readiness checks, but I couldn't find the cause, as there is no application log.
    Here is the information from the system:

    [email protected]:~/gk$ kubectl logs  -n gatekeeper-system  gatekeeper-audit-54b5f86d57-c4fpm  
    [email protected]:~/gk$ 
    

    Events:

    gatekeeper-system   0s          Normal    Pulled                    pod/gatekeeper-audit-54b5f86d57-c4fpm                Successfully pulled image "openpolicyagent/gatekeeper:v3.3.0" in 1.680295184s
    gatekeeper-system   0s          Normal    Created                   pod/gatekeeper-audit-54b5f86d57-c4fpm                Created container manager
    gatekeeper-system   0s          Normal    Started                   pod/gatekeeper-audit-54b5f86d57-c4fpm                Started container manager
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-controller-manager-5b96bd668-c8vfp    Liveness probe failed: Get "http://192.168.171.83:9090/healthz": dial tcp 192.168.171.83:9090: connect: connection refused
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-controller-manager-5b96bd668-c8vfp    Readiness probe failed: Get "http://192.168.171.83:9090/readyz": dial tcp 192.168.171.83:9090: connect: connection refused
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-audit-54b5f86d57-c4fpm                Liveness probe failed: Get "http://192.168.171.84:9090/healthz": dial tcp 192.168.171.84:9090: connect: connection refused
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-controller-manager-5b96bd668-5sd92    Liveness probe failed: Get "http://192.168.171.82:9090/healthz": dial tcp 192.168.171.82:9090: connect: connection refused
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-controller-manager-5b96bd668-5sd92    Readiness probe failed: Get "http://192.168.171.82:9090/readyz": dial tcp 192.168.171.82:9090: connect: connection refused
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-audit-54b5f86d57-c4fpm                Readiness probe failed: Get "http://192.168.171.84:9090/readyz": dial tcp 192.168.171.84:9090: connect: connection refused
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-controller-manager-5b96bd668-c8vfp    Liveness probe failed: Get "http://192.168.171.83:9090/healthz": dial tcp 192.168.171.83:9090: connect: connection refused
    gatekeeper-system   0s          Warning   Unhealthy                 pod/gatekeeper-controller-manager-5b96bd668-c8vfp    Readiness probe failed: Get "http://192.168.171.83:9090/readyz": dial tcp 192.168.171.83:9090: connect: connection refused
    

    Other information

    [email protected]:~$ kubectl describe nodes | grep -i Taint
    Taints:
    Taints:

    master:
    [email protected]:~$ systemctl status ufw
    ufw.service - Uncomplicated firewall
    Loaded: loaded (/lib/systemd/system/ufw.service; disabled; vendor preset: enabled)
    Active: inactive (dead)
    Docs: man:ufw(8)
    [email protected]:~$ systemctl status apparmor.service
    ● apparmor.service - AppArmor initialization
    Loaded: loaded (/lib/systemd/system/apparmor.service; disabled; vendor preset: enabled)
    Active: inactive (dead)
    Docs: man:apparmor(7)
    http://wiki.apparmor.net/
    [email protected]:~$

    worker:
    [email protected]:~$ systemctl status ufw
    ● ufw.service - Uncomplicated firewall
    Loaded: loaded (/lib/systemd/system/ufw.service; disabled; vendor preset: enabled)
    Active: inactive (dead)
    Docs: man:ufw(8)
    [email protected]:~$ systemctl status apparmor.service
    ● apparmor.service - AppArmor initialization
    Loaded: loaded (/lib/systemd/system/apparmor.service; disabled; vendor preset: enabled)
    Active: inactive (dead)
    Docs: man:apparmor(7)
    http://wiki.apparmor.net/

    nginx deployment (scheduled on both nodes):

    NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES

    default nginx-deployment-66b6c48dd5-266gb 1/1 Running 0 5s 192.168.132.75 argela
    default nginx-deployment-66b6c48dd5-724gh 1/1 Running 0 25s 192.168.171.76 worker
    default nginx-deployment-66b6c48dd5-8fp49 1/1 Running 0 25s 192.168.132.73 argela
    default nginx-deployment-66b6c48dd5-8hfwn 1/1 Running 0 35s 192.168.171.73 worker
    default nginx-deployment-66b6c48dd5-cdgtf 1/1 Running 0 104s 192.168.171.69 worker
    default nginx-deployment-66b6c48dd5-csfcw 1/1 Running 0 35s 192.168.171.74 worker
    default nginx-deployment-66b6c48dd5-d6rml 1/1 Running 0 5s 192.168.132.74 argela
    default nginx-deployment-66b6c48dd5-f8gvk 1/1 Running 0 55s 192.168.171.71 worker
    default nginx-deployment-66b6c48dd5-lbjqc 1/1 Running 0 55s 192.168.171.72 worker
    default nginx-deployment-66b6c48dd5-mlgj4 1/1 Running 0 104s 192.168.171.70 worker
    default nginx-deployment-66b6c48dd5-pw87h 1/1 Running 0 5s 192.168.171.79 worker
    default nginx-deployment-66b6c48dd5-t2hw9 1/1 Running 0 25s 192.168.171.77 worker
    default nginx-deployment-66b6c48dd5-tz95g 1/1 Running 0 104s 192.168.171.68 worker
    default nginx-deployment-66b6c48dd5-xw6rd 1/1 Running 0 25s 192.168.171.75 worker
    default nginx-deployment-66b6c48dd5-zxmkm 1/1 Running 0 25s 192.168.171.78 worker

    Thanks for your help.
    Regards,
    Yavuz

  • serewicz Posts: 947

    Hello,

    I see that you checked the firewall on the VMs themselves; is there one on the host that is limiting traffic between the VMs?

    The gatekeeper-controller-manager running on the control plane is working, but the two on the worker node are not. This leads me to think the traffic they use to talk to each other is being blocked in some way. Have you worked with NetworkPolicies on this cluster yet? Those don't show up when you look at the OS firewall. It really seems like something between the two nodes is causing issues.

    Does Wireshark show gatekeeper traffic leaving the worker and being accepted on the control plane? Is there any chance of overlap between the 192.168 pod network range and the VMs' own addresses?
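
    A couple of quick checks that might help here (output will depend on your setup):

    kubectl get networkpolicies -A
    kubectl get nodes -o wide                                                    # node IP addresses
    kubectl cluster-info dump | grep -m1 -- --cluster-cidr                       # pod network range handed to kubeadm
    kubectl get ippools.crd.projectcalico.org -o yaml 2>/dev/null | grep cidr    # Calico pool, if the CRD is present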

    Regards,

  • yavuz Posts: 6

    Hi, I've changed "--pod-network-cidr" to 10.0.0.33/16 and it is OK now.
    I'm not sure: is it a problem if the IP addresses of the nodes fall inside Calico's pod network CIDR range?
    I mean, my node IP addresses are 192.168.20.225 and .232, and they are in the 192.168.0.0/16 range.
    Anyway, it seems the problem is resolved.
    Thanks for your guidance.
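
    Roughly what I did to rebuild the cluster with the new range (just a sketch; your Calico manifest and exact CIDR may differ):

    # on both nodes, tear down the old cluster first
    sudo kubeadm reset
    # on the master, re-init with a pod CIDR that does not overlap the node IPs (192.168.20.x)
    sudo kubeadm init --pod-network-cidr=10.0.0.0/16
    # re-apply Calico; if the pool does not match, set CALICO_IPV4POOL_CIDR in calico.yaml to the same range
    kubectl apply -f calico.yaml
    # then run the printed kubeadm join command on the worker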

    Every 2.0s: kubectl get pods -A -o wide                                                                                                                                                                      argela: Sat Aug  7 10:31:41 2021
    
    NAMESPACE           NAME                                            READY   STATUS    RESTARTS   AGE     IP               NODE     NOMINATED NODE   READINESS GATES          
    gatekeeper-system   gatekeeper-audit-54b5f86d57-pjcfd               1/1     Running   0          87s     10.0.171.65      worker   <none>           <none>          
    gatekeeper-system   gatekeeper-controller-manager-5b96bd668-dd2cq   1/1     Running   0          87s     10.0.132.68      argela   <none>           <none>          
    gatekeeper-system   gatekeeper-controller-manager-5b96bd668-pftv8   1/1     Running   0          87s     10.0.171.67      worker   <none>           <none>          
    gatekeeper-system   gatekeeper-controller-manager-5b96bd668-vnrj4   1/1     Running   0          87s     10.0.171.66      worker   <none>           <none>          
    kube-system         calico-kube-controllers-5f6cfd688c-6r48g        1/1     Running   0          5m7s    10.0.132.66      argela   <none>           <none>          
    kube-system         calico-node-4hr2r                               1/1     Running   0          3m8s    192.168.20.232   worker   <none>           <none>          
    kube-system         calico-node-lmkq6                               1/1     Running   0          5m7s    192.168.20.225   argela   <none>           <none>          
    kube-system         coredns-74ff55c5b-f6nl5                         1/1     Running   0          5m7s    10.0.132.67      argela   <none>           <none>          
    kube-system         coredns-74ff55c5b-hzc8f                         1/1     Running   0          5m7s    10.0.132.65      argela   <none>           <none>          
    kube-system         etcd-argela                                     1/1     Running   0          5m15s   192.168.20.225   argela   <none>           <none>          
    kube-system         kube-apiserver-argela                           1/1     Running   0          5m15s   192.168.20.225   argela   <none>           <none>          
    kube-system         kube-controller-manager-argela                  1/1     Running   0          5m15s   192.168.20.225   argela   <none>           <none>          
    kube-system         kube-proxy-h6wt2                                1/1     Running   0          3m8s    192.168.20.232   worker   <none>           <none>          
    kube-system         kube-proxy-z8nk9                                1/1     Running   0          5m7s    192.168.20.225   argela   <none>           <none>          
    kube-system         kube-scheduler-argela                           1/1     Running   0          5m15s   192.168.20.225   argela   <none>           <none>          
    
    

    regards,

  • serewicz Posts: 947

    Hello,

    Yes, your routing table was sending the worker's egress traffic to the host instead of to the control plane.
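
    A quick way to spot that kind of overlap on a node is to compare the routes Calico installs with the node's own addresses, for example:

    ip route | grep 192.168    # Calico pod routes alongside the node's own 192.168.20.x network
    ip -br addr                # node interface addresses for comparison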

    Regards,
