
Problem with calico-kube-controller (Lab 4.1)

Hi.

After upgrading the cp node successfully, I proceeded to upgrade the worker node and got this error:

  jose@k8scp:~$ kubectl drain k8swrk --ignore-daemonsets
  node/k8swrk already cordoned
  WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j8ntn, kube-system/kube-proxy-2gl5s
  evicting pod kube-system/calico-kube-controllers-6b9fbfff44-4lmlh
  error when evicting pods/"calico-kube-controllers-6b9fbfff44-4lmlh" -n "kube-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
  evicting pod kube-system/calico-kube-controllers-6b9fbfff44-4lmlh
  error when evicting pods/"calico-kube-controllers-6b9fbfff44-4lmlh" -n "kube-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
  evicting pod kube-system/calico-kube-controllers-6b9fbfff44-4lmlh
  error when evicting pods/"calico-kube-controllers-6b9fbfff44-4lmlh" -n "kube-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
  ^C
  .........
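
The eviction is blocked by a PodDisruptionBudget. A minimal way to see which budget is involved (the PDB name below is a guess based on the lab's Calico manifest; check the first command's output for the real one):

  # List PodDisruptionBudgets in all namespaces:
  kubectl get pdb -A
  # Inspect the one covering the Calico controller pod:
  kubectl -n kube-system describe pdb calico-kube-controllers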

Until that point the lab had been going well, and the upgrade process on the cp node went fine:

  jose@k8scp:~$ kubectl get node
  NAME STATUS ROLES AGE VERSION
  k8scp Ready control-plane,master 39h v1.22.1
  k8swrk Ready,SchedulingDisabled <none> 39h v1.21.1

Then I uncordoned k8swrk and:

  jose@k8scp:~$ kubectl get nodes
  NAME STATUS ROLES AGE VERSION
  k8scp Ready control-plane,master 42h v1.22.1
  k8swrk Ready <none> 41h v1.22.1

I ignored the issue because everything seemed to work and I didn't notice any problem with my installation. But after continuing with Lab 4.2, the output of some commands worried me, for example:

  jose@k8scp:~$ kubectl -n kube-system get pods -o wide
  NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
  calico-kube-controllers-6b9fbfff44-4lmlh 0/1 CrashLoopBackOff 58 (2m52s ago) 3h13m 192.168.164.138 k8swrk <none> <none>
  calico-node-j8ntn 1/1 Running 5 (82m ago) 41h 192.168.122.3 k8swrk <none> <none>
  calico-node-tnffg 1/1 Running 5 41h 192.168.122.2 k8scp <none> <none>
  coredns-78fcd69978-hz2kl 1/1 Running 2 (82m ago) 173m 192.168.74.146 k8scp <none> <none>
  coredns-78fcd69978-mczhs 1/1 Running 2 (82m ago) 173m 192.168.74.147 k8scp <none> <none>
  <omitted>

As can be seen, the calico-kube-controllers pod is in CrashLoopBackOff status, and I suspect that is not a good sign.

What is wrong with that?

I tried kubectl drain again but with the same results.
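
In case it helps, a minimal sketch of the commands that should show why the pod keeps crashing (using the pod name from the output above):

  # Events often explain the back-off and any probe failures:
  kubectl -n kube-system describe pod calico-kube-controllers-6b9fbfff44-4lmlh
  # Logs from the current and the previously crashed container:
  kubectl -n kube-system logs calico-kube-controllers-6b9fbfff44-4lmlh
  kubectl -n kube-system logs calico-kube-controllers-6b9fbfff44-4lmlh --previous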

Comments

  • Posts: 19
    edited December 2021

    I'm redoing Lab 4.1 from a snapshot of my VMs and, after upgrading the cp node, at step 15 I noticed that the problem is there again. I issued kubectl uncordon k8scp but that didn't help. I'm appending the full output for better debugging:

    jose@k8scp:~$ kubectl get node
    NAME STATUS ROLES AGE VERSION
    k8scp Ready,SchedulingDisabled control-plane,master 46h v1.22.1
    k8swrk Ready <none> 46h v1.21.1
    jose@k8scp:~$ kubectl -n kube-system get pods -o wide
    NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    calico-kube-controllers-6b9fbfff44-cwk6z 0/1 CrashLoopBackOff 6 (2m53s ago) 10m 192.168.164.131 k8swrk <none> <none>
    calico-node-j8ntn 1/1 Running 2 (23h ago) 46h 192.168.122.3 k8swrk <none> <none>
    calico-node-tnffg 1/1 Running 2 (23h ago) 46h 192.168.122.2 k8scp <none> <none>
    coredns-558bd4d5db-d22ht 0/1 Running 0 10m 192.168.164.130 k8swrk <none> <none>
    coredns-78fcd69978-87kzd 0/1 Running 0 5m30s 192.168.164.134 k8swrk <none> <none>
    coredns-78fcd69978-sbzck 0/1 Running 0 5m30s 192.168.164.135 k8swrk <none> <none>
    etcd-k8scp 1/1 Running 0 7m6s 192.168.122.2 k8scp <none> <none>
    kube-apiserver-k8scp 1/1 Running 0 6m22s 192.168.122.2 k8scp <none> <none>
    kube-controller-manager-k8scp 1/1 Running 0 5m59s 192.168.122.2 k8scp <none> <none>
    kube-proxy-bxbqz 1/1 Running 0 5m24s 192.168.122.3 k8swrk <none> <none>
    kube-proxy-jmt4k 1/1 Running 0 4m57s 192.168.122.2 k8scp <none> <none>
    kube-scheduler-k8scp 1/1 Running 0 5m45s 192.168.122.2 k8scp <none> <none>
    jose@k8scp:~$ kubectl uncordon k8scp
    node/k8scp uncordoned
    jose@k8scp:~$ kubectl get node
    NAME STATUS ROLES AGE VERSION
    k8scp Ready control-plane,master 46h v1.22.1
    k8swrk Ready <none> 46h v1.21.1
    jose@k8scp:~$ kubectl -n kube-system get pods -o wide
    NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    calico-kube-controllers-6b9fbfff44-cwk6z 0/1 Running 7 (5m14s ago) 12m 192.168.164.131 k8swrk <none> <none>
    calico-node-j8ntn 1/1 Running 2 (24h ago) 46h 192.168.122.3 k8swrk <none> <none>
    calico-node-tnffg 1/1 Running 2 (24h ago) 46h 192.168.122.2 k8scp <none> <none>
    coredns-558bd4d5db-d22ht 0/1 Running 0 12m 192.168.164.130 k8swrk <none> <none>
    coredns-78fcd69978-87kzd 0/1 Running 0 7m51s 192.168.164.134 k8swrk <none> <none>
    coredns-78fcd69978-sbzck 0/1 Running 0 7m51s 192.168.164.135 k8swrk <none> <none>
    etcd-k8scp 1/1 Running 0 9m27s 192.168.122.2 k8scp <none> <none>
    kube-apiserver-k8scp 1/1 Running 0 8m43s 192.168.122.2 k8scp <none> <none>
    kube-controller-manager-k8scp 1/1 Running 0 8m20s 192.168.122.2 k8scp <none> <none>
    kube-proxy-bxbqz 1/1 Running 0 7m45s 192.168.122.3 k8swrk <none> <none>
    kube-proxy-jmt4k 1/1 Running 0 7m18s 192.168.122.2 k8scp <none> <none>
    kube-scheduler-k8scp 1/1 Running 0 8m6s 192.168.122.2 k8scp <none> <none>
    jose@k8scp:~$ kubectl -n kube-system get pods -o wide
    NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    calico-kube-controllers-6b9fbfff44-cwk6z 0/1 CrashLoopBackOff 7 (8s ago) 12m 192.168.164.131 k8swrk <none> <none>
    calico-node-j8ntn 1/1 Running 2 (24h ago) 46h 192.168.122.3 k8swrk <none> <none>
    calico-node-tnffg 1/1 Running 2 (24h ago) 46h 192.168.122.2 k8scp <none> <none>
    coredns-558bd4d5db-d22ht 0/1 Running 0 12m 192.168.164.130 k8swrk <none> <none>
    coredns-78fcd69978-87kzd 0/1 Running 0 8m3s 192.168.164.134 k8swrk <none> <none>
    coredns-78fcd69978-sbzck 0/1 Running 0 8m3s 192.168.164.135 k8swrk <none> <none>
    etcd-k8scp 1/1 Running 0 9m39s 192.168.122.2 k8scp <none> <none>
    kube-apiserver-k8scp 1/1 Running 0 8m55s 192.168.122.2 k8scp <none> <none>
    kube-controller-manager-k8scp 1/1 Running 0 8m32s 192.168.122.2 k8scp <none> <none>
    kube-proxy-bxbqz 1/1 Running 0 7m57s 192.168.122.3 k8swrk <none> <none>
    kube-proxy-jmt4k 1/1 Running 0 7m30s 192.168.122.2 k8scp <none> <none>
    kube-scheduler-k8scp 1/1 Running 0 8m18s 192.168.122.2 k8scp <none> <none>

    Only a few seconds passed between the last command and the previous one. Right after kubectl uncordon k8scp, the first kubectl -n kube-system get pods -o wide shows the calico-kube-controllers status as "Running", but a few seconds later it goes back to "CrashLoopBackOff".

    Could it be necessary to upgrade Calico too, as step 9 seems to suggest? If so, I don't know how to do it or which Calico version would work with the upgraded Kubernetes.
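
    If it matters, a rough way to check which Calico version is currently deployed (assuming the manifest-based install in kube-system) is to look at the image tags:

    # Image tag of the calico-node DaemonSet, e.g. docker.io/calico/node:v3.x.y:
    kubectl -n kube-system get ds calico-node -o jsonpath='{.spec.template.spec.containers[0].image}'
    # Same for the kube-controllers Deployment:
    kubectl -n kube-system get deploy calico-kube-controllers -o jsonpath='{.spec.template.spec.containers[0].image}'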

  • Posts: 19
    edited December 2021

    Well, I followed the instructions described here, which redirect to the guide for upgrading Calico installations that use the calico.yaml manifest, and now I think the problem is gone. But the Calico objects now belong to a new namespace (calico-system) instead of the original one (kube-system), as shown below (a rough sketch of the upgrade path follows the pod listing):

    jose@k8scp:~$ kubectl -n calico-system get pods -o wide
    NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    calico-kube-controllers-58494599f9-pr7kn 1/1 Running 0 106s 192.168.74.138 k8scp <none> <none>
    calico-node-8hfkw 1/1 Running 0 47s 192.168.122.2 k8scp <none> <none>
    calico-node-drjf6 1/1 Running 0 35s 192.168.122.3 k8swrk <none> <none>
    calico-typha-66698b6b8b-whnbt 1/1 Running 0 49s 192.168.122.3 k8swrk <none> <none>
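
    The upgrade path boils down to applying the Tigera operator manifest from the linked Calico docs; the operator then adopts the existing manifest-based install and moves the workloads into calico-system. A rough sketch (the exact manifest URL and version come from the docs page, so treat the file name as a placeholder):

    # Operator manifest downloaded from the Calico docs for the chosen version:
    kubectl apply -f tigera-operator.yaml
    # Watch the migration progress (the operator exposes a TigeraStatus resource):
    kubectl get tigerastatus
    # The Calico workloads should now be running in the calico-system namespace:
    kubectl -n calico-system get pods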

    I continued with the worker node upgrade; everything was OK and the previous errors were gone:

    jose@k8scp:~$ kubectl drain k8swrk --ignore-daemonsets
    node/k8swrk cordoned
    WARNING: ignoring DaemonSet-managed Pods: calico-system/calico-node-drjf6, kube-system/kube-proxy-bxbqz
    evicting pod kube-system/coredns-78fcd69978-sbzck
    evicting pod kube-system/coredns-78fcd69978-87kzd
    evicting pod calico-system/calico-typha-66698b6b8b-whnbt
    evicting pod kube-system/coredns-558bd4d5db-d22ht
    pod/calico-typha-66698b6b8b-whnbt evicted
    pod/coredns-78fcd69978-87kzd evicted
    pod/coredns-78fcd69978-sbzck evicted
    pod/coredns-558bd4d5db-d22ht evicted
    node/k8swrk evicted

    If this new configuration could cause problems later due to incompatibilities with the next labs, I would appreciate a warning.
    Otherwise, I'll close this thread.

  • Hi @jmarinho,

    Your issues are caused by overlapping IP addresses between the node/VM IPs managed by the hypervisor and the pod IPs managed by Calico. As long as there is such overlap, your cluster will not operate successfully.

    I would recommend rebuilding your cluster and ensuring that the VM IP addresses managed by the hypervisor do not overlap the default 192.168.0.0/16 pod network managed by Calico. You could try assigning your VMs IP addresses from the 10.200.0.0/16 network to prevent any such IP address overlap.
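
    For example, a quick way to confirm the ranges and avoid the overlap when rebuilding (a rough sketch; the exact kubeadm invocation depends on how you built the cluster):

    # Node/VM IPs as the cluster sees them (the hypervisor-managed addresses):
    kubectl get nodes -o wide
    # Pod network kubeadm recorded for Calico (networking.podSubnet):
    kubectl -n kube-system get cm kubeadm-config -o yaml | grep -i podsubnet
    # When rebuilding, keep the two ranges disjoint, e.g. VMs on 10.200.0.0/16
    # and the default 192.168.0.0/16 pod network passed to kubeadm init:
    sudo kubeadm init --pod-network-cidr=192.168.0.0/16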

    Regards,
    -Chris

  • Hi @chrispokorni,

    Sorry for not answering before, but I did not see the message until today.
    Thanks for your advice. You're right, and that was the problem. I did not pay attention to the subnet mask, a very silly mistake.
    I thought that upgrading Calico as I mentioned had solved the issue, and for some reason it seemed to. I did not have any problems after that, but before I noticed your answer I was having trouble installing Linkerd in Lab 11.1, which was probably related to this.
    As I finally had to rebuild my cluster, I'm redoing the labs, and when I get to Lab 11 I will see whether that was the problem.

    Regards
    Jose

  • Posts: 2

    I am glad to have found this thread. I was precisely at the stage of upgrading the worker node. I have spent about 6 days on it; while troubleshooting I learned a lot, but not what I needed.

    I hope @Chris can comment.

    AWS master and worker nodes:

    ubuntu@ip-172-31-25-66:~$ ku -n kube-system get po -o wide

    NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    calico-kube-controllers-685b65ddf9-rlcrv 0/1 CrashLoopBackOff 23 (84s ago) 98m 172.31.104.240 ip-172-31-26-1
    calico-node-gm8sz 1/1 Running 0 3h43m 172.31.26.1 ip-172-31-26-1
    calico-node-xnqnx 1/1 Running 1 (3h17m ago) 3h43m 172.31.25.66 ip-172-31-25-66
    coredns-64897985d-fz4nj 1/1 Running 6 (3h17m ago) 7d23h 172.31.52.227 ip-172-31-25-66
    coredns-64897985d-shblb 1/1 Running 6 (3h17m ago) 7d23h 172.31.52.228 ip-172-31-25-66
    etcd-ip-172-31-25-66 1/1 Running 7 (3h17m ago) 8d 172.31.25.66 ip-172-31-25-66
    kube-apiserver-ip-172-31-25-66 1/1 Running 10 (3h17m ago) 8d 172.31.25.66 ip-172-31-25-66
    kube-controller-manager-ip-172-31-25-66 1/1 Running 6 (3h17m ago) 8d 172.31.25.66 ip-172-31-25-66
    kube-proxy-x9xrc 1/1 Running 6 (3h17m ago) 8d 172.31.25.66 ip-172-31-25-66
    kube-proxy-zcq2t 1/1 Running 4 (4h41m ago) 8d 172.31.26.1 ip-172-31-26-1
    kube-scheduler-ip-172-31-25-66 1/1 Running 6 (3h17m ago) 8d 172.31.25.66 ip-172-31-25-66

    I am getting the following errors:

    DESCRIBE

    ku -n kube-system describe po calico-kube-controllers-685b65ddf9-rlcrv
    Warning Unhealthy 29m (x10 over 30m) kubelet Readiness probe failed: Failed to read status file /status/status.json: unexpected end of JSON input
    Normal Pulled 29m (x4 over 30m) kubelet Container image "docker.io/calico/kube-controllers:v3.23.1" already present on machine
    Warning BackOff 40s (x136 over 30m) kubelet Back-off restarting failed container

    LOG KUBE-PROXY
    ku -n kube-system logs kube-proxy-x9xrc
    E0703 17:58:39.111615 1 proxier.go:1600] "can't open port, skipping it" err="listen tcp4 :31107: bind: address already in use" port={Description:nodePort for default/nginx IP: IPFamily:4 Port:31107 Protocol:TCP}

    DESCRIBE
    ku -n kube-system describe po kube-proxy-x9xrc
    kube-proxy:
    Container ID: docker://fb063dc9345ee6e122f00c00265e7c41e5f330a240db855a0c580b71823207e7
    Image: k8s.gcr.io/kube-proxy:v1.23.1
    Image ID: docker-pullable://k8s.gcr.io/kube-proxy@sha256:e40f3a28721588affcf187f3f246d1e078157dabe274003eaa2957a83f7170c8
    Port:
    Host Port:
    ku -n kube-system describe po kube-proxy-zcq2t
    Containers:
    kube-proxy:
    Container ID: docker://a4163a3b6548904078d592ade2f948d0a96bb566863cbddbd153a5fa18fd0300
    Image: k8s.gcr.io/kube-proxy:v1.23.1
    Image ID: docker-pullable://k8s.gcr.io/kube-proxy@sha256:e40f3a28721588affcf187f3f246d1e078157dabe274003eaa2957a83f7170c8
    Port:
    Host Port:

    ku -n kube-system logs kube-proxy-zcq2t
    E0703 16:35:32.334879 1 proxier.go:1600] "can't open port, skipping it" err="listen tcp4 :31107: bind: address already in use" port={Description:nodePort for default/nginx IP: IPFamily:4 Port:31107 Protocol:TCP}

    ku -n kube-system logs calico-node-gm8sz
    2022-07-03 20:22:18.097 [INFO][71] monitor-addresses/autodetection_methods.go 103: Using autodetected IPv4 address on interface eth0: 172.31.26.1/20
    2022-07-03 20:22:22.413 [INFO][68] felix/summary.go 100: Summarising 12 dataplane reconciliation loops over 1m2.5s: avg=4ms longest=8ms (resync-nat-v4)

    ku -n kube-system logs calico-node-xnqnx
    2022-07-03 20:24:44.924 [INFO][71] monitor-addresses/autodetection_methods.go 103: Using autodetected IPv4 address on interface eth0: 172.31.25.66/20
    2022-07-03 20:24:47.346 [INFO][66] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m3.4s: avg=4ms longest=12ms ()

    error: unable to upgrade connection: container not found ("calico-kube-controllers")

    I could use some insight!
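
    For completeness, a quick way to compare the pod network with the node subnet (a sketch; on AWS the default VPC is 172.31.0.0/16, so the pod network must not fall inside that range):

    # Pod IPs vs node IPs side by side:
    kubectl get pods -A -o wide
    kubectl get nodes -o wide
    # Pod CIDR recorded by kubeadm at init time (networking.podSubnet):
    kubectl -n kube-system get cm kubeadm-config -o yaml | grep -i podsubnet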

  • Posts: 2

    @Chris, please ignore. I reviewed my steps against your instructions and the error was mine; the IP overlap did indeed occur. I am rebuilding the cluster. I am glad I was stuck for 6 days, since I learned a lot. Thanks.

