
Problem with calico-kube-controller (Lab 4.1)

Hi.

After upgrading the cp node successfully, I proceeded to upgrade the worker node and got this error:

[email protected]:~$ kubectl drain k8swrk --ignore-daemonsets
node/k8swrk already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j8ntn, kube-system/kube-proxy-2gl5s
evicting pod kube-system/calico-kube-controllers-6b9fbfff44-4lmlh
error when evicting pods/"calico-kube-controllers-6b9fbfff44-4lmlh" -n "kube-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod kube-system/calico-kube-controllers-6b9fbfff44-4lmlh
error when evicting pods/"calico-kube-controllers-6b9fbfff44-4lmlh" -n "kube-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod kube-system/calico-kube-controllers-6b9fbfff44-4lmlh
error when evicting pods/"calico-kube-controllers-6b9fbfff44-4lmlh" -n "kube-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
^C
.........
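
For reference, the PodDisruptionBudget blocking the eviction can be inspected like this (a quick sketch; the exact PDB name is not shown above, so list them first):

kubectl -n kube-system get poddisruptionbudgets
kubectl -n kube-system describe pdb <name-from-the-listing>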

Up to that point, the lab had been going well and the upgrade process on the cp node went fine:

[email protected]:~$ kubectl get node
NAME     STATUS                     ROLES                  AGE   VERSION
k8scp    Ready                      control-plane,master   39h   v1.22.1
k8swrk   Ready,SchedulingDisabled   <none>                 39h   v1.21.1

Then I uncordoned k8swrk and got:

[email protected]:~$ kubectl get nodes
NAME     STATUS   ROLES                  AGE   VERSION
k8scp    Ready    control-plane,master   42h   v1.22.1
k8swrk   Ready    <none>                 41h   v1.22.1

I ignored the issue because everything seemed to go fine and I didn't notice any problem with my installation. But after continuing with Lab 4.2, the output of some commands worried me, for example:

[email protected]:~$ kubectl -n kube-system get pods -o wide
NAME                                       READY   STATUS             RESTARTS         AGE     IP                NODE     NOMINATED NODE   READINESS GATES
calico-kube-controllers-6b9fbfff44-4lmlh   0/1     CrashLoopBackOff   58 (2m52s ago)   3h13m   192.168.164.138   k8swrk   <none>           <none>
calico-node-j8ntn                          1/1     Running            5 (82m ago)      41h     192.168.122.3     k8swrk   <none>           <none>
calico-node-tnffg                          1/1     Running            5                41h     192.168.122.2     k8scp    <none>           <none>
coredns-78fcd69978-hz2kl                   1/1     Running            2 (82m ago)      173m    192.168.74.146    k8scp    <none>           <none>
coredns-78fcd69978-mczhs                   1/1     Running            2 (82m ago)      173m    192.168.74.147    k8scp    <none>           <none>

<omitted>

As can be seen, the calico-kube-controllers pod is in CrashLoopBackOff status, which I suspect is not a good sign.

What is going wrong here?

I tried kubectl drain again, but with the same results.
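
To get more detail on the CrashLoopBackOff, the usual first checks would be the pod events and the logs of the previous (crashed) container instance (pod name taken from the output above):

kubectl -n kube-system describe pod calico-kube-controllers-6b9fbfff44-4lmlh
kubectl -n kube-system logs calico-kube-controllers-6b9fbfff44-4lmlh --previous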

Comments

  • jmarinho
    jmarinho Posts: 19
    edited December 2021

    I'm doing Lab 4.1 again from a snapshot of my VMs and, after upgrading the cp node, at step 15 I noticed that the problem was there again. I issued the command kubectl uncordon k8scp but that didn't help. I'm appending the output of all this for easier debugging:

    [email protected]:~$ kubectl get node
    NAME     STATUS                     ROLES                  AGE   VERSION
    k8scp    Ready,SchedulingDisabled   control-plane,master   46h   v1.22.1
    k8swrk   Ready                      <none>                 46h   v1.21.1
    [email protected]:~$ kubectl -n kube-system get pods -o wide
    NAME                                       READY   STATUS             RESTARTS        AGE     IP                NODE     NOMINATED NODE   READINESS GATES
    calico-kube-controllers-6b9fbfff44-cwk6z   0/1     CrashLoopBackOff   6 (2m53s ago)   10m     192.168.164.131   k8swrk   <none>           <none>
    calico-node-j8ntn                          1/1     Running            2 (23h ago)     46h     192.168.122.3     k8swrk   <none>           <none>
    calico-node-tnffg                          1/1     Running            2 (23h ago)     46h     192.168.122.2     k8scp    <none>           <none>
    coredns-558bd4d5db-d22ht                   0/1     Running            0               10m     192.168.164.130   k8swrk   <none>           <none>
    coredns-78fcd69978-87kzd                   0/1     Running            0               5m30s   192.168.164.134   k8swrk   <none>           <none>
    coredns-78fcd69978-sbzck                   0/1     Running            0               5m30s   192.168.164.135   k8swrk   <none>           <none>
    etcd-k8scp                                 1/1     Running            0               7m6s    192.168.122.2     k8scp    <none>           <none>
    kube-apiserver-k8scp                       1/1     Running            0               6m22s   192.168.122.2     k8scp    <none>           <none>
    kube-controller-manager-k8scp              1/1     Running            0               5m59s   192.168.122.2     k8scp    <none>           <none>
    kube-proxy-bxbqz                           1/1     Running            0               5m24s   192.168.122.3     k8swrk   <none>           <none>
    kube-proxy-jmt4k                           1/1     Running            0               4m57s   192.168.122.2     k8scp    <none>           <none>
    kube-scheduler-k8scp                       1/1     Running            0               5m45s   192.168.122.2     k8scp    <none>           <none>
    [email protected]:~$ kubectl uncordon k8scp
    node/k8scp uncordoned
    [email protected]:~$ kubectl get node
    NAME     STATUS   ROLES                  AGE   VERSION
    k8scp    Ready    control-plane,master   46h   v1.22.1
    k8swrk   Ready    <none>                 46h   v1.21.1
    [email protected]:~$ kubectl -n kube-system get pods -o wide
    NAME                                       READY   STATUS    RESTARTS        AGE     IP                NODE     NOMINATED NODE   READINESS GATES
    calico-kube-controllers-6b9fbfff44-cwk6z   0/1     Running   7 (5m14s ago)   12m     192.168.164.131   k8swrk   <none>           <none>
    calico-node-j8ntn                          1/1     Running   2 (24h ago)     46h     192.168.122.3     k8swrk   <none>           <none>
    calico-node-tnffg                          1/1     Running   2 (24h ago)     46h     192.168.122.2     k8scp    <none>           <none>
    coredns-558bd4d5db-d22ht                   0/1     Running   0               12m     192.168.164.130   k8swrk   <none>           <none>
    coredns-78fcd69978-87kzd                   0/1     Running   0               7m51s   192.168.164.134   k8swrk   <none>           <none>
    coredns-78fcd69978-sbzck                   0/1     Running   0               7m51s   192.168.164.135   k8swrk   <none>           <none>
    etcd-k8scp                                 1/1     Running   0               9m27s   192.168.122.2     k8scp    <none>           <none>
    kube-apiserver-k8scp                       1/1     Running   0               8m43s   192.168.122.2     k8scp    <none>           <none>
    kube-controller-manager-k8scp              1/1     Running   0               8m20s   192.168.122.2     k8scp    <none>           <none>
    kube-proxy-bxbqz                           1/1     Running   0               7m45s   192.168.122.3     k8swrk   <none>           <none>
    kube-proxy-jmt4k                           1/1     Running   0               7m18s   192.168.122.2     k8scp    <none>           <none>
    kube-scheduler-k8scp                       1/1     Running   0               8m6s    192.168.122.2     k8scp    <none>           <none>
    [email protected]:~$ kubectl -n kube-system get pods -o wide
    NAME                                       READY   STATUS             RESTARTS      AGE     IP                NODE     NOMINATED NODE   READINESS GATES
    calico-kube-controllers-6b9fbfff44-cwk6z   0/1     CrashLoopBackOff   7 (8s ago)    12m     192.168.164.131   k8swrk   <none>           <none>
    calico-node-j8ntn                          1/1     Running            2 (24h ago)   46h     192.168.122.3     k8swrk   <none>           <none>
    calico-node-tnffg                          1/1     Running            2 (24h ago)   46h     192.168.122.2     k8scp    <none>           <none>
    coredns-558bd4d5db-d22ht                   0/1     Running            0             12m     192.168.164.130   k8swrk   <none>           <none>
    coredns-78fcd69978-87kzd                   0/1     Running            0             8m3s    192.168.164.134   k8swrk   <none>           <none>
    coredns-78fcd69978-sbzck                   0/1     Running            0             8m3s    192.168.164.135   k8swrk   <none>           <none>
    etcd-k8scp                                 1/1     Running            0             9m39s   192.168.122.2     k8scp    <none>           <none>
    kube-apiserver-k8scp                       1/1     Running            0             8m55s   192.168.122.2     k8scp    <none>           <none>
    kube-controller-manager-k8scp              1/1     Running            0             8m32s   192.168.122.2     k8scp    <none>           <none>
    kube-proxy-bxbqz                           1/1     Running            0             7m57s   192.168.122.3     k8swrk   <none>           <none>
    kube-proxy-jmt4k                           1/1     Running            0             7m30s   192.168.122.2     k8scp    <none>           <none>
    kube-scheduler-k8scp                       1/1     Running            0             8m18s   192.168.122.2     k8scp    <none>           <none>
    

    Only a few seconds passed between the last command and the previous one. After kubectl uncordon k8scp, the first kubectl -n kube-system get pods -o wide shows the calico-kube-controllers status as "Running", but a few seconds later it shows "CrashLoopBackOff" again.
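
    Rather than re-running get pods, the flapping can be watched directly (pod name from the output above); the restart counter keeps increasing while the status alternates between Running and CrashLoopBackOff:

    kubectl -n kube-system get pod calico-kube-controllers-6b9fbfff44-cwk6z --watch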

    Could it be necessary to upgrade Calico too, as step 9 seems to suggest? If so, I don't know how to do it or which version would work with the upgraded components.
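
    As a starting point, the currently installed Calico version can be read from the image tag of the calico-node DaemonSet (assuming it is named calico-node, as in the stock manifest):

    kubectl -n kube-system get daemonset calico-node -o jsonpath='{.spec.template.spec.containers[0].image}'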

  • jmarinho
    jmarinho Posts: 19
    edited December 2021

    Well, I followed the instructions described here, which redirected me here for upgrading Calico installations through the calico.yaml manifest, and now I think the problem is gone. But the Calico objects now belong to a new namespace (calico-system) instead of the original one (kube-system):

    [email protected]:~$ kubectl -n calico-system get pods -o wide
    NAME                                       READY   STATUS    RESTARTS   AGE    IP               NODE     NOMINATED NODE   READINESS GATES
    calico-kube-controllers-58494599f9-pr7kn   1/1     Running   0          106s   192.168.74.138   k8scp    <none>           <none>
    calico-node-8hfkw                          1/1     Running   0          47s    192.168.122.2    k8scp    <none>           <none>
    calico-node-drjf6                          1/1     Running   0          35s    192.168.122.3    k8swrk   <none>           <none>
    calico-typha-66698b6b8b-whnbt              1/1     Running   0          49s    192.168.122.3    k8swrk   <none>           <none>
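
    A simple way to confirm where the Calico components now live, and that nothing was left behind in kube-system:

    kubectl get pods --all-namespaces -o wide | grep -i calico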
    

    I continued with the worker node upgrade; everything was OK and the previous errors were gone:

    [email protected]:~$ kubectl drain k8swrk --ignore-daemonsets
    node/k8swrk cordoned
    WARNING: ignoring DaemonSet-managed Pods: calico-system/calico-node-drjf6, kube-system/kube-proxy-bxbqz
    evicting pod kube-system/coredns-78fcd69978-sbzck
    evicting pod kube-system/coredns-78fcd69978-87kzd
    evicting pod calico-system/calico-typha-66698b6b8b-whnbt
    evicting pod kube-system/coredns-558bd4d5db-d22ht
    pod/calico-typha-66698b6b8b-whnbt evicted
    pod/coredns-78fcd69978-87kzd evicted
    pod/coredns-78fcd69978-sbzck evicted
    pod/coredns-558bd4d5db-d22ht evicted
    node/k8swrk evicted
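
    For completeness, the remaining worker-node steps are roughly the following (a sketch of the standard kubeadm flow, assuming the apt-based install used in the lab; the exact package pins and flags may differ from the lab PDF):

    # on the worker node (k8swrk), after draining it
    sudo apt-get update
    sudo apt-get install -y --allow-change-held-packages kubeadm=1.22.1-00
    sudo kubeadm upgrade node
    sudo apt-get install -y --allow-change-held-packages kubelet=1.22.1-00 kubectl=1.22.1-00
    sudo systemctl daemon-reload
    sudo systemctl restart kubelet

    # back on the cp node
    kubectl uncordon k8swrk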
    

    If this new configuration could cause problems in later labs because of incompatibilities, I would appreciate a warning.
    Otherwise, I will close this thread.

  • chrispokorni

    Hi @jmarinho,

    Your issues are caused by overlapping IP addresses between the node/VM IPs managed by the hypervisor and the pod IPs managed by Calico. As long as there is such overlap, your cluster will not operate successfully.

    I would recommend rebuilding your cluster and ensuring that the VM IP addresses managed by the hypervisor do not overlap the default 192.168.0.0/16 pod network managed by Calico. You could try assigning your VMs IP addresses from the 10.200.0.0/16 network to prevent any such IP address overlap.
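
    The overlap is visible in the outputs already gathered: the node INTERNAL-IPs (192.168.122.x) fall inside Calico's default 192.168.0.0/16 pod range. Two quick checks, assuming the cluster was initialized with --pod-network-cidr (kubeadm then passes it to the controller manager as --cluster-cidr; the pod name below is taken from the listings above):

    # the pod network CIDR handed to the controller manager
    kubectl -n kube-system get pod kube-controller-manager-k8scp -o yaml | grep cluster-cidr

    # node/VM addresses, to compare against that range
    kubectl get nodes -o wide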

    Regards,
    -Chris

  • jmarinho
    jmarinho Posts: 19

    Hi @chrispokorni,

    Sorry for not answering earlier, but I did not see the message until today.
    Thanks for your advice. You're right, that was the problem. I did not pay attention to the subnet mask, a very silly mistake.
    I thought that upgrading Calico as I mentioned had solved the issue, and for some reason it seemed to. I did not have any problems after that but, before I noticed your answer, I was having trouble installing linkerd in Lab 11.1, which was probably related to this.
    As I finally had to rebuild my cluster, I'm redoing the labs, and when I get to Lab 11 I will see whether that was the problem.
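
    In case it helps anyone rebuilding for the same reason, a minimal pre-flight check on each VM before kubeadm init (purely illustrative):

    # the VM address must be outside Calico's default 192.168.0.0/16 pod network,
    # e.g. something in 10.200.0.0/16 as suggested above
    ip -4 addr show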

    Regards
    Jose
