Problem with calico-kube-controller (Lab 4.1)

Hi.

After upgrading the cp node successfully, I proceeded to upgrade the worker node and got this error:

jose@k8scp:~$ kubectl drain k8swrk --ignore-daemonsets
node/k8swrk already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j8ntn, kube-system/kube-proxy-2gl5s
evicting pod kube-system/calico-kube-controllers-6b9fbfff44-4lmlh
error when evicting pods/"calico-kube-controllers-6b9fbfff44-4lmlh" -n "kube-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod kube-system/calico-kube-controllers-6b9fbfff44-4lmlh
error when evicting pods/"calico-kube-controllers-6b9fbfff44-4lmlh" -n "kube-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod kube-system/calico-kube-controllers-6b9fbfff44-4lmlh
error when evicting pods/"calico-kube-controllers-6b9fbfff44-4lmlh" -n "kube-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
^C
.........
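
(The eviction seems to be blocked by a PodDisruptionBudget; I guess it could be inspected with something like the following, though I'm not sure of the exact name of the budget Calico creates:)

# List all PodDisruptionBudgets in the cluster
kubectl get pdb -A

# Inspect the budget covering the Calico controller (name assumed, check the list above)
kubectl -n kube-system describe pdb calico-kube-controllers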

Up to that point the lab was going well, and the upgrade process on the cp node went fine:

jose@k8scp:~$ kubectl get node
NAME     STATUS                     ROLES                  AGE   VERSION
k8scp    Ready                      control-plane,master   39h   v1.22.1
k8swrk   Ready,SchedulingDisabled   <none>                 39h   v1.21.1

Then I uncordoned k8swrk and:

jose@k8scp:~$ kubectl get nodes
NAME     STATUS   ROLES                  AGE   VERSION
k8scp    Ready    control-plane,master   42h   v1.22.1
k8swrk   Ready    <none>                 41h   v1.22.1

I ignored the issue because everything seemed to go fine and I didn't notice any problem with my installation. But after continuing with Lab 4.2, the output of some commands worried me, for example:

 jose@k8scp:~$ kubectl -n kube-system get pods -o wide
NAME                                       READY   STATUS             RESTARTS         AGE     IP                NODE     NOMINATED NODE   READINESS GATES
calico-kube-controllers-6b9fbfff44-4lmlh   0/1     CrashLoopBackOff   58 (2m52s ago)   3h13m   192.168.164.138   k8swrk   <none>           <none>
calico-node-j8ntn                          1/1     Running            5 (82m ago)      41h     192.168.122.3     k8swrk   <none>           <none>
calico-node-tnffg                          1/1     Running            5                41h     192.168.122.2     k8scp    <none>           <none>
coredns-78fcd69978-hz2kl                   1/1     Running            2 (82m ago)      173m    192.168.74.146    k8scp    <none>           <none>
coredns-78fcd69978-mczhs                   1/1     Running            2 (82m ago)      173m    192.168.74.147    k8scp    <none>           <none>

<omitted>

As can be seen, the calico-kube-controllers pod is in CrashLoopBackOff status, and I suspect that is not a good sign.

What is going wrong here?

I tried kubectl drain again but with the same results.
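
In case it helps with debugging, I suppose the next step is to look at the crashing pod's events and logs, along these lines (pod name taken from the output above):

# Events for the crashing pod (scheduling, probes, restart back-off)
kubectl -n kube-system describe pod calico-kube-controllers-6b9fbfff44-4lmlh

# Logs from the current and from the previously crashed container
kubectl -n kube-system logs calico-kube-controllers-6b9fbfff44-4lmlh
kubectl -n kube-system logs calico-kube-controllers-6b9fbfff44-4lmlh --previous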

Comments

  • jmarinho (edited December 2021)

    I'm redoing Lab 4.1 from a snapshot of my VMs and, after upgrading the cp node, at step 15 I noticed that the problem is there again. I issued the command kubectl uncordon k8scp, but that didn't help. I'm appending the output for better debugging:

    jose@k8scp:~$ kubectl get node
    NAME     STATUS                     ROLES                  AGE   VERSION
    k8scp    Ready,SchedulingDisabled   control-plane,master   46h   v1.22.1
    k8swrk   Ready                      <none>                 46h   v1.21.1
    jose@k8scp:~$ kubectl -n kube-system get pods -o wide
    NAME                                       READY   STATUS             RESTARTS        AGE     IP                NODE     NOMINATED NODE   READINESS GATES
    calico-kube-controllers-6b9fbfff44-cwk6z   0/1     CrashLoopBackOff   6 (2m53s ago)   10m     192.168.164.131   k8swrk   <none>           <none>
    calico-node-j8ntn                          1/1     Running            2 (23h ago)     46h     192.168.122.3     k8swrk   <none>           <none>
    calico-node-tnffg                          1/1     Running            2 (23h ago)     46h     192.168.122.2     k8scp    <none>           <none>
    coredns-558bd4d5db-d22ht                   0/1     Running            0               10m     192.168.164.130   k8swrk   <none>           <none>
    coredns-78fcd69978-87kzd                   0/1     Running            0               5m30s   192.168.164.134   k8swrk   <none>           <none>
    coredns-78fcd69978-sbzck                   0/1     Running            0               5m30s   192.168.164.135   k8swrk   <none>           <none>
    etcd-k8scp                                 1/1     Running            0               7m6s    192.168.122.2     k8scp    <none>           <none>
    kube-apiserver-k8scp                       1/1     Running            0               6m22s   192.168.122.2     k8scp    <none>           <none>
    kube-controller-manager-k8scp              1/1     Running            0               5m59s   192.168.122.2     k8scp    <none>           <none>
    kube-proxy-bxbqz                           1/1     Running            0               5m24s   192.168.122.3     k8swrk   <none>           <none>
    kube-proxy-jmt4k                           1/1     Running            0               4m57s   192.168.122.2     k8scp    <none>           <none>
    kube-scheduler-k8scp                       1/1     Running            0               5m45s   192.168.122.2     k8scp    <none>           <none>
    jose@k8scp:~$ kubectl uncordon k8scp
    node/k8scp uncordoned
    jose@k8scp:~$ kubectl get node
    NAME     STATUS   ROLES                  AGE   VERSION
    k8scp    Ready    control-plane,master   46h   v1.22.1
    k8swrk   Ready    <none>                 46h   v1.21.1
    jose@k8scp:~$ kubectl -n kube-system get pods -o wide
    NAME                                       READY   STATUS    RESTARTS        AGE     IP                NODE     NOMINATED NODE   READINESS GATES
    calico-kube-controllers-6b9fbfff44-cwk6z   0/1     Running   7 (5m14s ago)   12m     192.168.164.131   k8swrk   <none>           <none>
    calico-node-j8ntn                          1/1     Running   2 (24h ago)     46h     192.168.122.3     k8swrk   <none>           <none>
    calico-node-tnffg                          1/1     Running   2 (24h ago)     46h     192.168.122.2     k8scp    <none>           <none>
    coredns-558bd4d5db-d22ht                   0/1     Running   0               12m     192.168.164.130   k8swrk   <none>           <none>
    coredns-78fcd69978-87kzd                   0/1     Running   0               7m51s   192.168.164.134   k8swrk   <none>           <none>
    coredns-78fcd69978-sbzck                   0/1     Running   0               7m51s   192.168.164.135   k8swrk   <none>           <none>
    etcd-k8scp                                 1/1     Running   0               9m27s   192.168.122.2     k8scp    <none>           <none>
    kube-apiserver-k8scp                       1/1     Running   0               8m43s   192.168.122.2     k8scp    <none>           <none>
    kube-controller-manager-k8scp              1/1     Running   0               8m20s   192.168.122.2     k8scp    <none>           <none>
    kube-proxy-bxbqz                           1/1     Running   0               7m45s   192.168.122.3     k8swrk   <none>           <none>
    kube-proxy-jmt4k                           1/1     Running   0               7m18s   192.168.122.2     k8scp    <none>           <none>
    kube-scheduler-k8scp                       1/1     Running   0               8m6s    192.168.122.2     k8scp    <none>           <none>
    jose@k8scp:~$ kubectl -n kube-system get pods -o wide
    NAME                                       READY   STATUS             RESTARTS      AGE     IP                NODE     NOMINATED NODE   READINESS GATES
    calico-kube-controllers-6b9fbfff44-cwk6z   0/1     CrashLoopBackOff   7 (8s ago)    12m     192.168.164.131   k8swrk   <none>           <none>
    calico-node-j8ntn                          1/1     Running            2 (24h ago)   46h     192.168.122.3     k8swrk   <none>           <none>
    calico-node-tnffg                          1/1     Running            2 (24h ago)   46h     192.168.122.2     k8scp    <none>           <none>
    coredns-558bd4d5db-d22ht                   0/1     Running            0             12m     192.168.164.130   k8swrk   <none>           <none>
    coredns-78fcd69978-87kzd                   0/1     Running            0             8m3s    192.168.164.134   k8swrk   <none>           <none>
    coredns-78fcd69978-sbzck                   0/1     Running            0             8m3s    192.168.164.135   k8swrk   <none>           <none>
    etcd-k8scp                                 1/1     Running            0             9m39s   192.168.122.2     k8scp    <none>           <none>
    kube-apiserver-k8scp                       1/1     Running            0             8m55s   192.168.122.2     k8scp    <none>           <none>
    kube-controller-manager-k8scp              1/1     Running            0             8m32s   192.168.122.2     k8scp    <none>           <none>
    kube-proxy-bxbqz                           1/1     Running            0             7m57s   192.168.122.3     k8swrk   <none>           <none>
    kube-proxy-jmt4k                           1/1     Running            0             7m30s   192.168.122.2     k8scp    <none>           <none>
    kube-scheduler-k8scp                       1/1     Running            0             8m18s   192.168.122.2     k8scp    <none>           <none>
    

    Only a few seconds passed between the last command and the previous one. Right after kubectl uncordon k8scp, the first kubectl -n kube-system get pods -o wide shows the calico-kube-controllers status as "Running", but a few seconds later it goes back to "CrashLoopBackOff".

    Could it be necessary to upgrade Calico too, as step 9 seems to suggest? If so, I don't know how to do it or which version would work with the upgraded Kubernetes.
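
    If it matters, I believe the currently deployed Calico version can be read from the calico-node daemonset image (assuming the manifest-based install in kube-system):

    # Show the calico-node image, and therefore the Calico version, currently running
    kubectl -n kube-system get daemonset calico-node -o jsonpath='{.spec.template.spec.containers[0].image}'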

  • jmarinho (edited December 2021)

    Well, I followed the instructions described here, which redirected me here for upgrading Calico installations through the calico.yaml manifest, and now I think the problem is gone. However, the Calico objects now belong to a new namespace (calico-system) instead of the original one (kube-system):

    jose@k8scp:~$ kubectl -n calico-system get pods -o wide
    NAME                                       READY   STATUS    RESTARTS   AGE    IP               NODE     NOMINATED NODE   READINESS GATES
    calico-kube-controllers-58494599f9-pr7kn   1/1     Running   0          106s   192.168.74.138   k8scp    <none>           <none>
    calico-node-8hfkw                          1/1     Running   0          47s    192.168.122.2    k8scp    <none>           <none>
    calico-node-drjf6                          1/1     Running   0          35s    192.168.122.3    k8swrk   <none>           <none>
    calico-typha-66698b6b8b-whnbt              1/1     Running   0          49s    192.168.122.3    k8swrk   <none>           <none>
    

    I continued with the worker node upgrade, everything was OK, and the previous errors were gone:

    jose@k8scp:~$ kubectl drain k8swrk --ignore-daemonsets
    node/k8swrk cordoned
    WARNING: ignoring DaemonSet-managed Pods: calico-system/calico-node-drjf6, kube-system/kube-proxy-bxbqz
    evicting pod kube-system/coredns-78fcd69978-sbzck
    evicting pod kube-system/coredns-78fcd69978-87kzd
    evicting pod calico-system/calico-typha-66698b6b8b-whnbt
    evicting pod kube-system/coredns-558bd4d5db-d22ht
    pod/calico-typha-66698b6b8b-whnbt evicted
    pod/coredns-78fcd69978-87kzd evicted
    pod/coredns-78fcd69978-sbzck evicted
    pod/coredns-558bd4d5db-d22ht evicted
    node/k8swrk evicted
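
    For completeness, a quick way to confirm where all the Calico components ended up after this kind of migration is something like:

    # The Calico pods should now show up under calico-system rather than kube-system
    kubectl get pods -A -o wide | grep -i calico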
    

    If this new configuration could cause me problems with incompatibilities in the next labs, I would appreciate it if somebody could warn me.
    Otherwise, I'll close this thread.

  • Hi @jmarinho,

    Your issues are caused by overlapping IP addresses between the node/VM IPs managed by the hypervisor and the pod IPs managed by Calico. As long as there is such overlap, your cluster will not operate successfully.

    I would recommend rebuilding your cluster and ensuring that the VM IP addresses managed by the hypervisor do not overlap the default 192.168.0.0/16 pod network managed by Calico. You could try assigning your VMs IP addresses from the 10.200.0.0/16 network to prevent any such IP address overlap.
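
    A quick way to check for this kind of overlap on an existing cluster (just a sketch, assuming a kubeadm-based setup where --pod-network-cidr was passed at init time) is to compare the node addresses with the recorded pod network:

    # Node/VM addresses as Kubernetes sees them
    kubectl get nodes -o wide

    # Cluster pod CIDR, taken from the kube-controller-manager flags captured in the dump
    kubectl cluster-info dump | grep -m 1 cluster-cidr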

    Regards,
    -Chris

  • Hi @chrispokorni,

    Sorry for not answering earlier, but I did not see your message until today.
    Thanks for your advice. You're right, that was the problem. I did not pay attention to the subnet mask, a very silly mistake.
    I thought that upgrading Calico as I mentioned had solved the issue, and for a while it seemed to. I did not have any problems after that, but before I noticed your answer I was having trouble installing Linkerd in Lab 11.1, which was probably related to this.
    Since I finally had to rebuild my cluster, I'm redoing the labs, and when I get to Lab 11 I will see whether that was the problem.

    Regards
    Jose

  • rsheikh

    I am glad to have found this thread. I was at precisely the stage of upgrading the worker node. I have spent about six days on it, and while troubleshooting I learned a lot, but not what I needed.

    I hope that @Chris can comment.

    AWS master-worker node

    ubuntu@ip-172-31-25-66:~$ ku -n kube-system get po -o wide

    NAME                                       READY   STATUS             RESTARTS         AGE     IP               NODE              NOMINATED NODE   READINESS GATES
    calico-kube-controllers-685b65ddf9-rlcrv   0/1     CrashLoopBackOff   23 (84s ago)     98m     172.31.104.240   ip-172-31-26-1
    calico-node-gm8sz                          1/1     Running            0                3h43m   172.31.26.1      ip-172-31-26-1
    calico-node-xnqnx                          1/1     Running            1 (3h17m ago)    3h43m   172.31.25.66     ip-172-31-25-66
    coredns-64897985d-fz4nj                    1/1     Running            6 (3h17m ago)    7d23h   172.31.52.227    ip-172-31-25-66
    coredns-64897985d-shblb                    1/1     Running            6 (3h17m ago)    7d23h   172.31.52.228    ip-172-31-25-66
    etcd-ip-172-31-25-66                       1/1     Running            7 (3h17m ago)    8d      172.31.25.66     ip-172-31-25-66
    kube-apiserver-ip-172-31-25-66             1/1     Running            10 (3h17m ago)   8d      172.31.25.66     ip-172-31-25-66
    kube-controller-manager-ip-172-31-25-66    1/1     Running            6 (3h17m ago)    8d      172.31.25.66     ip-172-31-25-66
    kube-proxy-x9xrc                           1/1     Running            6 (3h17m ago)    8d      172.31.25.66     ip-172-31-25-66
    kube-proxy-zcq2t                           1/1     Running            4 (4h41m ago)    8d      172.31.26.1      ip-172-31-26-1
    kube-scheduler-ip-172-31-25-66             1/1     Running            6 (3h17m ago)    8d      172.31.25.66     ip-172-31-25-66

    I am getting the following errors:

    DESCRIBE

    ku -n kube-system describe po calico-kube-controllers-685b65ddf9-rlcrv
    Warning Unhealthy 29m (x10 over 30m) kubelet Readiness probe failed: Failed to read status file /status/status.json: unexpected end of JSON input
    Normal Pulled 29m (x4 over 30m) kubelet Container image "docker.io/calico/kube-controllers:v3.23.1" already present on machine
    Warning BackOff 40s (x136 over 30m) kubelet Back-off restarting failed container

    LOG KUBE-PROXY
    ku -n kube-system logs kube-proxy-x9xrc
    E0703 17:58:39.111615 1 proxier.go:1600] "can't open port, skipping it" err="listen tcp4 :31107: bind: address already in use" port={Description:nodePort for default/nginx IP: IPFamily:4 Port:31107 Protocol:TCP}

    DESCRIBE
    ku -n kube-system describe po kube-proxy-x9xrc
    kube-proxy:
    Container ID: docker://fb063dc9345ee6e122f00c00265e7c41e5f330a240db855a0c580b71823207e7
    Image: k8s.gcr.io/kube-proxy:v1.23.1
    Image ID: docker-pullable://k8s.gcr.io/kube-proxy@sha256:e40f3a28721588affcf187f3f246d1e078157dabe274003eaa2957a83f7170c8
    Port:
    Host Port:
    ku -n kube-system describe po kube-proxy-zcq2t
    Containers:
    kube-proxy:
    Container ID: docker://a4163a3b6548904078d592ade2f948d0a96bb566863cbddbd153a5fa18fd0300
    Image: k8s.gcr.io/kube-proxy:v1.23.1
    Image ID: docker-pullable://k8s.gcr.io/kube-proxy@sha256:e40f3a28721588affcf187f3f246d1e078157dabe274003eaa2957a83f7170c8
    Port:
    Host Port:

    ku -n kube-system logs kube-proxy-zcq2t
    E0703 16:35:32.334879 1 proxier.go:1600] "can't open port, skipping it" err="listen tcp4 :31107: bind: address already in use" port={Description:nodePort for default/nginx IP: IPFamily:4 Port:31107 Protocol:TCP}
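
    (If I understand that message correctly, something on the node is already listening on that NodePort; I assume it can be checked with something like:)

    # See which process currently holds port 31107 on the node
    sudo ss -tlnp | grep 31107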

    ku -n kube-system logs calico-node-gm8sz
    2022-07-03 20:22:18.097 [INFO][71] monitor-addresses/autodetection_methods.go 103: Using autodetected IPv4 address on interface eth0: 172.31.26.1/20
    2022-07-03 20:22:22.413 [INFO][68] felix/summary.go 100: Summarising 12 dataplane reconciliation loops over 1m2.5s: avg=4ms longest=8ms (resync-nat-v4)

    ku -n kube-system logs calico-node-xnqnx
    2022-07-03 20:24:44.924 [INFO][71] monitor-addresses/autodetection_methods.go 103: Using autodetected IPv4 address on interface eth0: 172.31.25.66/20
    2022-07-03 20:24:47.346 [INFO][66] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m3.4s: avg=4ms longest=12ms ()

    error: unable to upgrade connection: container not found ("calico-kube-controllers")
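
    (For reference, the host network from the logs above and the addresses the pods are actually getting can be compared side by side with:)

    # Host/VM network on the primary interface (eth0 per the calico-node logs above)
    ip -4 addr show eth0

    # Addresses assigned to pods by the CNI, across all namespaces
    kubectl get pods -A -o wide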

    I could use some insight!

  • rsheikh

    @Chris, please ignore. I reviewed my steps against your instructions and the error was mine; the IP overlap did indeed occur. I am rebuilding the cluster. I am glad I was stuck for six days, since I learned a lot. Thanks.
