Problem with calico-kube-controller (Lab 4.1)

Hi.

After upgrading the cp node successfully, I proceeded to upgrade the worker node and got this error:

jose@k8scp:~$ kubectl drain k8swrk --ignore-daemonsets
node/k8swrk already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-j8ntn, kube-system/kube-proxy-2gl5s
evicting pod kube-system/calico-kube-controllers-6b9fbfff44-4lmlh
error when evicting pods/"calico-kube-controllers-6b9fbfff44-4lmlh" -n "kube-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod kube-system/calico-kube-controllers-6b9fbfff44-4lmlh
error when evicting pods/"calico-kube-controllers-6b9fbfff44-4lmlh" -n "kube-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod kube-system/calico-kube-controllers-6b9fbfff44-4lmlh
error when evicting pods/"calico-kube-controllers-6b9fbfff44-4lmlh" -n "kube-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
^C
.........
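
(The eviction seems to be blocked by a PodDisruptionBudget; I guess it could be inspected with something like the following, though I'm not sure of the exact name of the budget Calico creates:)

# List all PodDisruptionBudgets in the cluster
kubectl get pdb -A

# Inspect the budget covering the Calico controller (name assumed, check the list above)
kubectl -n kube-system describe pdb calico-kube-controllers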

Up to that point the lab was going well, and the upgrade process on the cp node went fine:

jose@k8scp:~$ kubectl get node
NAME     STATUS                     ROLES                  AGE   VERSION
k8scp    Ready                      control-plane,master   39h   v1.22.1
k8swrk   Ready,SchedulingDisabled   <none>                 39h   v1.21.1

Then I uncordoned k8swrk and:

jose@k8scp:~$ kubectl get nodes
NAME     STATUS   ROLES                  AGE   VERSION
k8scp    Ready    control-plane,master   42h   v1.22.1
k8swrk   Ready    <none>                 41h   v1.22.1

I ignored the issue because everything seemed to go fine and I didn't notice any problem with my installation. But after continuing with Lab 4.2, the output of some commands worried me, for example:

 jose@k8scp:~$ kubectl -n kube-system get pods -o wide
NAME                                       READY   STATUS             RESTARTS         AGE     IP                NODE     NOMINATED NODE   READINESS GATES
calico-kube-controllers-6b9fbfff44-4lmlh   0/1     CrashLoopBackOff   58 (2m52s ago)   3h13m   192.168.164.138   k8swrk   <none>           <none>
calico-node-j8ntn                          1/1     Running            5 (82m ago)      41h     192.168.122.3     k8swrk   <none>           <none>
calico-node-tnffg                          1/1     Running            5                41h     192.168.122.2     k8scp    <none>           <none>
coredns-78fcd69978-hz2kl                   1/1     Running            2 (82m ago)      173m    192.168.74.146    k8scp    <none>           <none>
coredns-78fcd69978-mczhs                   1/1     Running            2 (82m ago)      173m    192.168.74.147    k8scp    <none>           <none>

<omitted>

As can be seen, the calico-kube-controllers pod is in CrashLoopBackOff status, and I suspect that is not a good sign.

What is going wrong here?

I tried kubectl drain again but with the same results.
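
In case it helps with debugging, I suppose the next step is to look at the crashing pod's events and logs, along these lines (pod name taken from the output above):

# Events for the crashing pod (scheduling, probes, restart back-off)
kubectl -n kube-system describe pod calico-kube-controllers-6b9fbfff44-4lmlh

# Logs from the current and from the previously crashed container
kubectl -n kube-system logs calico-kube-controllers-6b9fbfff44-4lmlh
kubectl -n kube-system logs calico-kube-controllers-6b9fbfff44-4lmlh --previous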

Comments

  • jmarinho (edited December 2021)

    I'm redoing Lab 4.1 from a snapshot of my VMs and, after upgrading the cp node, at step 15 I noticed that the problem is there again. I issued the command kubectl uncordon k8scp, but that didn't help. I'm appending the output for better debugging:

    jose@k8scp:~$ kubectl get node
    NAME     STATUS                     ROLES                  AGE   VERSION
    k8scp    Ready,SchedulingDisabled   control-plane,master   46h   v1.22.1
    k8swrk   Ready                      <none>                 46h   v1.21.1
    jose@k8scp:~$ kubectl -n kube-system get pods -o wide
    NAME                                       READY   STATUS             RESTARTS        AGE     IP                NODE     NOMINATED NODE   READINESS GATES
    calico-kube-controllers-6b9fbfff44-cwk6z   0/1     CrashLoopBackOff   6 (2m53s ago)   10m     192.168.164.131   k8swrk   <none>           <none>
    calico-node-j8ntn                          1/1     Running            2 (23h ago)     46h     192.168.122.3     k8swrk   <none>           <none>
    calico-node-tnffg                          1/1     Running            2 (23h ago)     46h     192.168.122.2     k8scp    <none>           <none>
    coredns-558bd4d5db-d22ht                   0/1     Running            0               10m     192.168.164.130   k8swrk   <none>           <none>
    coredns-78fcd69978-87kzd                   0/1     Running            0               5m30s   192.168.164.134   k8swrk   <none>           <none>
    coredns-78fcd69978-sbzck                   0/1     Running            0               5m30s   192.168.164.135   k8swrk   <none>           <none>
    etcd-k8scp                                 1/1     Running            0               7m6s    192.168.122.2     k8scp    <none>           <none>
    kube-apiserver-k8scp                       1/1     Running            0               6m22s   192.168.122.2     k8scp    <none>           <none>
    kube-controller-manager-k8scp              1/1     Running            0               5m59s   192.168.122.2     k8scp    <none>           <none>
    kube-proxy-bxbqz                           1/1     Running            0               5m24s   192.168.122.3     k8swrk   <none>           <none>
    kube-proxy-jmt4k                           1/1     Running            0               4m57s   192.168.122.2     k8scp    <none>           <none>
    kube-scheduler-k8scp                       1/1     Running            0               5m45s   192.168.122.2     k8scp    <none>           <none>
    jose@k8scp:~$ kubectl uncordon k8scp
    node/k8scp uncordoned
    jose@k8scp:~$ kubectl get node
    NAME     STATUS   ROLES                  AGE   VERSION
    k8scp    Ready    control-plane,master   46h   v1.22.1
    k8swrk   Ready    <none>                 46h   v1.21.1
    jose@k8scp:~$ kubectl -n kube-system get pods -o wide
    NAME                                       READY   STATUS    RESTARTS        AGE     IP                NODE     NOMINATED NODE   READINESS GATES
    calico-kube-controllers-6b9fbfff44-cwk6z   0/1     Running   7 (5m14s ago)   12m     192.168.164.131   k8swrk   <none>           <none>
    calico-node-j8ntn                          1/1     Running   2 (24h ago)     46h     192.168.122.3     k8swrk   <none>           <none>
    calico-node-tnffg                          1/1     Running   2 (24h ago)     46h     192.168.122.2     k8scp    <none>           <none>
    coredns-558bd4d5db-d22ht                   0/1     Running   0               12m     192.168.164.130   k8swrk   <none>           <none>
    coredns-78fcd69978-87kzd                   0/1     Running   0               7m51s   192.168.164.134   k8swrk   <none>           <none>
    coredns-78fcd69978-sbzck                   0/1     Running   0               7m51s   192.168.164.135   k8swrk   <none>           <none>
    etcd-k8scp                                 1/1     Running   0               9m27s   192.168.122.2     k8scp    <none>           <none>
    kube-apiserver-k8scp                       1/1     Running   0               8m43s   192.168.122.2     k8scp    <none>           <none>
    kube-controller-manager-k8scp              1/1     Running   0               8m20s   192.168.122.2     k8scp    <none>           <none>
    kube-proxy-bxbqz                           1/1     Running   0               7m45s   192.168.122.3     k8swrk   <none>           <none>
    kube-proxy-jmt4k                           1/1     Running   0               7m18s   192.168.122.2     k8scp    <none>           <none>
    kube-scheduler-k8scp                       1/1     Running   0               8m6s    192.168.122.2     k8scp    <none>           <none>
    jose@k8scp:~$ kubectl -n kube-system get pods -o wide
    NAME                                       READY   STATUS             RESTARTS      AGE     IP                NODE     NOMINATED NODE   READINESS GATES
    calico-kube-controllers-6b9fbfff44-cwk6z   0/1     CrashLoopBackOff   7 (8s ago)    12m     192.168.164.131   k8swrk   <none>           <none>
    calico-node-j8ntn                          1/1     Running            2 (24h ago)   46h     192.168.122.3     k8swrk   <none>           <none>
    calico-node-tnffg                          1/1     Running            2 (24h ago)   46h     192.168.122.2     k8scp    <none>           <none>
    coredns-558bd4d5db-d22ht                   0/1     Running            0             12m     192.168.164.130   k8swrk   <none>           <none>
    coredns-78fcd69978-87kzd                   0/1     Running            0             8m3s    192.168.164.134   k8swrk   <none>           <none>
    coredns-78fcd69978-sbzck                   0/1     Running            0             8m3s    192.168.164.135   k8swrk   <none>           <none>
    etcd-k8scp                                 1/1     Running            0             9m39s   192.168.122.2     k8scp    <none>           <none>
    kube-apiserver-k8scp                       1/1     Running            0             8m55s   192.168.122.2     k8scp    <none>           <none>
    kube-controller-manager-k8scp              1/1     Running            0             8m32s   192.168.122.2     k8scp    <none>           <none>
    kube-proxy-bxbqz                           1/1     Running            0             7m57s   192.168.122.3     k8swrk   <none>           <none>
    kube-proxy-jmt4k                           1/1     Running            0             7m30s   192.168.122.2     k8scp    <none>           <none>
    kube-scheduler-k8scp                       1/1     Running            0             8m18s   192.168.122.2     k8scp    <none>           <none>
    

    Only a few seconds passed between the last command and the previous one. Right after kubectl uncordon k8scp, the first kubectl -n kube-system get pods -o wide shows the calico-kube-controllers status as "Running", but a few seconds later it goes back to "CrashLoopBackOff".

    Could it be necessary to upgrade Calico too, as step 9 seems to suggest? If so, I don't know how to do it or which version would work with the upgraded Kubernetes.
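
    If it matters, I believe the currently deployed Calico version can be read from the calico-node daemonset image (assuming the manifest-based install in kube-system):

    # Show the calico-node image, and therefore the Calico version, currently running
    kubectl -n kube-system get daemonset calico-node -o jsonpath='{.spec.template.spec.containers[0].image}'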

  • jmarinho (edited December 2021)

    Well, I followed the instructions described here, which redirected me here for upgrading Calico installations through the calico.yaml manifest, and now I think the problem is gone. However, the Calico objects now belong to a new namespace (calico-system) instead of the original one (kube-system):

    jose@k8scp:~$ kubectl -n calico-system get pods -o wide
    NAME                                       READY   STATUS    RESTARTS   AGE    IP               NODE     NOMINATED NODE   READINESS GATES
    calico-kube-controllers-58494599f9-pr7kn   1/1     Running   0          106s   192.168.74.138   k8scp    <none>           <none>
    calico-node-8hfkw                          1/1     Running   0          47s    192.168.122.2    k8scp    <none>           <none>
    calico-node-drjf6                          1/1     Running   0          35s    192.168.122.3    k8swrk   <none>           <none>
    calico-typha-66698b6b8b-whnbt              1/1     Running   0          49s    192.168.122.3    k8swrk   <none>           <none>
    

    I continued with the worker node upgrade, everything was OK, and the previous errors were gone:

    jose@k8scp:~$ kubectl drain k8swrk --ignore-daemonsets
    node/k8swrk cordoned
    WARNING: ignoring DaemonSet-managed Pods: calico-system/calico-node-drjf6, kube-system/kube-proxy-bxbqz
    evicting pod kube-system/coredns-78fcd69978-sbzck
    evicting pod kube-system/coredns-78fcd69978-87kzd
    evicting pod calico-system/calico-typha-66698b6b8b-whnbt
    evicting pod kube-system/coredns-558bd4d5db-d22ht
    pod/calico-typha-66698b6b8b-whnbt evicted
    pod/coredns-78fcd69978-87kzd evicted
    pod/coredns-78fcd69978-sbzck evicted
    pod/coredns-558bd4d5db-d22ht evicted
    node/k8swrk evicted
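
    For completeness, a quick way to confirm where all the Calico components ended up after this kind of migration is something like:

    # The Calico pods should now show up under calico-system rather than kube-system
    kubectl get pods -A -o wide | grep -i calico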
    

    If this new configuration could cause me problems with incompatibilities in the next labs, I would appreciate it if somebody could warn me.
    Otherwise, I'll close this thread.

  • Hi @jmarinho,

    Your issues are caused by overlapping IP addresses between the node/VM IPs managed by the hypervisor and the pod IPs managed by Calico. As long as there is such overlap, your cluster will not operate successfully.

    I would recommend rebuilding your cluster and ensuring that the VM IP addresses managed by the hypervisor do not overlap the default 192.168.0.0/16 pod network managed by Calico. You could try assigning your VMs IP addresses from the 10.200.0.0/16 network to prevent any such IP address overlap.
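
    A quick way to check for this kind of overlap on an existing cluster (just a sketch, assuming a kubeadm-based setup where --pod-network-cidr was passed at init time) is to compare the node addresses with the recorded pod network:

    # Node/VM addresses as Kubernetes sees them
    kubectl get nodes -o wide

    # Cluster pod CIDR, taken from the kube-controller-manager flags captured in the dump
    kubectl cluster-info dump | grep -m 1 cluster-cidr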

    Regards,
    -Chris

  • Hi @chrispokorni,

    Sorry for not answering earlier, but I did not see your message until today.
    Thanks for your advice. You're right, that was the problem. I did not pay attention to the subnet mask, a very silly mistake.
    I thought that upgrading Calico as I mentioned had solved the issue, and for a while it seemed to. I did not have any problems after that, but before I noticed your answer I was having trouble installing Linkerd in Lab 11.1, which was probably related to this.
    Since I finally had to rebuild my cluster, I'm redoing the labs, and when I get to Lab 11 I will see whether that was the problem.

    Regards
    Jose

  • rsheikh

    I am glad to have found this thread. I was at precisely the stage of upgrading the worker node. I have spent about six days on it, and while troubleshooting I learned a lot, but not what I needed.

    I hope that @Chris can comment.

    AWS master-worker node

    ubuntu@ip-172-31-25-66:~$ ku -n kube-system get po -o wide

    NAME                                       READY   STATUS             RESTARTS         AGE     IP               NODE              NOMINATED NODE   READINESS GATES
    calico-kube-controllers-685b65ddf9-rlcrv   0/1     CrashLoopBackOff   23 (84s ago)     98m     172.31.104.240   ip-172-31-26-1
    calico-node-gm8sz                          1/1     Running            0                3h43m   172.31.26.1      ip-172-31-26-1
    calico-node-xnqnx                          1/1     Running            1 (3h17m ago)    3h43m   172.31.25.66     ip-172-31-25-66
    coredns-64897985d-fz4nj                    1/1     Running            6 (3h17m ago)    7d23h   172.31.52.227    ip-172-31-25-66
    coredns-64897985d-shblb                    1/1     Running            6 (3h17m ago)    7d23h   172.31.52.228    ip-172-31-25-66
    etcd-ip-172-31-25-66                       1/1     Running            7 (3h17m ago)    8d      172.31.25.66     ip-172-31-25-66
    kube-apiserver-ip-172-31-25-66             1/1     Running            10 (3h17m ago)   8d      172.31.25.66     ip-172-31-25-66
    kube-controller-manager-ip-172-31-25-66    1/1     Running            6 (3h17m ago)    8d      172.31.25.66     ip-172-31-25-66
    kube-proxy-x9xrc                           1/1     Running            6 (3h17m ago)    8d      172.31.25.66     ip-172-31-25-66
    kube-proxy-zcq2t                           1/1     Running            4 (4h41m ago)    8d      172.31.26.1      ip-172-31-26-1
    kube-scheduler-ip-172-31-25-66             1/1     Running            6 (3h17m ago)    8d      172.31.25.66     ip-172-31-25-66

    I am getting the following errors:

    DESCRIBE

    ku -n kube-system describe po calico-kube-controllers-685b65ddf9-rlcrv
    Warning Unhealthy 29m (x10 over 30m) kubelet Readiness probe failed: Failed to read status file /status/status.json: unexpected end of JSON input
    Normal Pulled 29m (x4 over 30m) kubelet Container image "docker.io/calico/kube-controllers:v3.23.1" already present on machine
    Warning BackOff 40s (x136 over 30m) kubelet Back-off restarting failed container

    LOG KUBE-PROXY
    ku -n kube-system logs kube-proxy-x9xrc
    E0703 17:58:39.111615 1 proxier.go:1600] "can't open port, skipping it" err="listen tcp4 :31107: bind: address already in use" port={Description:nodePort for default/nginx IP: IPFamily:4 Port:31107 Protocol:TCP}

    DESCRIBE
    ku -n kube-system describe po kube-proxy-x9xrc
    kube-proxy:
    Container ID: docker://fb063dc9345ee6e122f00c00265e7c41e5f330a240db855a0c580b71823207e7
    Image: k8s.gcr.io/kube-proxy:v1.23.1
    Image ID: docker-pullable://k8s.gcr.io/kube-proxy@sha256:e40f3a28721588affcf187f3f246d1e078157dabe274003eaa2957a83f7170c8
    Port:
    Host Port:
    ku -n kube-system describe po kube-proxy-zcq2t
    Containers:
    kube-proxy:
    Container ID: docker://a4163a3b6548904078d592ade2f948d0a96bb566863cbddbd153a5fa18fd0300
    Image: k8s.gcr.io/kube-proxy:v1.23.1
    Image ID: docker-pullable://k8s.gcr.io/kube-proxy@sha256:e40f3a28721588affcf187f3f246d1e078157dabe274003eaa2957a83f7170c8
    Port:
    Host Port:

    ku -n kube-system logs kube-proxy-zcq2t
    E0703 16:35:32.334879 1 proxier.go:1600] "can't open port, skipping it" err="listen tcp4 :31107: bind: address already in use" port={Description:nodePort for default/nginx IP: IPFamily:4 Port:31107 Protocol:TCP}
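
    (If I understand that message correctly, something on the node is already listening on that NodePort; I assume it can be checked with something like:)

    # See which process currently holds port 31107 on the node
    sudo ss -tlnp | grep 31107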

    ku -n kube-system logs calico-node-gm8sz
    2022-07-03 20:22:18.097 [INFO][71] monitor-addresses/autodetection_methods.go 103: Using autodetected IPv4 address on interface eth0: 172.31.26.1/20
    2022-07-03 20:22:22.413 [INFO][68] felix/summary.go 100: Summarising 12 dataplane reconciliation loops over 1m2.5s: avg=4ms longest=8ms (resync-nat-v4)

    ku -n kube-system logs calico-node-xnqnx
    2022-07-03 20:24:44.924 [INFO][71] monitor-addresses/autodetection_methods.go 103: Using autodetected IPv4 address on interface eth0: 172.31.25.66/20
    2022-07-03 20:24:47.346 [INFO][66] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m3.4s: avg=4ms longest=12ms ()

    error: unable to upgrade connection: container not found ("calico-kube-controllers")
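
    (For reference, the host network from the logs above and the addresses the pods are actually getting can be compared side by side with:)

    # Host/VM network on the primary interface (eth0 per the calico-node logs above)
    ip -4 addr show eth0

    # Addresses assigned to pods by the CNI, across all namespaces
    kubectl get pods -A -o wide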

    I could use some insight!

  • rsheikh

    @Chris, please ignore. I reviewed my steps against your instructions and the error was mine; the IP overlap did indeed occur. I am rebuilding the cluster. I am glad I was stuck for six days, since I learned a lot. Thanks.
