Exercise 3.3 - Worker node not ready

I have built the lab in Google Cloud as suggested, and I believe I have followed the instructions to the letter. However, after I joined my worker node to the cluster (Exercise 3.2), the worker node has been in a "NotReady" state for a few hours. I think it's the network connectivity between the nodes: when I try to ping one node from the other I get "No route to host", yet if I deploy a third VM on the same network, both nodes can ping the third VM and the third VM can ping the other two. Cheers.

Comments

  • chrispokorni

    Hi @jhurlstone,

    Are all VMs in the same custom VPC and subnet, and does the VPC firewall rule allow all inbound traffic, as per the demo video from the introductory chapter?
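
    For reference, an all-inbound rule on a custom VPC amounts to something like the following (the rule and network names here are illustrative, not from the course):

    gcloud compute firewall-rules create allow-all \
        --network=cluster-vpc \
        --direction=INGRESS \
        --action=ALLOW \
        --rules=all \
        --source-ranges=0.0.0.0/0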

    Regards,
    -Chris

  • I believe that I followed the video exactly. I have rebooted both the master and worker VMs, and they can now ping each other. I have also been running "kubectl get nodes" on the master periodically; very occasionally it reports that the worker is Ready, then a few seconds later it goes back to "NotReady".

  • chrispokorni

    Hi @jhurlstone,

    What are the outputs of

    kubectl get nodes -o wide

    kubectl get pods -A -o wide

    Regards,
    -Chris

  • kubectl get nodes -o wide
    NAME     STATUS     ROLES           AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
    master   Ready      control-plane   5h37m   v1.25.1   10.2.0.6      <none>        Ubuntu 20.04.5 LTS   5.15.0-1030-gcp   containerd://1.6.18
    worker   NotReady   <none>          3h44m   v1.25.1   10.2.0.7      <none>        Ubuntu 20.04.5 LTS   5.15.0-1030-gcp   containerd://1.6.18

    kubectl get pods -A -o wide
    NAMESPACE     NAME                                       READY   STATUS    RESTARTS       AGE     IP            NODE     NOMINATED NODE   READINESS GATES
    kube-system   calico-kube-controllers-74677b4c5f-p9c9c   1/1     Running   1 (128m ago)   5h29m   10.2.219.69   master   <none>           <none>
    kube-system   calico-node-qhv2t                          0/1     Running   1 (30m ago)    40m     10.2.0.7      worker   <none>           <none>
    kube-system   calico-node-tv5cx                          0/1     Running   1 (128m ago)   5h29m   10.2.0.6      master   <none>           <none>
    kube-system   coredns-565d847f94-885qq                   1/1     Running   1 (128m ago)   5h38m   10.2.219.68   master   <none>           <none>
    kube-system   coredns-565d847f94-rc2l2                   1/1     Running   1 (128m ago)   5h38m   10.2.219.70   master   <none>           <none>
    kube-system   etcd-master                                1/1     Running   1 (128m ago)   5h38m   10.2.0.6      master   <none>           <none>
    kube-system   kube-apiserver-master                      1/1     Running   1 (128m ago)   5h38m   10.2.0.6      master   <none>           <none>
    kube-system   kube-controller-manager-master             1/1     Running   1 (128m ago)   5h38m   10.2.0.6      master   <none>           <none>
    kube-system   kube-proxy-2xfv5                           1/1     Running   4 (30m ago)    3h44m   10.2.0.7      worker   <none>           <none>
    kube-system   kube-proxy-rlsxq                           1/1     Running   1 (128m ago)   5h38m   10.2.0.6      master   <none>           <none>
    kube-system   kube-scheduler-master                      1/1     Running   1 (128m ago)   5h38m   10.2.0.6      master   <none>           <none>
    root@master:~#

  • chrispokorni

    Hi @jhurlstone,

    What is the machine type of your GCE VMs?

    Are you running kubectl as root? Why?

    Regards,
    -Chris

  • I may have spotted a typo in my installation: I did not replace "k8scp" in controlPlaneEndpoint: "k8scp:6443" with the actual hostname when I used kubeadm-config.yaml. The hostname of my control plane VM is "master".
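
    For context, the relevant part of kubeadm-config.yaml looks roughly like this (a sketch following the lab guide; the exact fields in your copy may differ):

    apiVersion: kubeadm.k8s.io/v1beta3
    kind: ClusterConfiguration
    kubernetesVersion: 1.25.1
    controlPlaneEndpoint: "k8scp:6443"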

  • chrispokorni

    You don't have to replace it. As long as the alias is in the /etc/hosts file, it should all work. Also, make sure you have the correct control plane node IP in both the cp and worker /etc/hosts files.
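
    For example, with the IP from this thread, /etc/hosts on both nodes would include an entry like:

    10.2.0.6 k8scp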

  • The machine types are "e2-standard-2".

  • chrispokorni

    Hi @jhurlstone,

    Since you modified your installation and configured root with kubectl (not a good practice), what other changes have you made?

    When running the following commands:

    kubectl describe node worker

    kubectl -n kube-system describe pod calico-node-tv5cx

    What are the Events at the very bottom of both outputs?

    Regards,
    -Chris

  • I have double-checked /etc/hosts, which contains the same entry on both nodes ("10.2.0.6 k8scp"), and have just run "kubectl get nodes -o wide" several times; as you can see below, the worker does occasionally come Ready, then flicks back to "NotReady".

    student@master:~$ kubectl get nodes -o wide
    NAME     STATUS   ROLES           AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
    master   Ready    control-plane   6h1m   v1.25.1   10.2.0.6      <none>        Ubuntu 20.04.5 LTS   5.15.0-1030-gcp   containerd://1.6.18
    worker   Ready    <none>          4h8m   v1.25.1   10.2.0.7      <none>        Ubuntu 20.04.5 LTS   5.15.0-1030-gcp   containerd://1.6.18
    student@master:~$ kubectl get nodes -o wide
    NAME     STATUS   ROLES           AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
    master   Ready    control-plane   6h1m   v1.25.1   10.2.0.6      <none>        Ubuntu 20.04.5 LTS   5.15.0-1030-gcp   containerd://1.6.18
    worker   Ready    <none>          4h8m   v1.25.1   10.2.0.7      <none>        Ubuntu 20.04.5 LTS   5.15.0-1030-gcp   containerd://1.6.18
    student@master:~$ kubectl get nodes -o wide
    NAME     STATUS   ROLES           AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
    master   Ready    control-plane   6h2m   v1.25.1   10.2.0.6      <none>        Ubuntu 20.04.5 LTS   5.15.0-1030-gcp   containerd://1.6.18
    worker   Ready    <none>          4h8m   v1.25.1   10.2.0.7      <none>        Ubuntu 20.04.5 LTS   5.15.0-1030-gcp   containerd://1.6.18
    student@master:~$ kubectl get nodes -o wide
    NAME     STATUS     ROLES           AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
    master   Ready      control-plane   6h2m   v1.25.1   10.2.0.6      <none>        Ubuntu 20.04.5 LTS   5.15.0-1030-gcp   containerd://1.6.18
    worker   NotReady   <none>          4h9m   v1.25.1   10.2.0.7      <none>        Ubuntu 20.04.5 LTS   5.15.0-1030-gcp   containerd://1.6.18
    student@master:~$ kubectl get nodes -o wide
    NAME     STATUS     ROLES           AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
    master   Ready      control-plane   6h2m   v1.25.1   10.2.0.6      <none>        Ubuntu 20.04.5 LTS   5.15.0-1030-gcp   containerd://1.6.18
    worker   NotReady   <none>          4h9m   v1.25.1   10.2.0.7      <none>        Ubuntu 20.04.5 LTS   5.15.0-1030-gcp   containerd://1.6.18
    student@master:~$ kubectl get nodes -o wide
    NAME     STATUS     ROLES           AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
    master   Ready      control-plane   6h2m   v1.25.1   10.2.0.6      <none>        Ubuntu 20.04.5 LTS   5.15.0-1030-gcp   containerd://1.6.18
    worker   NotReady   <none>          4h9m   v1.25.1   10.2.0.7      <none>        Ubuntu 20.04.5 LTS   5.15.0-1030-gcp   containerd://1.6.18
    student@master:~$
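
    As an aside, the same flapping can be watched continuously instead of re-running the command by hand, using kubectl's watch flag:

    kubectl get nodes -o wide -w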

  • kubectl describe node worker

    Most recent events:

    Normal Starting 8m57s kubelet Starting kubelet.
    Warning InvalidDiskCapacity 8m57s kubelet invalid capacity 0 on image filesystem
    Normal NodeAllocatableEnforced 8m54s kubelet Updated Node Allocatable limit across pods
    Normal NodeHasNoDiskPressure 7m35s (x7 over 8m57s) kubelet Node worker status is now: NodeHasNoDiskPressure
    Normal NodeHasSufficientPID 7m35s (x7 over 8m57s) kubelet Node worker status is now: NodeHasSufficientPID
    Normal RegisteredNode 6m50s node-controller Node worker event: Registered Node worker in Controller
    Normal NodeHasSufficientMemory 3m10s (x10 over 8m57s) kubelet Node worker status is now: NodeHasSufficientMemory
    Normal NodeNotReady 2m30s (x2 over 6m10s) node-controller Node worker status is now: NodeNotReady

    kubectl -n kube-system describe pod calico-node-tv5cx

    Events:
    Type Reason Age From Message
    ---- ------ ---- ---- -------
    Warning Unhealthy 15m (x396 over 144m) kubelet (combined from similar events): Readiness probe failed: 2023-03-14 16:55:26.722 [INFO][25020] confd/health.go 180: Number of node(s) with BGP peering established = 0
    calico/node is not ready: BIRD is not ready: BGP not established with 10.2.0.7
    Normal SandboxChanged 9m33s kubelet Pod sandbox changed, it will be killed and re-created.
    Normal Pulled 9m33s kubelet Container image "docker.io/calico/cni:v3.25.0" already present on machine
    Normal Created 9m33s kubelet Created container upgrade-ipam
    Normal Started 9m33s kubelet Started container upgrade-ipam
    Normal Pulled 9m31s kubelet Container image "docker.io/calico/cni:v3.25.0" already present on machine
    Normal Created 9m31s kubelet Created container install-cni
    Normal Started 9m31s kubelet Started container install-cni
    Normal Pulled 9m27s kubelet Container image "docker.io/calico/node:v3.25.0" already present on machine
    Normal Created 9m27s kubelet Created container mount-bpffs
    Normal Started 9m27s kubelet Started container mount-bpffs
    Normal Pulled 9m26s kubelet Container image "docker.io/calico/node:v3.25.0" already present on machine
    Normal Created 9m26s kubelet Created container calico-node
    Normal Started 9m26s kubelet Started container calico-node
    Warning Unhealthy 9m25s kubelet Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory
    Warning Unhealthy 9m23s (x2 over 9m24s) kubelet Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
    Warning Unhealthy 9m13s kubelet Readiness probe failed: 2023-03-14 17:01:32.575 [INFO][248] confd/health.go 180: Number of node(s) with BGP peering established = 0
    calico/node is not ready: BIRD is not ready: BGP not established with 10.2.0.7
    Warning Unhealthy 9m3s kubelet Readiness probe failed: 2023-03-14 17:01:42.940 [INFO][272] confd/health.go 180: Number of node(s) with BGP peering established = 0
    calico/node is not ready: BIRD is not ready: BGP not established with 10.2.0.7
    Warning Unhealthy 5m23s kubelet Readiness probe failed: 2023-03-14 17:05:22.612 [INFO][946] confd/health.go 180: Number of node(s) with BGP peering established = 0
    calico/node is not ready: BIRD is not ready: BGP not established with 10.2.0.7
    Warning Unhealthy 2m13s kubelet Readiness probe failed: 2023-03-14 17:08:32.587 [INFO][1457] confd/health.go 180: Number of node(s) with BGP peering established = 0
    calico/node is not ready: BIRD is not ready: BGP not established with 10.2.0.7
    Warning Unhealthy 2m3s kubelet Readiness probe failed: 2023-03-14 17:08:42.513 [INFO][1489] confd/health.go 180: Number of node(s) with BGP peering established = 0
    calico/node is not ready: BIRD is not ready: BGP not established with 10.2.0.7
    Warning Unhealthy 113s kubelet Readiness probe failed: 2023-03-14 17:08:52.532 [INFO][1515] confd/health.go 180: Number of node(s) with BGP peering established = 0
    calico/node is not ready: BIRD is not ready: BGP not established with 10.2.0.7
    Warning Unhealthy 113s kubelet Readiness probe failed: 2023-03-14 17:08:52.765 [INFO][1537] confd/health.go 180: Number of node(s) with BGP peering established = 0
    calico/node is not ready: BIRD is not ready: BGP not established with 10.2.0.7
    Warning Unhealthy 93s (x2 over 103s) kubelet (combined from similar events): Readiness probe failed: 2023-03-14 17:09:12.519 [INFO][1578] confd/health.go 180: Number of node(s) with BGP peering established = 0
    calico/node is not ready: BIRD is not ready: BGP not established with 10.2.0.7

  • chrispokorni

    Hi @jhurlstone,

    I would recommend revisiting the infra provisioning steps and following the custom VPC, subnet, and firewall config as presented in the video. Once that is fixed, BGP should be established, both calico-node pods should show 1/1 Running, and the worker should become Ready as well.
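
    For reference, the custom network setup amounts to something like the following (the names and region are illustrative; the 10.2.0.0/16 range matches the node IPs in your outputs):

    gcloud compute networks create cluster-vpc --subnet-mode=custom
    gcloud compute networks subnets create cluster-subnet \
        --network=cluster-vpc \
        --region=europe-west2 \
        --range=10.2.0.0/16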

    Regards,
    -Chris

  • Hi @chrispokorni

    I have run through the video and checked that all the settings are the same, with the following exceptions: for "Region" I have chosen a region local to me in the UK, and in the firewall setup the instructor selected "IP Ranges" as the source filter, which is no longer an available option, so I chose "IPv4 ranges". These are the only differences I can see. As for routing, there is an option for "Dynamic routing mode"; my setup is configured identically to the video, i.e. set to "Regional".

    Many thanks for your ongoing assistance with this.

  • Just to let you know, I have got it working by following the instructions at this link:

    https://www.unixcloudfusion.in/2022/02/solved-caliconode-is-not-ready-bird-is.html
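
    If it helps anyone else, the gist of that fix (as I understood it) is to point Calico's IP autodetection at the VM's actual network interface; the ens.* pattern below is what matched on my GCE VMs, so adjust it to your own interface name:

    kubectl -n kube-system set env daemonset/calico-node \
        IP_AUTODETECTION_METHOD=interface=ens.*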

    Cheers

  • chrispokorni

    Hi @jhurlstone,

    Thank you for posting the solution that worked for you.
    Unfortunately, I was not able to reproduce the issue you reported, so I could not test the suggested solution either, but I will keep it in mind for the future.

    Regards,
    -Chris

  • Hi @chrispokorni

    Another thought: when I built my VMs following the instructions in the video and selected "Ubuntu 20.04 LTS", I was forced to select the x86/64 architecture. Whether this made a difference to Calico's ability to identify the Ethernet interface properly ("eth" vs "ens") would be a guess; at the moment I am just happy I could resolve the issue and continue with the training. Cheers, Jonathan.
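
    For reference, the interface naming on a VM can be checked directly (on my GCE instances it was the "ens" style rather than "eth"):

    ip -br link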
