
Issues getting worker node to join cluster in Lab 3.2

As the title says - I'm having issues with the worker node in Lab 3.2.

The install finished successfully on the cp node. I am able to perform every step on the worker node up until the "kubeadm join" command. When I run the join command, it hangs at the following step:

[kubelet-start] Starting the kubelet
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is not healthy after 4m1.996320433s

When I look at "journalctl -xeu kubelet" for detailed errors, it tells me that conf files are missing/not found:

"Unhandled Error" err="unable to read existing bootstrap client config from /etc/kubernetes/kubelet.conf: invalid configuration
"command failed" err="failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory"

Wouldn't it be expected behavior that the conf files are not there until after joining the cluster and receiving them from the cp? Are there steps missing in the lab instructions?

I am running the lab on local VMs. I can confirm that they have basic connectivity on all ports.
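
For what it's worth, this is roughly how I checked connectivity (a sketch; the placeholder addresses stand in for my actual VM IPs):

# from the worker VM, check that the cp is reachable (replace <cp-ip> with the cp address)
ping -c 3 <cp-ip>
# and that the API server port on the cp is open
nc -zv <cp-ip> 6443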

Thanks for the help!

Comments

  • chrispokorni Posts: 2,544

    Hi @vtvash,

    The lab guide is complete with the steps necessary to successfully bootstrap a two node Kubernetes cluster.

    However, the steps will fail if the virtual environment is inadequate. VMs should have 2 vCPUs, 8 GB RAM, a 20 GB vdisk (fully allocated), a single bridged network interface, and IP addresses that do not overlap the 10.96.0.0/12 and 192.168.0.0/16 ranges. The guest OS should be Ubuntu 24.04 LTS. The hypervisor firewall should allow all incoming traffic from all sources, all protocols, to all ports (promiscuous mode enabled and set to allow-all).
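
    A quick way to verify a VM against these requirements is something like the following (a rough sketch, run on each VM):

    nproc              # expect at least 2 vCPUs
    free -h            # expect roughly 8 GB of RAM
    df -h /            # expect at least 20 GB of disk
    ip -4 addr show    # a single bridged interface, with an address outside 10.96.0.0/12 and 192.168.0.0/16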

    What is the host OS and architecture? What hypervisor are you running? Is nested virtualization enabled?

    Are there any work-related security controls active on your host?

    Regards,
    -Chris

  • vtvash Posts: 10

    The VMs meet the minimum requirements, are on a bridged network interface, and use the correct RFC1918 sub-range (I don't see those /12 and /16 subnet ranges listed anywhere, but thankfully the addresses they have fit). The network policy is wide open.

    All that being said, the error talks about a conf file on the local machine. Based on the pre-checks, it appears to be validating the local config before it even attempts network connectivity. Are the outputs a red herring? Is there anything else I can do to try to make it work? Do I just need to blow the VM away, start over, and hope that works?
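
    For reference, this is roughly what I have been running on the worker to see where it stalls (the endpoint, token, and hash are placeholders copied from the join command generated on the cp):

    # join with extra verbosity to see which phase hangs
    sudo kubeadm join k8scp:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash> --v=5
    # then check the kubelet afterwards
    systemctl status kubelet
    sudo journalctl -xeu kubelet | tail -50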

  • vtvash Posts: 10

    Bumping this because I'm still having issues.

    I went back and totally wiped out both VMs and started from scratch:
    Host: Win11 Home running Oracle VirtualBox; the host has a 16-core CPU and 32 GB RAM
    VM specs:
    - 7680 MB RAM, 2 cores, 25 GB storage
    - Nested virtualization enabled
    - Bridged network adapter w/ Promiscuous mode set to allow all
    - iptables policies set to ACCEPT
    - each VM has a 192.168.1.x/24 RFC1918 IP
    OS: Ubuntu Server 24.04.3

    I can confirm that both VMs are reachable from the host, and that the VMs can see/reach each other.

    The Kubernetes installation and kubeadm init commands all work fine on the cp, but the join from the worker node is what fails.

    I generate a fresh join token/command from the cp and then use it to try to join from the worker. Each time, the worker fails to "validate" because it can't find specific files.

    kubelet[1232]: E1204 17:04:43.349077 1232 bootstrap.go:241] "Unhandled Error" err="unable to read existing bootstrap client config from /etc/kubernetes/kubelet.conf: invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth
    kubelet[1232]: E1204 17:04:43.350711 1232 run.go:72] "command failed" err="failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory"
    systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE

    There are no specific steps to copy or create these files on the worker node. Shouldn't these files come from the cp anyway? Is there some other dependency that is missing from local VMs that is being overlooked because cloud providers include/set it automatically?
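
    For what it's worth, this is roughly how I checked what already exists on the worker before the join attempt (the paths are the ones from the error messages above):

    # the files the kubelet complains about are created during the join/bootstrap, not beforehand
    ls -l /etc/kubernetes/
    ls -l /var/lib/kubelet/pki/ 2>/dev/null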

  • chrispokorni Posts: 2,544

    Hi @vtvash,

    @vtvash said:
    There are no specific steps to copy or create these files on the worker node. Shouldn't these files come from the cp anyway? Is there some other dependency that is missing from local VMs that is being overlooked because cloud providers include/set it automatically?

    The labs work as provided on cloud VMs and on local VMs (VirtualBox, KVM, Hyper-V, ...). No steps are necessary to copy config files from the control plane to the worker.

    As mentioned earlier, overlapping RFC1918 IP ranges are not ideal, but let's get your cluster nodes joined first and worry about the container RFC1918 IP range later.

    To determine the state of your VMs, please provide the following details.
    Attach the kubeadm-init.out output file generated by the kubeadm init on your control plane node.
    Also attach the kubeadm-config.yaml manifest that is used by kubeadm init.
    Provide the hosts files of the control plane and the worker VMs.

    For each VM, control plane and worker, provide the outputs of:

    hostname -i
    hostname -I
    hostname -a
    hostname -A
    

    Provide the outputs of the following commands from the control plane VM. Capture the prompt (user@host), command, and output:

    kubectl config view
    kubectl get nodes -o wide
    kubectl get pods -A -o wide
    sudo kubeadm token create --print-join-command
    

    Provide the outputs of the following commands from the worker VM. Capture the prompt (user@host), command, and output:

    sudo kubeadm join ...
    

    Regards,
    -Chris

  • vtvash Posts: 10

    Chris,

    Here are the outputs you requested.

    As info - I also included the iptables output from the cp node in the "requested outputs" text file, because I saw that the "kubectl get" commands returned some localhost connection refused messages. It looks like during the install a KUBE-FIREWALL chain was added to iptables that drops non-localhost sources trying to hit localhost on the cp, but that seems correct to me.
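
    This is roughly how I pulled that chain, in case the exact rules matter:

    # list the KUBE-FIREWALL chain added during the install
    sudo iptables -L KUBE-FIREWALL -n -v
    # and the default policies of the filter table
    sudo iptables -L -n | head -20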

    Maybe these connection refused messages are expected when there are no actual nodes other than the cp, but I thought it might be helpful info.

    Thanks for your help,

  • chrispokorni Posts: 2,544

    Hi @vtvash,

    I can't seem to find the hosts files in the attachments.

    Also, you seem to be attempting to run kubectl as root. That is inconsistent with the lab guide, where a non-root user (student, for the purposes of the lab guide) owns the .kube/config file and is the one enabled to run kubectl commands.
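
    As a reminder, the usual setup for the non-root user looks roughly like this (these are the standard post-init steps printed by kubeadm init):

    # run as the regular user (student in the lab guide), on the control plane node
    mkdir -p $HOME/.kube
    sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
    sudo chown $(id -u):$(id -g) $HOME/.kube/config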

    Regards,
    -Chris

  • vtvash Posts: 10

    So, I tried uploading the missing outputs here and it's saying I'm blocked. I guess something about my filenames triggered the site.

    How can I get this resolved and/or send you the requested outputs?

  • vtvash Posts: 10

    Trying again:

    My mistake, you're right: kubectl is not supposed to be run as root in the lab guides. All of the worker node installation steps and kubeadm join are run as root, though.

    I'm attaching an updated file for those kubectl outputs from the cp.

  • fcioanca Posts: 2,409

    Have you tried using the code block option in the text ribbon? It should allow you to share the output in a code block. Or you can share screenshots.

  • vtvash Posts: 10

    OK - I guess it's flagging my attempts to put anything related to the hosts entries into my responses. Hopefully the screenshots upload OK.

  • vtvash Posts: 10

    @fcioanca said:
    Have you tried using the code block option in the text ribbon? It should allow you to share the output with the code block. Or you can share screenshots.

    Yes, but it still told me I was blocked. Thankfully screenshots worked.

    Thanks for your help,

  • chrispokorni Posts: 2,544

    Hi @vtvash,

    So far the hosts, nodes, and the pods seem to be as expected.

    My first concern is the multi-IP assignment on the control plane VM - 192.168.1.194 and .195 respectively. I recommend removing the extra interface. As mentioned earlier, a single bridged adapter is sufficient for all of the VM's networking needs in this lab environment. Ensure the interface you keep is the one aliased by k8scp in the hosts files of both the control plane and worker VMs.
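
    For reference, the hosts entry would look roughly like this on both VMs (sample IP; use whichever control plane address you keep):

    # /etc/hosts on the control plane and the worker
    10.200.0.7   k8scp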

    The second concern is the IP range overlap with the Pod network 192.168.0.0/16. To avoid this I strongly advise using a 10.200.0.0/16 or /24 IP range for the VirtualBox VMs. Your current IP address scheme will overlap with the Cilium-managed Pod network, causing conflicts and routing issues in your cluster.
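
    A quick way to confirm which Pod and Service CIDRs the cluster was initialized with is something like this (a rough sketch; podSubnet only appears if it was set in the kubeadm config):

    kubectl -n kube-system get cm kubeadm-config -o yaml | grep -A3 networking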

    For the IP addresses to be corrected on the VMs, the Kubernetes nodes will need a reset to remove all Kubernetes configuration and clean up each VM:

    sudo kubeadm reset

    Once both nodes are reset, feel free to perform any network interface removal and IP address reassignment tasks that are necessary. Edit both hosts files if necessary. Perform a new kubeadm init on the control plane VM. Upon completion, perform a connectivity check from the worker VM with netcat. Run the command once with the k8scp alias, and once with the IP address of the control plane VM. (Replace the sample IP 10.200.0.7 with your control plane VM IP.)

    Run netcat with -zv on IP 10.200.0.7 port 6443
    Run netcat with -zv on alias k8scp port 6443
    [actual commands omitted to avoid getting blocked]
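
    In standard netcat syntax (an assumption about the exact form), the two checks would be roughly:

    nc -zv 10.200.0.7 6443    # sample IP - use your control plane VM address
    nc -zv k8scp 6443         # the alias defined in /etc/hosts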

    If both connections succeed, then generate the join command and execute it from the worker VM.

    Regards,
    -Chris

  • vtvash Posts: 10

    @chrispokorni

    TL;DR - You were correct, the issue was caused by the IP conflict of my existing 192.168.1.0/24 host network with the default pod range of 192.168.0.0/16.

    I deleted and re-created my VMs and modified the kubeadm-config.yaml and cilium-cni.yaml files to use 10.100.0.0/16 for the pod network range (to avoid conflicting with my current host network range). I realize this means I need to keep track of anything else that references 192.168.0.0/16 to keep things consistent, but it is what works best in my scenario.
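
    For anyone else who hits this, the change was roughly the following (field names as in the standard kubeadm ClusterConfiguration; the Cilium manifest gets the same CIDR value, and the init flags are abbreviated):

    # kubeadm-config.yaml - point the pod network at a range that does not overlap the VM LAN:
    #
    #   networking:
    #     podSubnet: 10.100.0.0/16
    #
    # then re-initialize the control plane with the edited config
    sudo kubeadm init --config=kubeadm-config.yaml | tee kubeadm-init.out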

    Once I changed the pod IP range, everything worked as expected.

    I thought that the fact that there were thousands of possible IPs between 192.168.0.0 and 192.168.1.0 would save me from IP conflicts. I was wrong :D lol. At least it was a good learning experience I suppose.

    Thank you for all of your help,

  • chrispokorni Posts: 2,544

    Hi @vtvash,

    While you resolved the issues caused by overlapping VM and Pod IP address ranges, at times you may see slightly ambiguous IP addresses being assigned between your Pods and your Services, as the selected Pod IP range (10.100.0.0/16) slightly overlaps the default Service range (10.96.0.0/12) - details also noted earlier in this thread.

    Regards,
    -Chris

  • vtvash Posts: 10

    Thanks for the heads up.

    Yeah, I noticed the 10.96.0.0/12 range later in the lab. There should be enough IPs available between 10.96 and 10.100 to give it plenty of space, but fingers crossed.

    What is the recommended strategy for resizing a cluster? At an enterprise level I don't think many networks can support having an entire /12 dedicated to a "single" service. So if you set an initial cluster IP range and later need to grow it, can that be done without resetting the entire cluster? I know tools like Ansible could help make re-creating the cluster faster, but still, you'd feel safer without having to totally tear down and rebuild.

    Just curious.

    Thanks again,
