Welcome to the Linux Foundation Forum!

Issues with running k8sMaster.sh: Swap & Time Out

First, thanks for this excellent course! I had made it through most of unit 4 before the recent update. I understand that these updates need to happen, and I appreciate that we're being prepared for the latest version of the exam. That said, I'm running into some issues getting started on new VMs with the new k8sMaster.sh script.

Setup

This is the setup that was working fine with the V2021-01-26 materials. (The only maintenance I needed to do was to run sudo ntpdate time.nist.gov after restarting the VMs and occasionally rerun the k8sMaster.sh / k8sSecond.sh commands on the respective VMs.)

  • 2 Ubuntu VMs running locally via VirtualBox on Windows 10 Business, v21H1, 10.0.19043, 32 GB RAM, i9-10885H CPU @ 2.40GHz
    • 2 GB RAM, 2 processor cores, 25 GB virtual disk image, Network: Attached to Bridged Adapter to Ethernet port
    • OS installed via ubuntu-18.04.5-live-server-amd64.iso (Date modified: 2021-03-20 2:58 PM)

I'm having the following issues when running the V2021-05-26 version of the setup scripts:
1. This one might be a recommendation for updating the scripts. When running bash k8sMaster.sh | tee $HOME/master.out, there's an error about swap not being disabled. So now, as a workaround, each time I re-attempt on a fresh VM, I disable swap before running the scripts. (I either run sudo swapoff -a or edit /etc/fstab and reboot.)

[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR Swap]: running with swap on is not supported. Please disable swap
2. This is where my current support question lies. There is a timeout in the [kubelet-check] step (after the sudo kubeadm init --config=$(find / -name kubeadm.yaml 2>/dev/null ) command). See the pertinent output below. I'll also include samples of the output of the debugging steps recommended by the error message.
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.

        Unfortunately, an error has occurred:
                timed out waiting for the condition

        This error is likely caused by:
                - The kubelet is not running
                - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

        If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
                - 'systemctl status kubelet'
                - 'journalctl -xeu kubelet'

        Additionally, a control plane component may have crashed or exited when started by the container runtime.
        To troubleshoot, list all containers using your preferred container runtimes CLI.

        Here is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:
                - 'crictl --runtime-endpoint unix:///var/run/crio/crio.sock ps -a | grep kube | grep -v pause'
                Once you have found the failing container, you can inspect its logs with:
                - 'crictl --runtime-endpoint unix:///var/run/crio/crio.sock logs CONTAINERID'

error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher

The outputs of the recommended debugging steps are:

master@master:~$ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Sun 2021-05-30 20:18:49 UTC; 5h 21min ago
     Docs: https://kubernetes.io/docs/home/
 Main PID: 20472 (kubelet)
    Tasks: 15 (limit: 2316)
   CGroup: /system.slice/kubelet.service
           └─20472 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --co
May 31 01:40:44 master kubelet[20472]: E0531 01:40:44.774526   20472 kuberuntime_sandbox.go:68] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to mount container k8s_POD_kube-s
May 31 01:40:44 master kubelet[20472]: E0531 01:40:44.774546   20472 kuberuntime_manager.go:790] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to mount container k8s_POD_kube-s
May 31 01:40:44 master kubelet[20472]: E0531 01:40:44.774591   20472 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-scheduler-master_kube-system(22ea193343aa28
May 31 01:40:44 master kubelet[20472]: E0531 01:40:44.808612   20472 kubelet.go:2291] "Error getting node" err="node \"master\" not found"
May 31 01:40:44 master kubelet[20472]: E0531 01:40:44.908942   20472 kubelet.go:2291] "Error getting node" err="node \"master\" not found"
# 5 similar lines omitted

and

master@master:~$ journalctl -xeu kubelet
# (1001 lines, but here's a sample)
May 31 02:20:52 master kubelet[20472]: E0531 02:20:52.945247   20472 kubelet.go:2291] "Error getting node" err="node \"master\" not found"
May 31 02:20:52 master kubelet[20472]: E0531 02:20:52.985422   20472 eviction_manager.go:255] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"master\" not found"
May 31 02:20:53 master kubelet[20472]: E0531 02:20:53.045960   20472 kubelet.go:2291] "Error getting node" err="node \"master\" not found"

and

master@master:~$ sudo crictl --runtime-endpoint unix:///var/run/crio/crio.sock ps -a | grep kube | grep -v pause
[sudo] password for master:
# [no output]

Running the command with the verbosity flag (sudo kubeadm init --config=$(find ~ -name kubeadm.yaml 2>/dev/null ) --v=5) doesn't add any more useful information.
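For what it's worth, the swap workaround from item 1 can be made persistent so it survives reboots. A minimal sketch, demonstrated on a sample file rather than the real /etc/fstab (the UUID and swap line below are made up for illustration):

```shell
# Comment out any swap entries so swap stays off across reboots.
# Shown here on a sample copy; on the real VM, run the sed against
# /etc/fstab (with sudo) and also run `sudo swapoff -a` for the current boot.
printf '%s\n' \
  'UUID=abcd-1234 / ext4 defaults 0 1' \
  '/swap.img none swap sw 0 0' > fstab.sample
sed -i '/\sswap\s/s/^/#/' fstab.sample
grep swap fstab.sample   # the swap line is now commented out
```

With the entry commented out, a reboot no longer re-enables swap, so the preflight check should pass without rerunning swapoff each time.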

Comments

  • serewicz Posts: 899

    Hello,

    Please refer to the Exercise 2.1 Overview and Preliminaries section. The system requirements are explained both in the text and in the summary, including the need to disable swap. With far less memory than required (you wrote 2 GB when the course requires 8 GB), you will encounter many odd errors, including pods not starting.

    The error about master not being found: is that the actual hostname of your VM? Are you using the updated tarball? If you use the updated script, then a previous step will create an alias, k8scp, in /etc/hosts, and that becomes the name used for the control plane node. Please check that you are using the updated files.
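    A quick way to sanity-check that alias, sketched on a sample file (the 10.0.0.10 address is made up; substitute your control plane node's actual IP):

    ```shell
    # Stand-in for /etc/hosts: the control-plane alias must map to the
    # control plane node's primary IP (compare against `hostname -I` on the VM).
    printf '%s\n' '127.0.0.1 localhost' '10.0.0.10 k8scp' > hosts.sample
    grep -w k8scp hosts.sample   # on the real VM: grep -w k8scp /etc/hosts
    ```

    If the grep comes back empty or shows a stale IP, kubeadm will not be able to reach the name it was configured with.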

    Regards,

  • trenchguinea Posts: 4

    I'm having this same problem. Thinking it might be part of the issue, I also added another alias to /etc/hosts called "master" (since the script added k8smaster and the pre-flight messages were complaining about not finding a "master" node). The pre-flight warnings went away but the problem did not.

    Each VM has 2 CPU assigned with 6 GB each (host device isn't large enough for VMs to go bigger).

    I've verified my two VMs can ping / ssh each other. No firewall running on the host or on either VM. The errors in the journal are filled with the same "failed to mount" errors the OP had above as well as "connection refused" K8S API calls on port 6443.

    I have 1 network adapter for each VM in bridged mode, and all nodes have 10.0.0.x addresses.

    Anyone have ideas on where else to look?

  • chrispokorni Posts: 1,032

    Hi @trenchguinea,

    Did you enable promiscuous mode and allow all traffic for the bridged adapters? Also, what is the size of your host (CPU and MEM)?

    Regards,
    -Chris

  • serewicz Posts: 899

    Hello,

    I was also wondering: what are the details of the OS you are using? There may be a version of Ubuntu 18.04 that has an issue with the default kernel, but I am still trying to figure out which version(s) have the issue.

    Regards,

  • trenchguinea Posts: 4
    edited June 9

    Yes, both bridge adapters (master and worker) are set to promiscuous and all traffic allowed. Host OS is Mac OS X 11.2.3 (Big Sur), and VMs are running Ubuntu 18.04.5. Host system has 8 core CPU w/ 16 GB RAM. Each VM was given 2 core, 6 GB, and 40 GB storage. Available RAM at the time of running kubeadm is 4 GB.

  • chrispokorni Posts: 1,032

    Hi @trenchguinea,

    Although resources are a bit tight in your setup, I would expect all components to work with the assigned amount of resources.

    There have been issues reported on Big Sur's support for VirtualBox VMs, but I am not aware whether they have been fixed. Assuming that the guest additions have installed properly, and the host's firewall allows all traffic to your guest VMs, it still seems to be a networking issue. Ping shows you that the VMs can see each other, while ssh shows that port 22 is listening. Your "connection refused" on port 6443 shows that traffic to that port may still be blocked.

    Regards,
    -Chris

  • Rather than traffic to the port being blocked, I think the issue might be that the service that's supposed to be listening on port 6443 isn't running. Looking at the journal logs, I'm guessing that service is the apiserver container. If that container doesn't start, that would also explain all of these connection refused errors. But then the question becomes: why wouldn't that container start?

    'systemctl status crio' shows the service is running.

  • serewicz Posts: 899
    edited June 10

    Hello,

    Indeed, I suspect that your kubeadm init did not complete. It is a resource-intensive process. A few things to try:

    1. Put all resources into the control plane node so it has 2 CPUs and 8 GB of memory, or more, and see if kubeadm init works. If it does, we know it was just a resource issue.

    2. Build the cluster using Docker. If it works using Docker we can rule out issues with networking, but it could still be resource related.

    3. Double check that your IP and /etc/hosts entries are correct.

    4. Ensure there is only one interface on each VM. Multiple interfaces can cause issues and require other settings.

    Hopefully these tests allow us to find the issue you are experiencing.

    Regards,

  • I increased the RAM to 10 GB, but it still didn't work, so I set it back down to 6 GB.

    I then switched the --cri-socket to /var/run/dockershim.sock (after changing Docker to use systemd for the cgroup driver), and kubeadm was able to finish successfully. So I guess I'll continue with the labs using Docker. Will that cause issues downstream, or will I be able to complete the labs without issue in this setup?
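    For anyone else making the same switch, the cgroup-driver change mentioned above is typically done through Docker's daemon.json. A hedged sketch (the file is written locally here; the /etc/docker path and restart step are the usual ones but verify against your setup):

    ```shell
    # Tell Docker to use the systemd cgroup driver, matching the kubelet's
    # expectation on a systemd-based distro like Ubuntu 18.04.
    cat > daemon.json <<'EOF'
    {
      "exec-opts": ["native.cgroupdriver=systemd"]
    }
    EOF
    # On the VM: sudo cp daemon.json /etc/docker/daemon.json
    #            sudo systemctl restart docker
    grep cgroupdriver daemon.json
    ```

    After the restart, `docker info | grep -i cgroup` should report systemd before you rerun kubeadm init.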

    I wonder if the networking issue was related to some documentation I read saying that promiscuous mode doesn't work with all wireless adapters, and a wireless adapter is what I'm using.

  • serewicz Posts: 899

    Hello,

    Docker works fine and is better supported by some tools. The community is moving away from Docker, however, which is why the class has recently moved to cri-o. If you look in the course tarball, I have left in some of the Docker-based scripts.

    Production clusters would rarely use wireless, which could be the cause; some drivers may not be fully integrated with calls from cri-o yet.

    Regards,

  • leifsegen Posts: 7
    edited June 21

    @serewicz Thanks for pointing out my RAM error. When using 8 GB RAM, 2 CPUs, a 25 GB virtual hard drive, Ubuntu 18.04.5, and a bridged network on a fresh VM named 'cp', manually turning off swap first, and then using the new V2021-06-15 k8scp.sh script, I'm still getting the timeout error:

    [wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
    [kubelet-check] Initial timeout of 40s passed.
    
            Unfortunately, an error has occurred:
                    timed out waiting for the condition
    
            This error is likely caused by:
                    - The kubelet is not running
                    - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
            If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
                    - 'systemctl status kubelet'
                    - 'journalctl -xeu kubelet'
    
            Additionally, a control plane component may have crashed or exited when started by the container runtime.
            To troubleshoot, list all containers using your preferred container runtimes CLI.
    
            Here is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:
                    - 'crictl --runtime-endpoint unix:///var/run/crio/crio.sock ps -a | grep kube | grep -v pause'
                    Once you have found the failing container, you can inspect its logs with:
                    - 'crictl --runtime-endpoint unix:///var/run/crio/crio.sock logs CONTAINERID'
    
    error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
    To see the stack trace of this error execute with --v=5 or higher
    

    Output of systemctl status kubelet:

    ● kubelet.service - kubelet: The Kubernetes Node Agent
       Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
      Drop-In: /etc/systemd/system/kubelet.service.d
               └─10-kubeadm.conf
       Active: active (running) since Sun 2021-06-20 16:17:07 UTC; 20h ago
         Docs: https://kubernetes.io/docs/home/
     Main PID: 6587 (kubelet)
        Tasks: 16 (limit: 4915)
       CGroup: /system.slice/kubelet.service
               └─6587 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remo
    Jun 21 13:04:59 cp kubelet[6587]: E0621 13:04:59.745856    6587 kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    [duplicates omitted]
    

    A sampling of the output of journalctl -xeu kubelet includes:

    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"                                                                                
    remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to mount container k8s_POD_kube-a
    kuberuntime_sandbox.go:68] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to mount container k8s_POD_kube-apiser
    kuberuntime_manager.go:790] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to mount container k8s_POD_kube-apiser
    pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-apiserver-cp_kube-system(9976d2d1b70978ac90aee44
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    certificate_manager.go:437] Failed while requesting a signed certificate from the master: cannot create certificate signing request: Post "https:
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://[net.wo.rk.ip]:6443/apis/coordination.k8s.io/v1/namespaces/
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    kubelet_node_status.go:71] "Attempting to register node" node="cp"
    kubelet_node_status.go:93] "Unable to register node with API server" err="Post \"https://[net.wo.rk.ip]:6443/api/v1/nodes\": dial tcp [net.wo.rk.ip]:
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"cp.168a569ad02f168c"
    event.go:218] Unable to write event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"cp.168a569af27001cc",
    event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"cp.168a569ad02f1eb1"
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"                          
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://[net.wo.rk.ip]:6443/apis/coordination.k8s.io/v1/namespaces/
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    kubelet_node_status.go:71] "Attempting to register node" node="cp"
    kubelet_node_status.go:93] "Unable to register node with API server" err="Post \"https://[net.wo.rk.ip]:6443/api/v1/nodes\": dial tcp [net.wo.rk.ip]:
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"                                                                                 
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    eviction_manager.go:255] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"cp\" not found"                    
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    

    If these VirtualBox setups just aren't fully supported anymore, could you let us know? I'll try to move forward with AWS or GCP in that case.

  • leifsegen Posts: 7

    In case this applies, about the same time that [wait-control-plane] line starts, the main VM window starts displaying this:

  • leifsegen Posts: 7

    And I should probably include these lines from the end of the k8scp.sh output - specifically lines 1 and 26, which show the impact of the timeout error:

    error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
    To see the stack trace of this error execute with --v=5 or higher
    Running the steps explained at the end of the init output for you
    Apply Calico network plugin from ProjectCalico.org
    If you see an error they may have updated the yaml file
    Use a browser, navigate to the site and find the updated file
    The connection to the server [net.wo.rk.ip]:6443 was refused - did you specify the right host or port?
    
    alias sudo=sudo
    alias docker=podman
    --2021-06-21 14:08:48--  https://get.helm.sh/helm-v3.5.4-linux-amd64.tar.gz
    Resolving get.helm.sh (get.helm.sh)... 152.195.19.97, 2606:2800:11f:1cb7:261b:1f9c:2074:3c
    Connecting to get.helm.sh (get.helm.sh)|152.195.19.97|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 12378363 (12M) [application/x-tar]
    Saving to: ‘helm-v3.5.4-linux-amd64.tar.gz’
    
    helm-v3.5.4-linux-amd64.tar.gz                       100%[===================================================================================================================>]  11.80M  9.38MB/s    in 1.3s
    
    2021-06-21 14:08:50 (9.38 MB/s) - ‘helm-v3.5.4-linux-amd64.tar.gz’ saved [12378363/12378363]
    
    
    You should see this node in the output below
    It can take up to a mintue for node to show Ready status
    
    The connection to the server [net.wo.rk.ip]:6443 was refused - did you specify the right host or port?
    
    
    Script finished. Move to the next step
    
  • serewicz Posts: 899

    Hello,

    When I look at your errors, I note that there are a lot of "cp" not found messages, but your node name seems to be net.wo.rk.ip. The kubeadm.yaml file uses cp as the hostname; perhaps this is where the issue lies. Could you edit the kubeadm.yaml file and change the name: value, around line 15 of the file, to be your hostname? I think this is the mismatch.
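    The edit can be sketched like this. The fragment below is a made-up sample, not the course's actual kubeadm.yaml (the field layout follows kubeadm's InitConfiguration, and the line number may differ in your copy):

    ```shell
    # Sample fragment of a kubeadm.yaml; the real file has more fields.
    cat > kubeadm.sample.yaml <<'EOF'
    apiVersion: kubeadm.k8s.io/v1beta2
    kind: InitConfiguration
    nodeRegistration:
      name: cp
    EOF
    # Replace the hard-coded node name with this machine's hostname.
    sed -i "s/^  name: .*/  name: $(hostname)/" kubeadm.sample.yaml
    grep 'name:' kubeadm.sample.yaml
    ```

    Alternatively, renaming the VM to cp (and rebooting) avoids touching the file at all.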

    Regards,

  • serewicz Posts: 899

    Hello,

    Indeed. The script needs to be fixed. When the node name is cp, which is what I use for testing, it works. I will work to fix the script and update the course tarball ASAP.

    Regards,

  • serewicz Posts: 899

    Well... it's more than that. I think something updated in cri-o, and it is having knock-on issues. I can't get it to work regardless of the name now. This does happen with fast-moving open source projects; sometimes an oops gets pushed and it takes a bit to figure out. I'll keep grinding on it, but I wanted to give an update.
