Welcome to the Linux Foundation Forum!

Issues with running k8sMaster.sh: Swap & Time Out

First, thanks for this excellent course! I had made it through most of unit 4 before the recent update. I understand that these updates need to happen, and I appreciate that we're being prepared for the latest version of the exam. That said, I'm running into some issues getting started on new VMs with the new k8sMaster.sh script.

Setup

This is the setup that was working fine with the V2021-01-26 materials. (The only maintenance I needed to do was to run sudo ntpdate time.nist.gov after restarting the VMs and occasionally rerun the k8sMaster.sh / k8sSecond.sh commands on the respective VMs.)

  • 2 Ubuntu VMs running locally via VirtualBox on Windows 10 Business, v21H1, 10.0.19043, 32 GB RAM, i9-10885H CPU @ 2.40GHz
    • 2 GB RAM, 2 processor cores, 25 GB virtual disk image, Network: Attached to Bridged Adapter to Ethernet port
    • OS installed via ubuntu-18.04.5-live-server-amd64.iso (Date modified: 2021-03-20 2:58 PM)

I'm having the following issues when running the V2021-05-26 version of the setup scripts:
1. This one might be a recommendation for updating the scripts. When running bash k8sMaster.sh | tee $HOME/master.out, there's an error about swap not being disabled. So now, as a workaround, each time I re-attempt on a fresh VM, I disable swap before running the scripts. (I either run sudo swapoff -a or edit /etc/fstab and reboot.)

[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR Swap]: running with swap on is not supported. Please disable swap
2. This is where my current support question lies. There is a timeout in the [kubelet-check] step (after the sudo kubeadm init --config=$(find / -name kubeadm.yaml 2>/dev/null ) command). See the pertinent output below. I'll also include samples of the output of the debugging steps recommended by the error message.
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.

        Unfortunately, an error has occurred:
                timed out waiting for the condition

        This error is likely caused by:
                - The kubelet is not running
                - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

        If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
                - 'systemctl status kubelet'
                - 'journalctl -xeu kubelet'

        Additionally, a control plane component may have crashed or exited when started by the container runtime.
        To troubleshoot, list all containers using your preferred container runtimes CLI.

        Here is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:
                - 'crictl --runtime-endpoint unix:///var/run/crio/crio.sock ps -a | grep kube | grep -v pause'
                Once you have found the failing container, you can inspect its logs with:
                - 'crictl --runtime-endpoint unix:///var/run/crio/crio.sock logs CONTAINERID'

error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher

The outputs of the recommended debugging steps are:

master@master:~$ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Sun 2021-05-30 20:18:49 UTC; 5h 21min ago
     Docs: https://kubernetes.io/docs/home/
 Main PID: 20472 (kubelet)
    Tasks: 15 (limit: 2316)
   CGroup: /system.slice/kubelet.service
           └─20472 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --co
May 31 01:40:44 master kubelet[20472]: E0531 01:40:44.774526   20472 kuberuntime_sandbox.go:68] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to mount container k8s_POD_kube-s
May 31 01:40:44 master kubelet[20472]: E0531 01:40:44.774546   20472 kuberuntime_manager.go:790] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to mount container k8s_POD_kube-s
May 31 01:40:44 master kubelet[20472]: E0531 01:40:44.774591   20472 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-scheduler-master_kube-system(22ea193343aa28
May 31 01:40:44 master kubelet[20472]: E0531 01:40:44.808612   20472 kubelet.go:2291] "Error getting node" err="node \"master\" not found"
May 31 01:40:44 master kubelet[20472]: E0531 01:40:44.908942   20472 kubelet.go:2291] "Error getting node" err="node \"master\" not found"
# 5 similar lines omitted

and

master@master:~$ journalctl -xeu kubelet
# (1001 lines, but here's a sample)
May 31 02:20:52 master kubelet[20472]: E0531 02:20:52.945247   20472 kubelet.go:2291] "Error getting node" err="node \"master\" not found"
May 31 02:20:52 master kubelet[20472]: E0531 02:20:52.985422   20472 eviction_manager.go:255] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"master\" not found"
May 31 02:20:53 master kubelet[20472]: E0531 02:20:53.045960   20472 kubelet.go:2291] "Error getting node" err="node \"master\" not found"

and

master@master:~$ sudo crictl --runtime-endpoint unix:///var/run/crio/crio.sock ps -a | grep kube | grep -v pause
[sudo] password for master:
# [no output]

Running the command with the verbosity flag (sudo kubeadm init --config=$(find ~ -name kubeadm.yaml 2>/dev/null ) --v=5) doesn't add any more useful information.
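For what it's worth, the swap workaround from item 1 can be made persistent so it survives reboots. A minimal sketch, demonstrated on a sample file rather than the real /etc/fstab (the UUID and swap line below are made up for illustration):

```shell
# Comment out any swap entries so swap stays off across reboots.
# Shown here on a sample copy; on the real VM, run the sed against
# /etc/fstab (with sudo) and also run `sudo swapoff -a` for the current boot.
printf '%s\n' \
  'UUID=abcd-1234 / ext4 defaults 0 1' \
  '/swap.img none swap sw 0 0' > fstab.sample
sed -i '/\sswap\s/s/^/#/' fstab.sample
grep swap fstab.sample   # the swap line is now commented out
```

With the entry commented out, a reboot no longer re-enables swap, so the preflight check should pass without rerunning swapoff each time.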

Comments

  • serewicz Posts: 899

    Hello,

    Please refer to the Exercise 2.1 Overview and Preliminaries section. The system requirements are explained both in the text and in the summary, including the need to disable swap. With far less memory than required (you wrote 2 GB when the course requires 8 GB), you will encounter many odd errors, including pods not starting.

    The error about master not being found: is that the actual hostname of your VM? Are you using the updated tarball? If you use the updated script, then a previous step will create an alias, k8scp, in /etc/hosts, and that becomes the name used for the control plane node. Please check that you are using the updated files.
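    A quick way to sanity-check that alias, sketched on a sample file (the 10.0.0.10 address is made up; substitute your control plane node's actual IP):

    ```shell
    # Stand-in for /etc/hosts: the control-plane alias must map to the
    # control plane node's primary IP (compare against `hostname -I` on the VM).
    printf '%s\n' '127.0.0.1 localhost' '10.0.0.10 k8scp' > hosts.sample
    grep -w k8scp hosts.sample   # on the real VM: grep -w k8scp /etc/hosts
    ```

    If the grep comes back empty or shows a stale IP, kubeadm will not be able to reach the name it was configured with.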

    Regards,

  • trenchguinea Posts: 4

    I'm having this same problem. Thinking it might be part of the issue, I also added another alias to /etc/hosts called "master" (since the script added k8smaster and the pre-flight messages were complaining about not finding a "master" node). The pre-flight warnings went away but the problem did not.

    Each VM has 2 CPU assigned with 6 GB each (host device isn't large enough for VMs to go bigger).

    I've verified my two VMs can ping / ssh each other. No firewall running on the host or on either VM. The errors in the journal are filled with the same "failed to mount" errors the OP had above as well as "connection refused" K8S API calls on port 6443.

    I have 1 network adapter for each VM in bridged mode, and all nodes have 10.0.0.x addresses.

    Anyone have ideas on where else to look?

  • chrispokorni Posts: 1,032

    Hi @trenchguinea,

    Did you enable promiscuous mode and allow all traffic for the bridged adapters? Also, what is the size of your host (CPU and MEM)?

    Regards,
    -Chris

  • serewicz Posts: 899

    Hello,

    I was also wondering: what are the details of the OS you are using? There may be a version of Ubuntu 18.04 that has an issue with the default kernel, but I am still trying to figure out which version(s) have the issue.

    Regards,

  • trenchguinea Posts: 4
    edited June 9

    Yes, both bridge adapters (master and worker) are set to promiscuous and all traffic allowed. Host OS is Mac OS X 11.2.3 (Big Sur), and VMs are running Ubuntu 18.04.5. Host system has 8 core CPU w/ 16 GB RAM. Each VM was given 2 core, 6 GB, and 40 GB storage. Available RAM at the time of running kubeadm is 4 GB.

  • chrispokorni Posts: 1,032

    Hi @trenchguinea,

    Although resources are a bit tight in your setup, I would expect all components to work with the assigned amount of resources.

    There have been issues reported on Big Sur's support for VirtualBox VMs, but I am not aware whether they have been fixed. Assuming that the guest additions have installed properly, and the host's firewall allows all traffic to your guest VMs, it still seems to be a networking issue. Ping shows you that the VMs can see each other, while ssh shows that port 22 is listening. Your "connection refused" on port 6443 shows that traffic to that port may still be blocked.

    Regards,
    -Chris

  • Rather than traffic to the port being blocked, I think the issue might be that the service that's supposed to be listening on port 6443 isn't running. Looking at the journal logs, I'm guessing that service is the apiserver container. If that container doesn't start, that would also explain all of these connection refused errors. But then the question becomes: why wouldn't that container start?

    'systemctl status crio' shows the service is running.

  • serewicz Posts: 899
    edited June 10

    Hello,

    Indeed, I suspect that your kubeadm init did not complete. It is a resource-intensive process. A few things to try:

    1. Put all resources into the control plane node so it has 2 CPUs and 8 GB of memory, or more, and see if kubeadm init works. If it does, we know it was just a resource issue.

    2. Build the cluster using Docker. If it works using Docker we can rule out issues with networking, but it could still be resource related.

    3. Double check that your IP and /etc/hosts entries are correct.

    4. Ensure there is only one interface on each VM. Multiple interfaces can cause issues and require other settings.

    Hopefully these tests allow us to find the issue you are experiencing.

    Regards,

  • I increased the RAM to 10 GB, but it still didn't work, so I set it back down to 6 GB.

    I then switched the --cri-socket to /var/run/dockershim.sock (after changing Docker to use systemd for the cgroup driver), and kubeadm was able to finish successfully. So I guess I'll continue with the labs using Docker. Will that cause issues downstream, or will I be able to complete the labs without issue in this setup?
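    For anyone else making the same switch, the cgroup-driver change mentioned above is typically done through Docker's daemon.json. A hedged sketch (the file is written locally here; the /etc/docker path and restart step are the usual ones but verify against your setup):

    ```shell
    # Tell Docker to use the systemd cgroup driver, matching the kubelet's
    # expectation on a systemd-based distro like Ubuntu 18.04.
    cat > daemon.json <<'EOF'
    {
      "exec-opts": ["native.cgroupdriver=systemd"]
    }
    EOF
    # On the VM: sudo cp daemon.json /etc/docker/daemon.json
    #            sudo systemctl restart docker
    grep cgroupdriver daemon.json
    ```

    After the restart, `docker info | grep -i cgroup` should report systemd before you rerun kubeadm init.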

    I wonder if the networking issue was related to some documentation I read saying that promiscuous mode doesn't work with all wireless adapters, and a wireless adapter is what I'm using.

  • serewicz Posts: 899

    Hello,

    Docker works fine and is better supported by some tools. The community is moving away from Docker, however, which is why the class has recently moved to cri-o. If you look in the course tarball, I have left in some of the Docker-based scripts.

    Production clusters would rarely use wireless, which could be the cause; some drivers may not be fully integrated with calls from cri-o yet.

    Regards,

  • leifsegen Posts: 7
    edited June 21

    @serewicz Thanks for pointing out my RAM error. When using 8 GB RAM, 2 CPUs, a 25 GB virtual hard drive, Ubuntu 18.04.5, and a bridged network on a fresh VM named 'cp', manually turning off swap first, and then using the new V2021-06-15 k8scp.sh script, I'm still getting the timeout error:

    [wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
    [kubelet-check] Initial timeout of 40s passed.
    
            Unfortunately, an error has occurred:
                    timed out waiting for the condition
    
            This error is likely caused by:
                    - The kubelet is not running
                    - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
            If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
                    - 'systemctl status kubelet'
                    - 'journalctl -xeu kubelet'
    
            Additionally, a control plane component may have crashed or exited when started by the container runtime.
            To troubleshoot, list all containers using your preferred container runtimes CLI.
    
            Here is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:
                    - 'crictl --runtime-endpoint unix:///var/run/crio/crio.sock ps -a | grep kube | grep -v pause'
                    Once you have found the failing container, you can inspect its logs with:
                    - 'crictl --runtime-endpoint unix:///var/run/crio/crio.sock logs CONTAINERID'
    
    error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
    To see the stack trace of this error execute with --v=5 or higher
    

    Output of systemctl status kubelet:

    ● kubelet.service - kubelet: The Kubernetes Node Agent
       Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
      Drop-In: /etc/systemd/system/kubelet.service.d
               └─10-kubeadm.conf
       Active: active (running) since Sun 2021-06-20 16:17:07 UTC; 20h ago
         Docs: https://kubernetes.io/docs/home/
     Main PID: 6587 (kubelet)
        Tasks: 16 (limit: 4915)
       CGroup: /system.slice/kubelet.service
               └─6587 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remo
    Jun 21 13:04:59 cp kubelet[6587]: E0621 13:04:59.745856    6587 kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    [duplicates omitted]
    

    A sampling of the output of journalctl -xeu kubelet includes:

    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"                                                                                
    remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to mount container k8s_POD_kube-a
    kuberuntime_sandbox.go:68] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to mount container k8s_POD_kube-apiser
    kuberuntime_manager.go:790] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to mount container k8s_POD_kube-apiser
    pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-apiserver-cp_kube-system(9976d2d1b70978ac90aee44
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    certificate_manager.go:437] Failed while requesting a signed certificate from the master: cannot create certificate signing request: Post "https:
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://[net.wo.rk.ip]:6443/apis/coordination.k8s.io/v1/namespaces/
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    kubelet_node_status.go:71] "Attempting to register node" node="cp"
    kubelet_node_status.go:93] "Unable to register node with API server" err="Post \"https://[net.wo.rk.ip]:6443/api/v1/nodes\": dial tcp [net.wo.rk.ip]:
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"cp.168a569ad02f168c"
    event.go:218] Unable to write event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"cp.168a569af27001cc",
    event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"cp.168a569ad02f1eb1"
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"                          
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://[net.wo.rk.ip]:6443/apis/coordination.k8s.io/v1/namespaces/
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    kubelet_node_status.go:71] "Attempting to register node" node="cp"
    kubelet_node_status.go:93] "Unable to register node with API server" err="Post \"https://[net.wo.rk.ip]:6443/api/v1/nodes\": dial tcp [net.wo.rk.ip]:
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"                                                                                 
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    eviction_manager.go:255] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"cp\" not found"                    
    kubelet.go:2291] "Error getting node" err="node \"cp\" not found"
    

    If these VirtualBox setups just aren't fully supported anymore, could you let us know? I'll try to move forward with AWS or GCP in that case.

  • leifsegen Posts: 7

    In case this applies, about the same time that [wait-control-plane] line starts, the main VM window starts displaying this:

  • leifsegen Posts: 7

    And I should probably include these lines from the end of the k8scp.sh output - specifically lines 1 and 26, which show the impact of the timeout error:

    error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
    To see the stack trace of this error execute with --v=5 or higher
    Running the steps explained at the end of the init output for you
    Apply Calico network plugin from ProjectCalico.org
    If you see an error they may have updated the yaml file
    Use a browser, navigate to the site and find the updated file
    The connection to the server [net.wo.rk.ip]:6443 was refused - did you specify the right host or port?
    
    alias sudo=sudo
    alias docker=podman
    --2021-06-21 14:08:48--  https://get.helm.sh/helm-v3.5.4-linux-amd64.tar.gz
    Resolving get.helm.sh (get.helm.sh)... 152.195.19.97, 2606:2800:11f:1cb7:261b:1f9c:2074:3c
    Connecting to get.helm.sh (get.helm.sh)|152.195.19.97|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 12378363 (12M) [application/x-tar]
    Saving to: ‘helm-v3.5.4-linux-amd64.tar.gz’
    
    helm-v3.5.4-linux-amd64.tar.gz                       100%[===================================================================================================================>]  11.80M  9.38MB/s    in 1.3s
    
    2021-06-21 14:08:50 (9.38 MB/s) - ‘helm-v3.5.4-linux-amd64.tar.gz’ saved [12378363/12378363]
    
    
    You should see this node in the output below
    It can take up to a mintue for node to show Ready status
    
    The connection to the server [net.wo.rk.ip]:6443 was refused - did you specify the right host or port?
    
    
    Script finished. Move to the next step
    
  • serewicz Posts: 899

    Hello,

    When I look at your errors, I note that there are a lot of "cp" not found messages, but your node name seems to be net.wo.rk.ip. The kubeadm.yaml file uses cp as the hostname; perhaps this is where the issue lies. Could you edit the kubeadm.yaml file and change the name: value, around line 15 of the file, to be your hostname? I think this is the mismatch.
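    The edit can be sketched like this. The fragment below is a made-up sample, not the course's actual kubeadm.yaml (the field layout follows kubeadm's InitConfiguration, and the line number may differ in your copy):

    ```shell
    # Sample fragment of a kubeadm.yaml; the real file has more fields.
    cat > kubeadm.sample.yaml <<'EOF'
    apiVersion: kubeadm.k8s.io/v1beta2
    kind: InitConfiguration
    nodeRegistration:
      name: cp
    EOF
    # Replace the hard-coded node name with this machine's hostname.
    sed -i "s/^  name: .*/  name: $(hostname)/" kubeadm.sample.yaml
    grep 'name:' kubeadm.sample.yaml
    ```

    Alternatively, renaming the VM to cp (and rebooting) avoids touching the file at all.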

    Regards,

  • serewicz Posts: 899

    Hello,

    Indeed. The script needs to be fixed. When the node name is cp, which is what I use for testing, it works. I will work to fix the script and update the course tarball ASAP.

    Regards,

  • serewicz Posts: 899

    Well... it's more than that. I think something updated in cri-o, and it is having knock-on issues. I can't get it to work regardless of the name now. This does happen with fast-moving open source projects; sometimes an oops gets pushed and it takes a bit to figure out. I'll keep grinding on it, but I wanted to give an update.
