Unable to go on with LFS258 Lab 3.1

luigicucciolillo (edited July 16 in LFS258 Class Forum)

Hi all,
I'm having big problems with Lab 3.1 while I'm setting up the cluster.
After a full reset:

sudo kubeadm reset
sudo kubeadm init

the output is:

Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config
Alternatively, if you are the root user, you can run:
  export KUBECONFIG=/etc/kubernetes/admin.conf
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join 192.168.193.156:6443 --token 

So I ran:

sudo su
root@luigispi:/home/luigi# export KUBECONFIG=/etc/kubernetes/admin.conf

and then continued with step 22:

luigi@luigispi:~$ find $HOME -name cilium-cni.yaml
/home/luigi/LFS258/SOLUTIONS/s_03/cilium-cni.yaml

So I applied it:

luigi@luigispi:~$ sudo kubectl apply -f /home/luigi/LFS258/SOLUTIONS/s_03/cilium-cni.yaml
error: error validating "/home/luigi/LFS258/SOLUTIONS/s_03/cilium-cni.yaml": error validating data: failed to download openapi: Get "https://k8scp:6443/openapi/v2?timeout=32s": dial tcp 192.168.193.156:6443: connect: connection refused; if you choose to ignore these errors, turn validation off with --validate=false

Hence I skipped validation for now:

luigi@luigispi:~$ sudo kubectl apply -f /home/luigi/LFS258/SOLUTIONS/s_03/cilium-cni.yaml --validate=false
error when retrieving current configuration of:
Resource: "/v1, Resource=serviceaccounts", GroupVersionKind: "/v1, Kind=ServiceAccount"
Name: "cilium", Namespace: "kube-system"
from server for: "/home/luigi/LFS258/SOLUTIONS/s_03/cilium-cni.yaml": Get "https://k8scp:6443/api/v1/namespaces/kube-system/serviceaccounts/cilium": dial tcp 192.168.193.156:6443: connect: connection refused

...

and there is no way to go on!

I tried to reset, to set everything up again, to check whether the apiserver is up, whether the IPs are OK... there is no firewall in the middle...

Any idea how to solve it?

For reference, checking the apiserver containers:

sudo crictl ps -a | grep kube-apiserver
WARN[0000] Config "/etc/crictl.yaml" does not exist, trying next: "/usr/bin/crictl.yaml" 
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead. 
WARN[0000] Image connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead. 
fc8e8e0f14e1b       edd0d4592f909       9 seconds ago        Running             kube-apiserver            3                   91ee85dcd862a       kube-apiserver-luigispi            kube-system
61f63e5faf3db       edd0d4592f909       About a minute ago   Exited              kube-apiserver            2                   dba265bd82779       kube-apiserver-luigispi            kube-system
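
In case it is useful, here is roughly the rest of what I checked on the node (just a sketch of my troubleshooting; the k8scp alias comes from /etc/hosts):

# does the control plane endpoint alias resolve to the node's current IP?
getent hosts k8scp
# is anything listening on the API server port?
sudo ss -tlnp | grep 6443
# is the kubelet running, and what is it logging?
sudo systemctl status kubelet
sudo journalctl -u kubelet --no-pager | tail -n 50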

Comments

  • chrispokorni

    Hi @luigicucciolillo,

    I highly recommend following the command syntax as closely as possible to the lab guide.
    After a full control plane reset, run the new init with all the options and flags presented in the lab guide. Resetting the kubeconfig is also essential after a new init.

    A concerning aspect of your setup is the IP address of the VM (possibly the control plane node), 192.168.x.y. Overlapping VM IP addresses with the Pod subnet will eventually cause routing issues within your cluster. There are two possible solutions (see the quick check after this list):
    1. Reconfigure the hypervisor DHCP server to use a different IP range for the VMs (for example 10.200.0.0/16) and validate, or edit if necessary, both kubeadm-config.yaml and cilium-cni.yaml so the pod subnet stays at 192.168.0.0/16.
    2. Leave the hypervisor DHCP as is on the 192.168.x.y/z range and validate, or edit if necessary, both kubeadm-config.yaml and cilium-cni.yaml so the pod subnet is set to 10.200.0.0/16.
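
    For example, a quick way to compare the node's address with the pod subnet configured in the two manifests (the paths are the ones from this thread and may differ on your system):

    ip -4 addr show | grep inet
    sudo grep podSubnet /root/kubeadm-config.yaml
    grep cluster-pool-ipv4-cidr /home/luigi/LFS258/SOLUTIONS/s_03/cilium-cni.yaml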

    Regards,
    -Chris

  • hi @chrispokorni ,
    thank you for answering.

    adjust subnets

    After applying the reset procedure below, I set the subnet you suggested (10.200.0.0/16) in both cilium-cni.yaml and kubeadm-config.yaml, but the result did not change :( .

    new kubeadm command

    Hence, inspired by a similar thread, I ran the reset procedure again and used the command:

    kubeadm init --pod-network-cidr 192.168.0.0/16 --node-name k8scp --upload-certs
    

    With that I was able to complete the cluster initialization and the lab :) .

    I'm attaching below the output collected during the troubleshooting; hopefully it will be useful for somebody else.

    If you (or somebody else) want to spend a couple of words on the reset procedure, it would be really appreciated.

    thank you for the support.

    see you next problem.

    Reset procedure

    sudo kubeadm reset --force
    sudo rm -rf /etc/cni/net.d
    sudo rm -rf /var/lib/etcd
    sudo rm -rf /var/lib/kubelet/*
    sudo rm -rf /etc/kubernetes/*
    rm -rf $HOME/.kube
    sudo rm -rf /etc/cni/net.d
    sudo systemctl stop kubelet
    sudo systemctl stop containerd
    sudo iptables -F
    sudo ipvsadm --clear
    sudo apt-get purge kubeadm kubectl kubelet kubernetes-cni kube*
    sudo apt-get autoremove
    sudo ctr containers list | awk '{print $1}' | xargs -r sudo ctr containers delete
    sudo ctr snapshots list | awk '{print $1}' | xargs -r sudo ctr snapshots remove
    sudo iptables -F && sudo iptables -X
    sudo iptables -t nat -F && sudo iptables -t nat -X
    sudo iptables -t raw -F && sudo iptables -t raw -X
    sudo iptables -t mangle -F && sudo iptables -t mangle -X
    sudo rm -rf /var/lib/docker /etc/docker /var/run/docker.sock
    sudo rm -rf /var/lib/containerd
    sudo rm -rf /run/containerd/containerd.sock
    sudo rm -rf /run/docker.sock
    rm -rf ~/.kube
    rm -rf ~/.docker
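
    For comparison, a much lighter reset would probably have been enough. This is just a sketch based on what kubeadm reset itself asks you to clean up manually, without purging the packages:

    sudo kubeadm reset --force
    sudo rm -rf /etc/cni/net.d
    rm -rf $HOME/.kube
    sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X
    sudo systemctl restart containerd kubelet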
    

    Output from adjusting the subnets

    kubeadm-config.yaml

    substitute "10.200.0.0/16" instead of "192.168.0.0/16"
    in /root/kubeadm-config.yaml

    apiVersion: kubeadm.k8s.io/v1beta4
    kind: ClusterConfiguration
    kubernetesVersion: 1.32.1 # <-- Use the word stable for newest version
    controlPlaneEndpoint: "k8scp:6443" #<-- Use the alias we put in /etc/hosts not >
    networking:
      podSubnet: 10.200.0.0/16 #192.168.0.0/16
    

    cilium-cni.yaml

    substitute "10.200.0.0/16" instead of "192.168.0.0/16"
    in /home/luigi/LFS258/SOLUTIONS/s_03/cilium-cni.yaml

      hubble-tls-key-file: /var/lib/cilium/tls/hubble/server.key
      hubble-tls-client-ca-files: /var/lib/cilium/tls/hubble/client-ca.crt
      ipam: "cluster-pool"
      ipam-cilium-node-update-rate: "15s"
      cluster-pool-ipv4-cidr: "10.200.0.0/16" #instead of "192.168.0.0/16"  
      cluster-pool-ipv4-mask-size: "24"
      egress-gateway-reconciliation-trigger-interval: "1s"
      enable-vtep: "false"
    
  • chrispokorni (edited July 17)

    Hi @luigicucciolillo,

    You seem to be using conflicting configuration details in your commands.

    What course are you enrolled in? Is it LFS258 Kubernetes Fundamentals or LFD259 Kubernetes for Developers?

    Also, what hypervisor are you using? How many network interfaces are set per VM (and what types)? Is all inbound traffic allowed to the VMs by the hypervisor firewall (promiscuous mode enabled)?

    Regards,
    -Chris

  • luigicucciolillo (edited July 18)

    Good morning @chrispokorni ,

    I am currently enrolled in LFS258 Kubernetes Fundamentals.

    I'm running everything on a Raspberry Pi 5 (16 GB RAM + 512 GB SSD) with Ubuntu on it.

    Thank you for the support

  • chrispokorni

    Hi @luigicucciolillo,

    The content has not been tested on a Raspberry Pi, so I cannot comment on any issues or limitations that may be related to the hardware. The content was tested on cloud VMs - Google Cloud GCE, AWS EC2, Azure VM, DigitalOcean Droplet; and local VMs provisioned through VirtualBox, QEMU/KVM, VMware Workstation/Player. Each system (server, VM, etc...) that acts as a Kubernetes node needs 2 CPUs, 8 GB RAM, and 20+ GB of disk. While you may be able to work with less RAM (4 or 6 GB), this may slow down your cluster.
    However, I highly recommend following the installation and configuration instructions from the LFS258 lab guide. From your notes it seems you are mixing in external content from other sources, leading to an inconsistent environment.
    More specifically:

    • After setting the desired pod subnet (10.200.0.0/16) in the kubeadm-config.yaml manifest, use the init command as it is shown in the lab guide, step 20 (see the sketch after this list). The way you customized the command defeats the purpose of the earlier prep step and of the subsequent configuration.
    • Make sure that k8scp is only configured as an alias for the control plane node, not the actual hostname.
    • Edit the cilium-cni.yaml manifest to use the same pod CIDR (10.200.0.0/16) set earlier for the init phase.
    • Mixing in steps from other threads with the instructions from the lab guide may introduce conflicting configuration into your setup, which is much more difficult to troubleshoot.
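
    As an illustration only, the step 20 init looks roughly like this (check the exact flags against your version of the lab guide; kubeadm-config.yaml is the manifest you prepared with the 10.200.0.0/16 pod subnet):

    sudo kubeadm init --config=kubeadm-config.yaml --upload-certs | tee kubeadm-init.out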

    Regards,
    -Chris

  • Hi @chrispokorni ,
    A summary of today's troubleshooting while trying to grow the cluster (Lab 3.2):
    when powering the machine back on, I found:

    • kubelet down
    • API server down
    • etcd down

    After some troubleshooting, I found that swap was on.
    So I ran swapoff -a again, and now everything (etcd/API server/kubelet) is up and ready to work. (See the note just below on keeping swap off across reboots.)
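
    Note to self on keeping swap disabled: swapoff -a only lasts until the next reboot, so the swap entry probably also needs to be disabled persistently. A sketch, assuming the swap line lives in /etc/fstab on this Ubuntu image:

    sudo swapoff -a
    # comment out any swap entry so it does not come back at boot
    sudo sed -i '/ swap / s/^/#/' /etc/fstab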

    I have been learning from and reading about all the commands used during that troubleshooting.
    I will continue with the lab tomorrow.


    Some of the outputs from my checks follow:

    hostname

    root@luigispi:~# hostname
    luigispi
    

    Node up

    luigi@luigispi:~$ sudo kubectl get nodes
    NAME    STATUS   ROLES           AGE   VERSION
    k8scp   Ready    control-plane   26h   v1.32.1
    

    checking yaml files

    cilium-cni.yaml

    [...]
      ipam: "cluster-pool"
      ipam-cilium-node-update-rate: "15s"
      cluster-pool-ipv4-cidr: "10.200.0.0/16" #instead of "192.168.0.0/16"
    [...]
    

    kubeadm-config.yaml

    root@luigispi:~# cat kubeadm-config.yaml 
    apiVersion: kubeadm.k8s.io/v1beta4
    kind: ClusterConfiguration
    kubernetesVersion: 1.32.1
    controlPlaneEndpoint: "k8scp:6443"
    networking:
      podSubnet: 10.200.0.0/16 # instead of 192.168.0.0/16
    

    Conclusions

    This troubleshooting is definitely interesting and improves knowledge retention and understanding,
    but I'm not sure I will continue with the single-node (same physical machine) cluster;
    it takes a lot of time to troubleshoot.


    What do you suggest?
    Is it better to follow the labs and move to GCE?

    Thank you for the support.

  • chrispokorni

    Hi @luigicucciolillo,

    The swapoff -a can be found in step 7 of the lab guide. Please carefully follow the exercises as they are presented in the guide.

    On a different note, I still see k8scp used as a node name, instead of an alias. Running kubectl with sudo (or as root) is not recommended, and it is not how it is presented in the lab guide.

    What are the outputs of:

    kubectl get nodes -o wide
    kubectl get pods -A -o wide
    

    Regards,
    -Chris

  • Hi @chrispokorni ,
    here are the results of today's troubleshooting, done with the help of ChatGPT (I will probably move to Claude soon).

    When powering on the machine, etcd was down again :s

    So I set up a three-terminal view (outputs attached below):

    • terminal 1: kubectl get pods -A -o wide
    • terminal 2: crictl events
    • terminal 3: sudo lsof -i :6443

    I saw the etcd container going up and down,
    SYN_SENT on port 6443, and a timeout during dial tcp 192.168.105.156:6443.

    So I started a post-mortem analysis: etcd had exited, containerd garbage-collected the container ( :o ), and journalctl reported a CrashLoopBackOff.

    Then, diving into the etcd manifest, I found:

    [...]
        - --initial-advertise-peer-urls=https://192.168.105.156:2380
        - --initial-cluster=k8scp=https://192.168.105.156:2380
        - --key-file=/etc/kubernetes/pki/etcd/server.key
        - --listen-client-urls=https://127.0.0.1:2379,https://192.168.105.156:2379
        - --listen-metrics-urls=http://127.0.0.1:2381
        - --listen-peer-urls=https://192.168.105.156:2380
    [...]
    

    but, ip a | grep inet returned:

    [...]
        inet 192.168.237.156/24 brd 192.168.237.255 scope global dynamic noprefixroute wlan0
    [...]
    

    Hence it seems I have problems with the IP addressing of the whole thing...
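
    A rough sketch of checks that should confirm this, comparing the addresses baked into the static manifests and the API server certificate with the live one:

    # addresses hard-coded in the static pod manifests
    sudo grep -rn 'https://192.168' /etc/kubernetes/manifests/
    # current address on the node
    ip -4 addr show wlan0 | grep inet
    # the API server certificate also embeds the address that was used at init time
    sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 'Subject Alternative Name'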

    Moreover, I started the cluster (see above) with:

    kubeadm init --pod-network-cidr 192.168.0.0/16 --node-name k8scp --upload-certs
    

    I'm not sure about that,
    but my opinion is that etcd is looking for 192.168.105.156, which no longer answers because the node got a new IP.
    Am I right?
    Am I missing something else?


    In the attached file 'prompt-out-commands.txt' you can find all the output of the following commands:

    • for the three-terminal view:
    kubectl get pods -A -o wide
    sudo crictl events
    sudo lsof -i :6443
    
    • for the post-mortem analysis:
    sudo crictl ps -a | grep etcd
    sudo crictl logs 45d40286716af
    sudo journalctl -u kubelet -xe | grep etcd
    sudo cat /etc/kubernetes/manifests/etcd.yaml
    ip a | grep inet
    
  • chrispokorni

    Hi @luigicucciolillo,

    Consistency in the control plane endpoint is key in a Kubernetes cluster. A static IP address assigned to the instances acting as Kubernetes nodes (see the sketch below) and/or a correctly applied init command should fix some of the issues encountered.
    So far the init commands applied seem to be consistently incorrect - the IP address is not the one recommended to you earlier, the node name is not the one from the lab guide (but the alias, which should be just that - an alias), and the kubeadm-config.yaml manifest is absent. This consistently yields improperly configured Kubernetes control plane components.
    I would recommend starting from scratch, following the lab guide and applying the earlier suggestions and recommendations provided to you in this thread.
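
    Purely as an illustration of a static address (the interface name, addresses, and gateway below are placeholders, and a Wi-Fi interface would need a wifis: section with the access point credentials instead of ethernets:), a netplan configuration on Ubuntu could look roughly like this:

    # /etc/netplan/01-static.yaml - apply with: sudo netplan apply
    network:
      version: 2
      ethernets:
        eth0:
          dhcp4: false
          addresses: [192.168.237.156/24]
          routes:
            - to: default
              via: 192.168.237.1
          nameservers:
            addresses: [1.1.1.1, 8.8.8.8]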

    The purpose of lab 3 is to help the learner effortlessly initialize a working Kubernetes cluster. For this course, please avoid custom approaches to cluster initialization, as their troubleshooting is out of scope.

    Regards,
    -Chris

  • Hi @chrispokorni,
    Sorry for the late answer...
    Thank you for the support;
    I definitely got your point about redoing everything.
    A static IP and the correct init procedure would probably be enough to start the cluster.

    But I have just decided to postpone the single-node implementation.
    The troubleshooting and the "side" understanding are really interesting,
    but they will need a lot more time,
    and I'm currently running a bit short of it.

    To close the post,
    I also found online a guide about a single-node Raspberry Pi implementation;
    hoping it will be useful for somebody: link here.

    See you next problem!
    take care...

  • chrispokorni

    Hi @luigicucciolillo,

    Keep in mind that the course was designed for a multi-node Kubernetes cluster to showcase the complexity of the networking, and the implementation of load-balancing and high-availability for application and control plane fault tolerance in a distributed multi-system environment.

    Regards,
    -Chris

  • Hi @chrispokorni,
    All good here.
    The VMs are up and nginx is running happily :)
    I have been able to complete all the labs from chapter 3.
    Just one remark: I was not able to access the control plane's nginx from outside (point 6 of Lab 3.5).

    Instead, I was able to curl it from localhost and from the worker node (using the control plane node's internal IP and the NodePort).

    Having a look on the web, it seems that the GCP firewall is blocking the request,

    so I tried a port-forward, but no luck.

    Any suggestions?

    curl from control-plane

    lc@control-plane-1-lfs258-n2:~$ curl localhost:32341
    <!DOCTYPE html>
    <html>
    <head>
    <title>Welcome to nginx!</title>
    <style>
    html { color-scheme: light dark; }
    body { width: 35em; margin: 0 auto;
    font-family: Tahoma, Verdana, Arial, sans-serif; }
    </style>
    </head>
    <body>
    <h1>Welcome to nginx!</h1>
    <p>If you see this page, the nginx web server is successfully installed and
    working. Further configuration is required.</p>
    
    <p>For online documentation and support please refer to
    <a href="http://nginx.org/">nginx.org</a>.<br/>
    Commercial support is available at
    <a href="http://nginx.com/">nginx.com</a>.</p>
    
    <p><em>Thank you for using nginx.</em></p>
    </body>
    </html>
    

    curl from worker node

    lc@wn1-lfs258:~$ curl 10.198.0.2:32341
    <!DOCTYPE html>
    <html>
    <head>
    <title>Welcome to nginx!</title>
    <style>
    html { color-scheme: light dark; }
    body { width: 35em; margin: 0 auto;
    font-family: Tahoma, Verdana, Arial, sans-serif; }
    </style>
    </head>
    <body>
    <h1>Welcome to nginx!</h1>
    <p>If you see this page, the nginx web server is successfully installed and
    working. Further configuration is required.</p>
    
    <p>For online documentation and support please refer to
    <a href="http://nginx.org/">nginx.org</a>.<br/>
    Commercial support is available at
    <a href="http://nginx.com/">nginx.com</a>.</p>
    
    <p><em>Thank you for using nginx.</em></p>
    </body>
    

    port forwarding

    kubectl port-forward service/nginx 32341:80
    Forwarding from 127.0.0.1:32341 -> 80
    Forwarding from [::1]:32341 -> 80
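
    (A side note on the port-forward attempt: kubectl port-forward binds to 127.0.0.1 by default, so it is only reachable from the node itself. If I understand the flag correctly, binding on all interfaces would look like the line below, although traffic from outside would still have to be allowed by the VPC firewall.)

    kubectl port-forward --address 0.0.0.0 service/nginx 32341:80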
    
  • chrispokorni

    Hi @luigicucciolillo,

    For a simple VPC firewall configuration I recommend watching the demo video for GCE from the "Course Introduction" chapter.
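
    As a rough illustration only (the rule name and source range below are placeholders; the video shows the exact setup), a VPC rule opening the NodePort range can be created with something like:

    gcloud compute firewall-rules create allow-nodeports --network=default --allow=tcp:30000-32767 --source-ranges=0.0.0.0/0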

    Regards,
    -Chris
