Lab 4.1 step 16: Cluster Upgrade is successful but Control Plane version is still at 1.28.1

trevski · April 2024

Help! My cluster upgrade and kubelet upgrade completed successfully on the cp, but kubectl get node shows the cp is still at the old version.

The output is shown below. What have I missed?

[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.29.1". Enjoy!

[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.
billy@cp:~$ kubectl get node
NAME              STATUS                     ROLES           AGE   VERSION
ip-172-31-32-81   Ready                      <none>          16d   v1.28.1
ip-172-31-33-91   Ready,SchedulingDisabled   control-plane   21d   v1.28.1
billy@cp:~$ sudo apt-mark unhold kubelet kubectl
Canceled hold on kubelet.
Canceled hold on kubectl.
billy@cp:~$ sudo apt-get install -y kubelet=1.29.1-1.1 kubectl=1.29.1-1.1
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be upgraded:
kubectl kubelet
2 upgraded, 0 newly installed, 0 to remove and 25 not upgraded.
Need to get 30.3 MB of archives.
After this operation, 889 kB of additional disk space will be used.
Get:1 https://prod-cdn.packages.k8s.io/repositories/isv:/kubernetes:/core:/stable:/v1.29/deb kubectl 1.29.1-1.1 [10.5 MB]
Get:2 https://prod-cdn.packages.k8s.io/repositories/isv:/kubernetes:/core:/stable:/v1.29/deb kubelet 1.29.1-1.1 [19.8 MB]
Fetched 30.3 MB in 1s (48.7 MB/s)
(Reading database ... 90923 files and directories currently installed.)
Preparing to unpack .../kubectl_1.29.1-1.1_amd64.deb ...
Unpacking kubectl (1.29.1-1.1) over (1.28.1-1.1) ...
Preparing to unpack .../kubelet_1.29.1-1.1_amd64.deb ...
Unpacking kubelet (1.29.1-1.1) over (1.28.1-1.1) ...
Setting up kubectl (1.29.1-1.1) ...
Setting up kubelet (1.29.1-1.1) ...
billy@cp:~$ sudo apt-mark hold kubelet kubectl
kubelet set on hold.
kubectl set on hold.
billy@cp:~$ sudo systemctl daemon-reload
billy@cp:~$ sudo systemctl restart kubelet
billy@cp:~$ kubectl get node
NAME              STATUS                     ROLES           AGE   VERSION
ip-172-31-32-81   Ready                      <none>          16d   v1.28.1
ip-172-31-33-91   Ready,SchedulingDisabled   control-plane   21d   v1.28.1
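
One way to narrow this down is to compare what the API server reports with what is installed and running on the host. A minimal sketch, assuming bash and awk are available; the helper name nodes_not_at_version is hypothetical and not part of the lab guide:

```shell
# Hypothetical helper: list nodes whose reported kubelet version differs from
# the target. Reads `kubectl get node --no-headers` output on stdin; the
# VERSION value is the last column of each row.
nodes_not_at_version() {
  local target="$1"
  awk -v target="$target" 'NF > 0 && $NF != target { print $1 }'
}

# Usage on a live cluster (not executed here):
#   kubectl get node --no-headers | nodes_not_at_version v1.29.1
#   kubelet --version                                # version of the installed binary
#   systemctl show kubelet -p ActiveEnterTimestamp   # when the service last started
```

If kubelet --version already prints v1.29.1 while the node still reports v1.28.1, the running kubelet has likely not posted its status to the API server since the restart, and its logs (journalctl -u kubelet) are the next place to look.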

trevski · April 2024

In the end, rebooting the cp host resolved the issues. The cilium pods started successfully after the reboot, and I was able to drain the cp node and upgrade it.

trevski · April 2024

Thanks Chris. I resolved the issues by rebooting the node. I did post a message about that yesterday, but it seems to have got lost somewhere. Thanks for your help!

chrispokorni · April 2024

Hi @trevski,

It appears the apt package list is not configured with the correct URL. Please revisit steps 2 and 3 to add the correct version repository definition and gpg key, and then validate that the entry in the /etc/apt/sources.list.d/kubernetes.list file matches the one from the lab guide. Validate this both before and after the apt-get update in step 4.
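
A quick way to run that check, as a rough sketch only: the helper name repo_minor_version is hypothetical, and it assumes the repository URL contains a "stable:/vX.Y" segment like the one visible in the apt output earlier in this thread.

```shell
# Hypothetical helper: print the Kubernetes minor version (e.g. v1.29) that an
# apt source file points at. Commented-out entries are ignored.
# Assumption: the URL contains ".../stable:/vX.Y/deb...".
repo_minor_version() {
  local file="${1:-/etc/apt/sources.list.d/kubernetes.list}"
  grep -v '^[[:space:]]*#' "$file" \
    | grep -oE 'stable:/v[0-9]+\.[0-9]+' \
    | sed 's|.*/||' \
    | sort -u
}

# Usage on the cp node (not executed here):
#   repo_minor_version          # should print the minor version being upgraded to
#   sudo apt-get update
#   apt-cache policy kubelet    # shows which candidate version apt now sees
```

More than one line of output would mean the file mixes repositories for different minor versions.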

Regards,
-Chris

trevski · April 2024

Hi @chrispokorni ,

Thanks for your suggestion. I've attempted to redo all the steps, but now I hit problems, I think because the cp is already cordoned from before. The drain step just sits there forever:

kubectl drain ip-172-31-33-91 --ignore-daemonsets
node/ip-172-31-33-91 already cordoned
Warning: ignoring DaemonSet-managed Pods: kube-system/cilium-fxlmh, kube-system/kube-proxy-2jzv9
evicting pod kube-system/cilium-operator-788c4f69bc-vrgq7

To get out of this I pressed Ctrl-C; however, now both the upgrade plan and the upgrade itself fail:

[upgrade/health] FATAL: [preflight] Some fatal errors occurred:
[ERROR ControlPlaneNodesReady]: there are NotReady control-planes in the cluster: [ip-172-31-33-91]
[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
To see the stack trace of this error execute with --v=5 or higher

How can I reset the node or clear this up?

Referring to your original point, I checked what I did the first time and there was nothing wrong with the packages list or the gpg key, unless I have an outdated version of the instructions or something...?

trevski · April 2024

Update: the drain is waiting for the cilium-operator pod to be deleted, but it appears to be stuck in Terminating status. I've tried force deleting it, but the cluster then spawns another pod to replace it, which gets stuck in Pending status. Aaaargh!

chrispokorni · April 2024

Hi @trevski,

Since the control plane node is already cordoned and you can no longer proceed as instructed, try to uncordon the node. The uncordon command can be found in step 17. Validate that the node is in Ready state, then re-attempt the control plane upgrade process.

If the error persists you may try the suggested flag to ignore the preflight error.
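
As a rough sketch of that sequence (the helper name status_is_ready is hypothetical; the node name is the one from your output):

```shell
# Hypothetical helper: succeed only if a STATUS value from `kubectl get node`
# (for example "Ready,SchedulingDisabled") says the node is Ready.
status_is_ready() {
  case ",$1," in
    *,Ready,*) return 0 ;;   # "Ready" appears as its own comma-separated item
    *)         return 1 ;;   # "NotReady", "Unknown", or an empty value
  esac
}

# Usage on the cp node (not executed here):
#   kubectl uncordon ip-172-31-33-91
#   status="$(kubectl get node ip-172-31-33-91 --no-headers | awk '{print $2}')"
#   status_is_ready "$status" && sudo kubeadm upgrade plan
```

Matching on the comma-separated items avoids treating NotReady as Ready, which a plain substring match would do.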

> unless I have an outdated version of the instructions or something

The course resources are at the latest release, and I based my observation on comparing the output you provided earlier with the commands from the lab guide.

Regards,
-Chris

trevski · April 2024

Thanks Chris. My cp node is still in NotReady status, I think because the cilium-operator pod is stuck at Pending:
NAMESPACE     NAME                               READY   STATUS    RESTARTS   AGE     IP       NODE
kube-system   cilium-operator-788c4f69bc-jn9wz   0/1     Pending   0          6m22s   <none>   ip-172-31-33-91
I've tried connecting to the pod but had no luck. I also tried force deleting it, but then I just get another pod stuck at Pending. I'll keep trying to figure it out; any suggestions welcome.
From describing the pod I can see that it has been scheduled:
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  17m   default-scheduler  Successfully assigned kube-system/cilium-operator-788c4f69bc-jn9wz to ip-172-31-33-91

Here is the complete describe output:
Name:                 cilium-operator-788c4f69bc-jn9wz
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      cilium-operator
Node:                 ip-172-31-33-91/
Labels:               app.kubernetes.io/name=cilium-operator
                      app.kubernetes.io/part-of=cilium
                      io.cilium/app=operator
                      name=cilium-operator
                      pod-template-hash=788c4f69bc
Annotations:
Status:               Pending
IP:
IPs:
Controlled By:        ReplicaSet/cilium-operator-788c4f69bc
Containers:
  cilium-operator:
    Image:      quay.io/cilium/operator-generic:v1.14.1@sha256:e061de0a930534c7e3f8feda8330976367971238ccafff42659f104effd4b5f7
    Port:
    Host Port:
    Command:
      cilium-operator-generic
    Args:
      --config-dir=/tmp/cilium/config-map
      --debug=$(CILIUM_DEBUG)
    Liveness:   http-get http://127.0.0.1:9234/healthz delay=60s timeout=3s period=10s #success=1 #failure=3
    Readiness:  http-get http://127.0.0.1:9234/healthz delay=0s timeout=3s period=5s #success=1 #failure=5
    Environment:
      K8S_NODE_NAME:          (v1:spec.nodeName)
      CILIUM_K8S_NAMESPACE:  kube-system (v1:metadata.namespace)
      CILIUM_DEBUG:          Optional: true
    Mounts:
      /tmp/cilium/config-map from cilium-config-path (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x8m2s (ro)
Conditions:
  Type           Status
  PodScheduled   True
Volumes:
  cilium-config-path:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cilium-config
    Optional:  false
  kube-api-access-x8m2s:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  17m   default-scheduler  Successfully assigned kube-system/cilium-operator-788c4f69bc-jn9wz to ip-172-31-33-91
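
A pod that shows only a Scheduled event and then stays Pending may mean the kubelet on that node has not acted on it yet, so it can help to see whether other pods on the same node are in the same state. A minimal sketch, assuming bash and awk; the helper name pending_pods_on_node is hypothetical:

```shell
# Hypothetical helper: print namespace/name for every Pending pod assigned to
# the given node. Reads `kubectl get pods -A -o wide --no-headers` on stdin,
# where STATUS is the 4th column and the node name appears in a later column.
pending_pods_on_node() {
  local node="$1"
  awk -v node="$node" '
    $4 == "Pending" {
      for (i = 5; i <= NF; i++) {
        if ($i == node) { print $1 "/" $2; break }
      }
    }'
}

# Usage (not executed here):
#   kubectl get pods -A -o wide --no-headers | pending_pods_on_node ip-172-31-33-91
#   sudo journalctl -u kubelet --no-pager -n 50   # run on the node itself
```

If several pods on the node are Pending at once, the kubelet log on that node is usually more informative than the describe output of any single pod.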

chrispokorni · April 2024

Hi @trevski,

What are the outputs of the following commands?

kubectl get nodes -o wide
kubectl get pods -A -o wide

Regards,
-Chris
