Lab 4.5 stress.yaml on worker node not working

Hello,
I've been stuck for days on this one because stress.yaml is not working on the worker node. Without a nodeSelector in the file it runs correctly on my master node, but if I set the worker node name as the selector, the Deployment is created and the Pod remains in Pending.

This is probably because my worker node stays in the "NotReady" state, even though I can run all the commands against the kube-apiserver from the worker node without problems. Both machines are configured with 8 GB of RAM and 20 GB of disk space.
I disabled the firewall, the IPs are static, promiscuous mode is enabled, and I set a single bridged adapter with "Allow All", but I still wasn't able to get past this error.
If I check the logs on the worker node, they show that the node was registered correctly.
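
For context, the edited manifest looks roughly like this (a minimal sketch rather than the exact lab file; the Deployment name matches my setup and the stress image is a placeholder):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: stressmeout                           # name from my setup
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: stressmeout
      template:
        metadata:
          labels:
            app: stressmeout
        spec:
          nodeSelector:
            kubernetes.io/hostname: worker-node   # pins the Pod to the worker node
          containers:
          - name: stress
            image: vish/stress                    # placeholder stress image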

Thank you

Comments

  • chrispokorni Posts: 2,372
    edited August 21

    Hi @alessandro.cinelli,

    A cluster node in the "NotReady" state, as you noted, implies a cluster that has not been fully bootstrapped and/or configured, or readiness conditions that are no longer met as a result of unfavorable cluster events. Being able to run certain commands from the worker node against the API server is not an indication of either node's readiness for scheduling purposes.

    To get a better picture of your cluster, please provide the outputs (as code-formatted text, not as screenshots) of the following commands:

    kubectl get nodes -o wide
    kubectl get pods -A -o wide
    kubectl describe node cp-node-name
    kubectl describe node worker-node-name
    

    Regards,
    -Chris

  • Hello Chris,
    thank you for your time.


    As you can see here, most of the Pods are stuck on both the master and the worker node.


    Thank you

  • chrispokorni Posts: 2,372

    Hi @alessandro.cinelli,

    Let's clean up your cluster by removing the basicpod Pod. Also remove the nginx, try1, and stressmeout Deployments. After their respective Pods are completely removed and resources released, please provide fresh outputs of the 4 commands I requested earlier (as code-formatted text for better readability, NOT screenshots).
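
    For example, assuming they are all in the default namespace:

    kubectl delete pod basicpod
    kubectl delete deployment nginx try1 stressmeout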

    In addition, provide the output of
    kubectl get svc,ap -A

    Regards,
    -Chris

  • Hello @chrispokorni,
    I had to force delete all the Pods because they got stuck in the Terminating state for some reason.
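
    For reference, the force delete looked roughly like this for each stuck Pod:

    kubectl delete pod <pod-name> --grace-period=0 --force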

    kubectl get nodes -o wide

    NAME          STATUS     ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
    master-node   Ready      control-plane   12d   v1.30.1   192.168.1.247   <none>        Ubuntu 20.04.6 LTS   5.4.0-192-generic   containerd://1.7.19
    worker-node   NotReady   <none>          12d   v1.30.1   192.168.1.246   <none>        Ubuntu 20.04.6 LTS   5.4.0-192-generic   containerd://1.7.19
    

    kubectl get pods -A -o wide

    NAMESPACE     NAME                                  READY   STATUS        RESTARTS        AGE     IP              NODE          NOMINATED NODE   READINESS GATES
    default       registry-7f7bcbd5bb-kjphz             0/1     Pending       0               6m20s   <none>          <none>        <none>           <none>
    default       registry-7f7bcbd5bb-qlx6g             0/1     Error         0               17h     <none>          master-node   <none>           <none>
    default       registry-7f7bcbd5bb-x9xg5             1/1     Terminating   0               3d20h   10.0.1.13       worker-node   <none>           <none>
    kube-system   cilium-89jvg                          1/1     Running       145 (10d ago)   12d     192.168.1.246   worker-node   <none>           <none>
    kube-system   cilium-envoy-mccfq                    1/1     Running       11 (7m3s ago)   12d     192.168.1.247   master-node   <none>           <none>
    kube-system   cilium-envoy-rk8jj                    1/1     Running       3 (5d21h ago)   12d     192.168.1.246   worker-node   <none>           <none>
    kube-system   cilium-operator-7ddc48bb97-4b69m      1/1     Running       45 (7m3s ago)   12d     192.168.1.247   master-node   <none>           <none>
    kube-system   cilium-pv27t                          1/1     Running       12 (7m3s ago)   12d     192.168.1.247   master-node   <none>           <none>
    kube-system   coredns-7db6d8ff4d-jrn5s              1/1     Running       1 (7m3s ago)    17h     10.0.0.238      master-node   <none>           <none>
    kube-system   coredns-7db6d8ff4d-pmp9m              1/1     Running       1 (7m3s ago)    17h     10.0.0.90       master-node   <none>           <none>
    kube-system   etcd-master-node                      1/1     Running       17 (7m3s ago)   12d     192.168.1.247   master-node   <none>           <none>
    kube-system   kube-apiserver-master-node            1/1     Running       16 (7m3s ago)   12d     192.168.1.247   master-node   <none>           <none>
    kube-system   kube-controller-manager-master-node   1/1     Running       42 (7m3s ago)   12d     192.168.1.247   master-node   <none>           <none>
    kube-system   kube-proxy-fqxsf                      1/1     Running       11 (7m3s ago)   12d     192.168.1.247   master-node   <none>           <none>
    kube-system   kube-proxy-pjrjm                      1/1     Running       3 (5d21h ago)   12d     192.168.1.246   worker-node   <none>           <none>
    kube-system   kube-scheduler-master-node            1/1     Running       45 (7m3s ago)   12d     192.168.1.247   master-node   <none>           <none>
    
  • kubectl describe node master-node

    Name:               master-node
    Roles:              control-plane
    Labels:             beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        kubernetes.io/arch=amd64
                        kubernetes.io/hostname=master-node
                        kubernetes.io/os=linux
                        node-role.kubernetes.io/control-plane=
                        node.kubernetes.io/exclude-from-external-load-balancers=
    Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                        node.alpha.kubernetes.io/ttl: 0
                        volumes.kubernetes.io/controller-managed-attach-detach: true
    CreationTimestamp:  Fri, 09 Aug 2024 17:07:12 +0000
    Taints:             node.kubernetes.io/disk-pressure:NoSchedule
    Unschedulable:      false
    Lease:
      HolderIdentity:  master-node
      AcquireTime:     <unset>
      RenewTime:       Thu, 22 Aug 2024 13:55:45 +0000
    Conditions:
      Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----                 ------  -----------------                 ------------------                ------                       -------
      NetworkUnavailable   False   Fri, 09 Aug 2024 17:08:38 +0000   Fri, 09 Aug 2024 17:08:38 +0000   CiliumIsUp                   Cilium is running on this node
      MemoryPressure       False   Thu, 22 Aug 2024 13:55:35 +0000   Thu, 22 Aug 2024 13:45:53 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
      DiskPressure         True    Thu, 22 Aug 2024 13:55:35 +0000   Thu, 22 Aug 2024 13:48:11 +0000   KubeletHasDiskPressure       kubelet has disk pressure
      PIDPressure          False   Thu, 22 Aug 2024 13:55:35 +0000   Thu, 22 Aug 2024 13:45:53 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
      Ready                True    Thu, 22 Aug 2024 13:55:35 +0000   Thu, 22 Aug 2024 13:46:04 +0000   KubeletReady                 kubelet is posting ready status
    Addresses:
      InternalIP:  192.168.1.247
      Hostname:    master-node
    Capacity:
      cpu:                2
      ephemeral-storage:  10218772Ki
      hugepages-2Mi:      0
      memory:             8136660Ki
      pods:               110
    Allocatable:
      cpu:                2
      ephemeral-storage:  9417620260
      hugepages-2Mi:      0
      memory:             8034260Ki
      pods:               110
    System Info:
      Machine ID:                 36c8c7fa32cb4042b079d8b23e47e39b
      System UUID:                8a163bd9-1515-0f4b-b635-f21ee64703ac
      Boot ID:                    3ef750e2-a915-4953-9a2c-15784d2a6cc8
      Kernel Version:             5.4.0-192-generic
      OS Image:                   Ubuntu 20.04.6 LTS
      Operating System:           linux
      Architecture:               amd64
      Container Runtime Version:  containerd://1.7.19
      Kubelet Version:            v1.30.1
      Kube-Proxy Version:         v1.30.1
    PodCIDR:                      10.0.0.0/24
    PodCIDRs:                     10.0.0.0/24
    Non-terminated Pods:          (10 in total)
      Namespace                   Name                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
      ---------                   ----                                   ------------  ----------  ---------------  -------------  ---
      kube-system                 cilium-envoy-mccfq                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         12d
      kube-system                 cilium-operator-7ddc48bb97-4b69m       0 (0%)        0 (0%)      0 (0%)           0 (0%)         12d
      kube-system                 cilium-pv27t                           100m (5%)     0 (0%)      10Mi (0%)        0 (0%)         12d
      kube-system                 coredns-7db6d8ff4d-jrn5s               100m (5%)     0 (0%)      70Mi (0%)        170Mi (2%)     17h
      kube-system                 coredns-7db6d8ff4d-pmp9m               100m (5%)     0 (0%)      70Mi (0%)        170Mi (2%)     17h
      kube-system                 etcd-master-node                       100m (5%)     0 (0%)      100Mi (1%)       0 (0%)         12d
      kube-system                 kube-apiserver-master-node             250m (12%)    0 (0%)      0 (0%)           0 (0%)         12d
      kube-system                 kube-controller-manager-master-node    200m (10%)    0 (0%)      0 (0%)           0 (0%)         12d
      kube-system                 kube-proxy-fqxsf                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         12d
      kube-system                 kube-scheduler-master-node             100m (5%)     0 (0%)      0 (0%)           0 (0%)         12d
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource           Requests    Limits
      --------           --------    ------
      cpu                950m (47%)  0 (0%)
      memory             250Mi (3%)  340Mi (4%)
      ephemeral-storage  0 (0%)      0 (0%)
      hugepages-2Mi      0 (0%)      0 (0%)
    Events:
      Type     Reason                   Age                    From             Message
      ----     ------                   ----                   ----             -------
      Normal   Starting                 7m49s                  kube-proxy
      Normal   NodeHasSufficientMemory  10m (x23 over 22h)     kubelet          Node master-node status is now: NodeHasSufficientMemory
      Normal   NodeHasNoDiskPressure    10m (x23 over 22h)     kubelet          Node master-node status is now: NodeHasNoDiskPressure
      Normal   NodeHasSufficientPID     10m (x23 over 22h)     kubelet          Node master-node status is now: NodeHasSufficientPID
      Warning  FreeDiskSpaceFailed      9m59s                  kubelet          Failed to garbage collect required amount of images. Attempted to free 462125465 bytes, but only found 0 bytes eligible to free.
      Warning  ImageGCFailed            9m59s                  kubelet          Failed to garbage collect required amount of images. Attempted to free 462125465 bytes, but only found 0 bytes eligible to free.
      Normal   NodeReady                9m49s (x35 over 22h)   kubelet          Node master-node status is now: NodeReady
      Warning  EvictionThresholdMet     9m41s                  kubelet          Attempting to reclaim ephemeral-storage
      Normal   Starting                 7m58s                  kubelet          Starting kubelet.
      Warning  InvalidDiskCapacity      7m58s                  kubelet          invalid capacity 0 on image filesystem
      Normal   NodeHasSufficientMemory  7m58s (x8 over 7m58s)  kubelet          Node master-node status is now: NodeHasSufficientMemory
      Normal   NodeHasNoDiskPressure    7m58s (x7 over 7m58s)  kubelet          Node master-node status is now: NodeHasNoDiskPressure
      Normal   NodeHasSufficientPID     7m58s (x7 over 7m58s)  kubelet          Node master-node status is now: NodeHasSufficientPID
      Normal   NodeAllocatableEnforced  7m58s                  kubelet          Updated Node Allocatable limit across pods
      Normal   RegisteredNode           7m17s                  node-controller  Node master-node event: Registered Node master-node in Controller
      Warning  FreeDiskSpaceFailed      2m56s                  kubelet          Failed to garbage collect required amount of images. Attempted to free 614336921 bytes, but only found 0 bytes eligible to free.
    
  • kubectl describe node worker-node

    Name:               worker-node
    Roles:              <none>
    Labels:             beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        kubernetes.io/arch=amd64
                        kubernetes.io/hostname=worker-node
                        kubernetes.io/os=linux
    Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                        node.alpha.kubernetes.io/ttl: 0
                        volumes.kubernetes.io/controller-managed-attach-detach: true
    CreationTimestamp:  Fri, 09 Aug 2024 17:53:28 +0000
    Taints:             node.kubernetes.io/unreachable:NoExecute
                        node.cilium.io/agent-not-ready:NoSchedule
                        node.kubernetes.io/unreachable:NoSchedule
    Unschedulable:      false
    Lease:
      HolderIdentity:  worker-node
      AcquireTime:     <unset>
      RenewTime:       Sun, 18 Aug 2024 20:02:33 +0000
    Conditions:
      Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
      ----                 ------    -----------------                 ------------------                ------              -------
      NetworkUnavailable   False     Fri, 09 Aug 2024 17:54:51 +0000   Fri, 09 Aug 2024 17:54:51 +0000   CiliumIsUp          Cilium is running on this node
      MemoryPressure       Unknown   Sun, 18 Aug 2024 19:59:12 +0000   Sun, 18 Aug 2024 20:06:59 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
      DiskPressure         Unknown   Sun, 18 Aug 2024 19:59:12 +0000   Sun, 18 Aug 2024 20:06:59 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
      PIDPressure          Unknown   Sun, 18 Aug 2024 19:59:12 +0000   Sun, 18 Aug 2024 20:06:59 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
      Ready                Unknown   Sun, 18 Aug 2024 19:59:12 +0000   Sun, 18 Aug 2024 20:06:59 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
    Addresses:
      InternalIP:  192.168.1.246
      Hostname:    worker-node
    Capacity:
      cpu:                2
      ephemeral-storage:  10206484Ki
      hugepages-2Mi:      0
      memory:             4014036Ki
      pods:               110
    Allocatable:
      cpu:                2
      ephemeral-storage:  9406295639
      hugepages-2Mi:      0
      memory:             3911636Ki
      pods:               110
    System Info:
      Machine ID:                 082e100535c54c5986ddff0a8176ab60
      System UUID:                ffedeeca-323b-884d-a0fe-9218f3961f9a
      Boot ID:                    0ef64691-265d-4e12-bbbc-46a80c288f22
      Kernel Version:             5.4.0-192-generic
      OS Image:                   Ubuntu 20.04.6 LTS
      Operating System:           linux
      Architecture:               amd64
      Container Runtime Version:  containerd://1.7.19
      Kubelet Version:            v1.30.1
      Kube-Proxy Version:         v1.30.1
    PodCIDR:                      10.0.1.0/24
    PodCIDRs:                     10.0.1.0/24
    Non-terminated Pods:          (4 in total)
      Namespace                   Name                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
      ---------                   ----                         ------------  ----------  ---------------  -------------  ---
      default                     registry-7f7bcbd5bb-x9xg5    0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d20h
      kube-system                 cilium-89jvg                 100m (5%)     0 (0%)      10Mi (0%)        0 (0%)         12d
      kube-system                 cilium-envoy-rk8jj           0 (0%)        0 (0%)      0 (0%)           0 (0%)         12d
      kube-system                 kube-proxy-pjrjm             0 (0%)        0 (0%)      0 (0%)           0 (0%)         12d
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource           Requests   Limits
      --------           --------   ------
      cpu                100m (5%)  0 (0%)
      memory             10Mi (0%)  0 (0%)
      ephemeral-storage  0 (0%)     0 (0%)
      hugepages-2Mi      0 (0%)     0 (0%)
    Events:
      Type    Reason          Age   From             Message
      ----    ------          ----  ----             -------
      Normal  RegisteredNode  8m3s  node-controller  Node worker-node event: Registered Node worker-node in Controller
    

    kubectl get svc,ap -A
    error: the server doesn't have a resource type "ap"

    (probably a typo?)

  • chrispokorni Posts: 2,372

    Hi @alessandro.cinelli,

    Yes, it was a typo on my part... it was meant to be svc,ep for Services and Endpoints :)
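
    i.e.:

    kubectl get svc,ep -A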

    It is unclear to me why the registry Deployment shows 3 replicas, the same as the nginx Deployment we removed earlier. Perhaps a describe of those Pods reveals why they are not in Running state; then try removing the registry Deployment as well (it may require a force deletion of some of its Pods).
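
    For example (Pod names taken from your earlier output):

    kubectl -n default describe pod registry-7f7bcbd5bb-kjphz
    kubectl -n default delete deployment registry
    kubectl -n default delete pod registry-7f7bcbd5bb-x9xg5 --grace-period=0 --force   # only if it stays stuck in Terminating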

    It is also odd that the kubelet is not able to complete its garbage collection cycle to reclaim disk space. What do the kubelet logs show on the control plane node?
    journalctl -u kubelet | less
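
    To focus on the disk and eviction related entries, something along these lines may help:

    journalctl -u kubelet --since "1 hour ago" | grep -iE "disk|garbage|evict"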

    What images are stored on the control plane node?
    sudo podman images
    sudo crictl images

    How is disk allocated to the VMs (dynamically, or pre-allocated at full size)?

    Regards,
    -Chris

  • Hi @chrispokorni
    journalctl -u kubelet | less
    These are the last lines:
    d" interval="800ms" Aug 10 09:06:06 master-node kubelet[684]: E0810 09:06:06.611796 684 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"fca7dd05dc29cb147dc0e5115690f4a297601411d4d8fd37919c7e9a19b4b212\": not found" podSandboxID="fca7dd05dc29cb147dc0e5115690f4a297601411d4d8fd37919c7e9a19b4b212" Aug 10 09:06:06 master-node kubelet[684]: I0810 09:06:06.613663 684 kubelet_node_status.go:73] "Attempting to register node" node="master-node" Aug 10 09:06:06 master-node kubelet[684]: E0810 09:06:06.613868 684 kubelet_node_status.go:96] "Unable to register node with API server" err="Post \"https://192.168.1.247:6443/api/v1/nodes\": dial tcp 192.168.1.247:6443: connect: connection refused" node="master-node" Aug 10 09:06:06 master-node kubelet[684]: W0810 09:06:06.681079 684 reflector.go:547] k8s.io/client-go/informers/factory.go:160: failed to list *v1.Service: Get "https://192.168.1.247:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 192.168.1.247:6443: connect: connection refused Aug 10 09:06:06 master-node kubelet[684]: E0810 09:06:06.681147 684 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://192.168.1.247:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 192.168.1.247:6443: connect: connection refused Aug 10 09:06:06 master-node kubelet[684]: W0810 09:06:06.994227 684 reflector.go:547] k8s.io/client-go/informers/factory.go:160: failed to list *v1.Node: Get "https://192.168.1.247:6443/api/v1/nodes?fieldSelector=metadata.name%3Dmaster-node&limit=500&resourceVersion=0": dial tcp 192.168.1.247:6443: connect: connection refused Aug 10 09:06:06 master-node kubelet[684]: E0810 09:06:06.994318 684 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://192.168.1.247:6443/api/v1/nodes?fieldSelector=metadata.name%3Dmaster-node&limit=500&resourceVersion=0": dial tcp 192.168.1.247:6443: connect: connection refused Aug 10 09:06:07 master-node kubelet[684]: W0810 09:06:07.191887 684 reflector.go:547] k8s.io/client-go/informers/factory.go:160: failed to list *v1.CSIDriver: Get "https://192.168.1.247:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp 192.168.1.247:6443: connect: connection refused Aug 10 09:06:07 master-node kubelet[684]: E0810 09:06:07.199179 684 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.CSIDriver: failed to list *v1.CSIDriver: Get "https://192.168.1.247:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp 192.168.1.247:6443: connect: connection refused Aug 10 09:06:07 master-node kubelet[684]: E0810 09:06:07.287256 684 controller.go:145] "Failed to ensure lease exists, will retry" err="Get \"https://192.168.1.247:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/master-node?timeout=10s\": dial tcp 192.168.1.247:6443: connect: connection refused" interval="1.6s" Aug 10 09:06:07 master-node kubelet[684]: W0810 09:06:07.292373 684 reflector.go:547] k8s.io/client-go/informers/factory.go:160: failed to list *v1.RuntimeClass: Get "https://192.168.1.247:6443/apis/node.k8s.io/v1/runtimeclasses?limit=500&resourceVersion=0": dial tcp 192.168.1.247:6443: connect: connection refused Aug 10 09:06:07 master-node kubelet[684]: E0810 09:06:07.292469 684 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to 
watch *v1.RuntimeClass: failed to list *v1.RuntimeClass: Get "https://192.168.1.247:6443/apis/node.k8s.io/v1/runtimeclasses?limit=500&resourceVersion=0": dial tcp 192.168.1.247:6443: connect: connection refused Aug 10 09:06:07 master-node kubelet[684]: I0810 09:06:07.419159 684 kubelet_node_status.go:73] "Attempting to register node" node="master-node" Aug 10 09:06:07 master-node kubelet[684]: E0810 09:06:07.419953 684 kubelet_node_status.go:96] "Unable to register node with API server" err="Post \"https://192.168.1.247:6443/api/v1/nodes\": dial tcp 192.168.1.247:6443: connect: connection refused" node="master-node" Aug 10 09:06:07 master-node kubelet[684]: I0810 09:06:07.587543 684 kuberuntime_container_linux.go:167] "No swap cgroup controller present" swapBehavior="" pod="kube-system/kube-controller-manager-master-node" containerName="kube-controller-manager" Aug 10 09:06:07 master-node kubelet[684]: I0810 09:06:07.588138 684 kuberuntime_container_linux.go:167] "No swap cgroup controller present" swapBehavior="" pod="kube-system/kube-apiserver-master-node" containerName="kube-apiserver" Aug 10 09:06:07 master-node kubelet[684]: I0810 09:06:07.602022 684 kuberuntime_container_linux.go:167] "No swap cgroup controller present" swapBehavior="" pod="kube-system/etcd-master-node" containerName="etcd" Aug 10 09:06:07 master-node kubelet[684]: I0810 09:06:07.619495 684 kuberuntime_container_linux.go:167] "No swap cgroup controller present" swapBehavior="" pod="kube-system/kube-scheduler-master-node" containerName="kube-scheduler" Aug 10 09:06:08 master-node kubelet[684]: W0810 09:06:08.636848 684 reflector.go:547] k8s.io/client-go/informers/factory.go:160: failed to list *v1.Node: Get "https://192.168.1.247:6443/api/v1/nodes?fieldSelector=metadata.name%3Dmaster-node&limit=500&resourceVersion=0": dial tcp 192.168.1.247:6443: connect: connection refused Aug 10 09:06:08 master-node kubelet[684]: E0810 09:06:08.636933 684 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://192.168.1.247:6443/api/v1/nodes?fieldSelector=metadata.name=master-node&amp;limit=500&amp;resourceVersion=0": dial tcp 192.168.1.247:6443: connect: connection refused
    Aug 10 09:06:08 master-node kubelet[684]: E0810 09:06:08.888865 684 controller.go:145] "Failed to ensure lease exists, will retry" err="Get \"https://192.168.1.247:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/master-node?timeout=10s\": dial tcp 192.168.1.247:6443:

  • alessandro.cinelli Posts: 17
    edited August 22

    sudo podman images

    WARN[0000] Using cgroups-v1 which is deprecated in favor of cgroups-v2 with Podman v5 and will be removed in a future version. Set environment variable `PODMAN_IGNORE_CGROUPSV1_WARNING` to hide this warning.
    REPOSITORY                  TAG         IMAGE ID      CREATED      SIZE
    10.97.40.62:5000/simpleapp  latest      ad2f4faa05bd  4 days ago   1.04 GB
    localhost/simpleapp         latest      ad2f4faa05bd  4 days ago   1.04 GB
    docker.io/library/python    3           0218518c77be  2 weeks ago  1.04 GB
    10.97.40.62:5000/tagtest    latest      324bc02ae123  4 weeks ago  8.08 MB
    
    

    sudo crictl images

    IMAGE                                     TAG                 IMAGE ID            SIZE
    quay.io/cilium/cilium-envoy               <none>              b9d596d6e2d4f       62.1MB
    quay.io/cilium/cilium                     <none>              1e01581279341       223MB
    quay.io/cilium/operator-generic           <none>              e7e6117055af8       31.1MB
    registry.k8s.io/coredns/coredns           v1.11.1             cbb01a7bd410d       18.2MB
    registry.k8s.io/etcd                      3.5.12-0            3861cfcd7c04c       57.2MB
    registry.k8s.io/kube-apiserver            v1.30.1             91be940803172       32.8MB
    registry.k8s.io/kube-controller-manager   v1.30.1             25a1387cdab82       31.1MB
    registry.k8s.io/kube-proxy                v1.30.1             747097150317f       29MB
    registry.k8s.io/kube-scheduler            v1.30.1             a52dc94f0a912       19.3MB
    registry.k8s.io/pause                     3.8                 4873874c08efc       311kB
    

    The disk is actually supposed to be dynamically allocated.

    Thank you again

  • chrispokorni Posts: 2,372

    Hi @alessandro.cinelli,

    The disk is actually supposed to be dynamically allocated.

    This is your problem. The kubelet only sees the disk space that has actually been allocated, not the maximum the dynamic disk could eventually grow to.
    Please fully allocate the disk for both VMs to prevent kubelet panics.
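
    If you are on VirtualBox (guessing from your bridged adapter setup), an existing dynamic disk can be converted offline with something along these lines (file names are examples only):

    VBoxManage clonemedium disk ubuntu-dynamic.vdi ubuntu-fixed.vdi --variant Fixed

    The VM then needs to be reconfigured to use the new fixed-size disk.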

    Regards,
    -Chris

  • Hi @chrispokorni,
    as you suggested, I changed the disk to a pre-allocated one, but I'm still getting the same issue. Is there something else I'm supposed to do?

  • chrispokorni Posts: 2,372

    Hi @alessandro.cinelli,

    Have the VMs been restarted? What are the ephemeral-storage values under Capacity and Allocatable respectively when describing the 2 nodes?

    Regards,
    -Chris

  • Hi @chrispokorni,
    yes, the VMs have been restarted.

    Worker node:

    Capacity:
      cpu:                2
      ephemeral-storage:  10206484Ki
      hugepages-2Mi:      0
      memory:             4014036Ki
      pods:               110
    Allocatable:
      cpu:                2
      ephemeral-storage:  9406295639
      hugepages-2Mi:      0
      memory:             3911636Ki
      pods:               110
    

    Master Node

    Capacity:
      cpu:                2
      ephemeral-storage:  10218772Ki
      hugepages-2Mi:      0
      memory:             8136664Ki
      pods:               110
    Allocatable:
      cpu:                2
      ephemeral-storage:  9417620260
      hugepages-2Mi:      0
      memory:             8034264Ki
      pods:               110
    

    But if I look at the latest events on the master node I can see this, and I probably shouldn't:

      Warning  InvalidDiskCapacity      5m32s                  kubelet          invalid capacity 0 on image filesystem
      Normal   NodeHasSufficientMemory  5m32s (x8 over 5m32s)  kubelet          Node master-node status is now: NodeHasSufficientMemory
      Normal   NodeHasNoDiskPressure    5m32s (x7 over 5m32s)  kubelet          Node master-node status is now: NodeHasNoDiskPressure
      Normal   NodeHasSufficientPID     5m32s (x7 over 5m32s)  kubelet          Node master-node status is now: NodeHasSufficientPID
      Normal   NodeAllocatableEnforced  5m32s                  kubelet          Updated Node Allocatable limit across pods
      Normal   RegisteredNode           2m38s                  node-controller  Node master-node event: Registered Node master-node in Controller
      Warning  FreeDiskSpaceFailed      30s                    kubelet          Failed to garbage collect required amount of images. Attempted to free 722049433 bytes, but only found 0 bytes eligible to free.
    
  • chrispokorni Posts: 2,372

    Hi @alessandro.cinelli,

    What is the size of the vdisk on each VM?
    According to your output, the vdisks seem to be about 10 GB each. The lab guide recommendation is 20+ GB per VM. For earlier Kubernetes releases, 10 GB per VM used to be just enough to run the lab exercises - Kubernetes required less disk space, and the container images were somewhat smaller in size.

    Regards,
    -Chris

  • Hi @chrispokorni
    It's actually 20 GB for both the master and the worker node, but it still says 10 GB of space available.

  • chrispokorni Posts: 2,372

    Hi @alessandro.cinelli,

    The key here is what the kubelet node agent sees. If it sees 10 GB, it only works with 10 GB; it seems to be unaware of the additional 10 GB of storage.

    I'd be curious what the regular user (non-root) sees on the guest OS with df -h --total.

    If the vdisks were extended after the OS installation, then this behavior would be expected, as the file system would be unaware of the additional vdisk space as well, requiring a file system resize to expand to the available vdisk size.
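
    If that turns out to be the case, a typical resize on Ubuntu looks roughly like this (device and partition names are examples only - confirm yours with lsblk first; this assumes an ext4 root filesystem without LVM):

    lsblk                      # inspect the disk and partition layout
    sudo growpart /dev/sda 1   # grow partition 1 to fill the disk (growpart comes with cloud-guest-utils)
    sudo resize2fs /dev/sda1   # grow the ext4 filesystem to the new partition size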

    Regards,
    -Chris

  • Hello @chrispokorni,
    Thank you for helping me out again. I ended up redoing everything from scratch, since there was something wrong with my worker VM and its virtual drive, but I finally managed to solve the issue and move forward with the remaining steps of the lab.

    Thank you
