Recommendation: 20GB+ of disk space per node (VM) to avoid the "Evicted pod death-spiral"

malloc_failed · October 2023

Hopefully this helps others who have worked through the LFS258 material over a longer period and ended up with dozens of pods in the "Evicted" state around Lab 11.1, 11.2, or later.

Long-running pods from earlier labs gradually use more local disk over time, until the nodes start evicting any newly scheduled pods. This turns into a "death-spiral" because the only way to fully reclaim the space is to undeploy ~everything. Since the labs also use the CP node as a worker node, this impacts ~everything else too.
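For anyone who wants to confirm they are in this state, here is a rough Python sketch that lists evicted pods and counts them per node. It assumes `kubectl` is on your PATH and pointed at the lab cluster, and it relies on evicted pods reporting `status.phase` "Failed" with `status.reason` "Evicted". The function names are mine, not anything from the course material.

```python
import json
import subprocess
from collections import Counter


def fetch_all_pods():
    """Return the parsed output of `kubectl get pods --all-namespaces -o json`.

    Assumes kubectl is installed and configured for the lab cluster.
    """
    result = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def evicted_pods(pod_list):
    """Return (namespace, name, node) for every pod whose status reports an eviction."""
    found = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
            meta = pod.get("metadata", {})
            node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
            found.append((meta.get("namespace"), meta.get("name"), node))
    return found


def evictions_per_node(pod_list):
    """Count evicted pods per node, to see which node is short on disk."""
    return Counter(node for _, _, node in evicted_pods(pod_list))
```

Calling `evictions_per_node(fetch_all_pods())` on the CP node should show which node is doing the evicting. If the counts keep climbing between runs, you are probably in the spiral described above.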

One solution is simply not to leave your environment running overnight once linkerd & linkerd-viz are deployed, since that is the first lab that really consumes a lot of space over time.

Alternatively, increase the disk space on your VMs to at least 20GB, which should give you enough buffer to get through the end of all the labs. (Search "growpart" and "lvextend" online, and "qemu-img resize" if you're on KVM.) In my case, adding the ingress controller pushed things into an endless spiral of ephemeral disk space being claimed and pods then being evicted, so it was far faster to resize the VMs and "throw space at the problem" than to reconfigure external volumes, edit logging, etc.
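To get a feel for how much buffer a node actually has, here is a small Python sketch that estimates how many bytes can still be written before free space drops below the eviction threshold. It assumes the kubelet's default hard threshold for the node filesystem on Linux (nodefs.available<10%); your cluster may be configured differently, so treat the 10 as an assumption. The function names and the default path are illustrative only.

```python
import shutil

GIB = 1024 ** 3

# Assumption: kubelet default hard eviction threshold on Linux,
# nodefs.available<10%. A cluster may override this value.
NODEFS_MIN_FREE_PERCENT = 10


def eviction_headroom_bytes(total, free, min_free_percent=NODEFS_MIN_FREE_PERCENT):
    """Bytes that can still be written before free space falls below the threshold.

    A negative result means the filesystem is already under the threshold.
    """
    threshold = total * min_free_percent // 100
    return free - threshold


def report(path="/"):
    """Describe the headroom for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    headroom = eviction_headroom_bytes(usage.total, usage.free)
    return (
        f"{path}: {usage.free / GIB:.1f} GiB free of {usage.total / GIB:.1f} GiB, "
        f"{headroom / GIB:.1f} GiB of headroom before the threshold"
    )
```

Running `print(report("/"))` on each node before and after resizing gives a rough idea of how much room the resize bought you. Point it at whichever filesystem holds the kubelet and container runtime data on your VMs.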

Hopefully that helps someone else too.
