
master node not available after rebooting EC2


I stopped my EC2 VMs, and after restarting them the master node is not available:

ubuntu@ip-xxx-xxx-xx-xxx:~$ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Tue 2019-01-08 19:12:13 UTC; 9min ago
     Docs: https://kubernetes.io/docs/home/
 Main PID: 1288 (kubelet)
    Tasks: 21
   Memory: 95.0M
      CPU: 1min 7.658s
   CGroup: /system.slice/kubelet.service
           ├─1288 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --network-plugin=cni
           └─4535 /opt/cni/bin/calico

Jan 08 19:22:03 ip-xxx-xxx-xx-xxx kubelet[1288]: E0108 19:22:03.267101    1288 kubelet.go:2236] node "ip-xxx-xxx-xx-xxx" not found
Jan 08 19:22:03 ip-xxx-xxx-xx-xxx kubelet[1288]: E0108 19:22:03.432768    1288 kubelet.go:2236] node "ip-xxx-xxx-xx-xxx" not found
Jan 08 19:22:03 ip-xxx-xxx-xx-xxx kubelet[1288]: E0108 19:22:03.597614    1288 kubelet.go:2236] node "ip-xxx-xxx-xx-xxx" not found
Jan 08 19:22:04 ip-xxx-xxx-xx-xxx kubelet[1288]: 2019-01-08 19:22:04.089 [INFO][4535] calico.go 341: Extracted identifiers ContainerID="2af3731145389d511fb6c156e6fbcf5adb586d7290aa32f7d523ea80dceeb45b" Node="ip-xxx-xxx-xx-xxx" Orchestr
Jan 08 19:22:04 ip-xxx-xxx-xx-xxx kubelet[1288]: 2019-01-08 19:22:04.089 [INFO][4535] client.go 202: Loading config from environment
Jan 08 19:22:04 ip-xxx-xxx-xx-xxx kubelet[1288]: E0108 19:22:04.169439    1288 kubelet.go:2236] node "ip-xxx-xxx-xx-xxx" not found
Jan 08 19:22:04 ip-xxx-xxx-xx-xxx kubelet[1288]: E0108 19:22:04.334586    1288 azure_dd.go:147] failed to get azure cloud in GetVolumeLimits, plugin.host: ip-xxx-xxx-xx-xxx
Jan 08 19:22:04 ip-xxx-xxx-xx-xxx kubelet[1288]: E0108 19:22:04.824445    1288 kubelet.go:2236] node "ip-xxx-xxx-xx-xxx" not found
Jan 08 19:22:04 ip-xxx-xxx-xx-xxx kubelet[1288]: E0108 19:22:04.945939    1288 kubelet.go:2236] node "ip-xxx-xxx-xx-xxx" not found
Jan 08 19:22:05 ip-xxx-xxx-xx-xxx kubelet[1288]: E0108 19:22:05.071637    1288 kubelet.go:2236] node "ip-xxx-xxx-xx-xxx" not found

I assume it should be possible to reboot, or stop and restart, the VMs.

Comments

  • chrispokorni Posts: 2,165

    Hi @crixo,
    It has been a while since I worked on AWS, but I remember being able to stop instances and then start them back up when I was ready to continue with my labs.
    -Chris

  • chrispokorni Posts: 2,165

    @crixo
    I went through lab 2.1 on two EC2 instances on AWS and had no trouble completing it, stopping and then starting my instances. Aside from being assigned new public IPs, the node IPs remained the same, while the pod IPs changed. I was also able to retest the service by accessing the nginx webserver via curl and a browser.
    Can you provide any other details?
    Can you look into the bootstrap-config, kubelet-config or config files to see whether the master IP has changed?
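    For example, something along these lines (just a sketch, assuming the default kubeadm paths) should show whether the addresses baked into the configs still match the node:

    hostname -I
    sudo grep -n "server:" /etc/kubernetes/kubelet.conf
    sudo grep -n "advertise-address" /etc/kubernetes/manifests/kube-apiserver.yaml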
    Thanks,
    -Chris

  • crixo Posts: 31

    Hi @chrispokorni,
    I destroyed the previous VM and, after creating a new one, I was able to reboot and continue working with the cluster.
    I noticed the VM went wild because kswapd0 was using most of the CPU.
    Since the lab setup script suggests executing "sudo swapoff -a", I added the same command to /etc/rc.local on the AWS VM so it is executed after each reboot (rough sketch below).
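
    Roughly, the change looks like this (a minimal sketch; on Ubuntu, rc-local.service only runs /etc/rc.local if the file exists and is executable, so the chmod matters):

    #!/bin/sh -e
    # /etc/rc.local - turn swap off on every boot so the kubelet does not complain
    swapoff -a
    exit 0

    sudo chmod +x /etc/rc.local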

  • serewicz Posts: 1,000

    AWS instances typically have swap disabled by default, at least the ones I've looked at and used. Perhaps it was something else? Did any of the containers restart? If there is a lot of activity and not enough resources - for example, when only one node in the cluster is ready - the termination of running containers due to OOM issues can cause a stampede of restarts. After a reboot, the worker node is ready to share the workload again and things work better.
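
    To check whether that happened, something along these lines (generic commands, not specific to the lab) would show container restarts and any kernel OOM kills:

    kubectl get pods --all-namespaces     # look at the RESTARTS column
    sudo dmesg | grep -i "out of memory"  # kernel OOM-killer messages, if any
    sudo journalctl -k | grep -i oom      # same messages via the systemd journal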

    Glad it's working now.

    Regards,

  • crixo Posts: 31

    Hi @serewicz, thanks a lot for clarifying. How do I check whether the AWS instance has swap disabled?
    If it is disabled, is it expected that kswapd0 is still active, actually very active, among the processes listed by top?

  • chrispokorni Posts: 2,165
    edited January 2019

    Hi @crixo,
    You can verify swap by running one of the following:

    cat /etc/fstab

    cat /proc/swaps

    swapon -s

    free -h
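
    If any of those show active swap, the usual fix (just a sketch, not something from the lab guide) is to turn it off and comment out the swap entry in /etc/fstab so it stays off across reboots:

    sudo swapoff -a
    sudo sed -i '/ swap / s/^/#/' /etc/fstab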

    There seems to be a known issue with kswapd0 using a lot of CPU, and there are a few solutions posted online, but I have not tried any of them, so I am not sure what works and what doesn't.

    Regards,
    -Chris
