
Etcd restore


This question is related to etcd backup and restore.

I've set up a Kubernetes cluster with kubeadm, using the stacked etcd topology: two control-plane nodes and two worker nodes. The static pod manifests are in /etc/kubernetes/manifests and the control-plane services run as pods. Only kubelet runs as a systemd service.

I've created a snapshot of the etcd database on each of the two control-plane nodes.
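
For reference, a minimal sketch of taking a snapshot with etcdctl. It assumes the kubeadm default certificate paths that appear elsewhere in this thread; the local endpoint and the snapshot path are hypothetical placeholders. The `run` helper only prints the command unless DRY_RUN=0 is set, so the command can be reviewed before it is executed.

```shell
#!/bin/sh
# Sketch: save an etcd snapshot from one control-plane node.
# Assumptions: kubeadm default PKI paths; ENDPOINT and SNAPSHOT are placeholders.
set -eu

SNAPSHOT="${SNAPSHOT:-/tmp/etcd-snapshot.db}"
ENDPOINT="${ENDPOINT:-https://127.0.0.1:2379}"
PKI="${PKI:-/etc/kubernetes/pki/etcd}"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

snapshot_save() {
  run env ETCDCTL_API=3 etcdctl \
    --endpoints "$ENDPOINT" \
    --cacert "$PKI/ca.crt" \
    --cert "$PKI/server.crt" \
    --key "$PKI/server.key" \
    snapshot save "$SNAPSHOT"
}

snapshot_save
```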

I simulate a data failure:
1. Stop API servers by removing the manifest files.
2. Delete /var/lib/etcd/ on both control-plane nodes.
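
A rough sketch of that simulation, to be run on each control-plane node. Note the assumptions: it moves all four static pod manifests (kubeadm's default file names) into a hypothetical backup directory instead of deleting only the API server's, so that etcd is not running while its data directory is removed and the manifests can be put back later. The `run` helper prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: stop the static control-plane pods and wipe the etcd data dir.
# Assumptions: kubeadm default manifest names; BACKUP is a hypothetical path.
set -eu

MANIFESTS="${MANIFESTS:-/etc/kubernetes/manifests}"
BACKUP="${BACKUP:-/root/manifests-stopped}"
DATA_DIR="${DATA_DIR:-/var/lib/etcd}"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# kubelet stops a static pod once its manifest leaves the manifests directory.
stop_control_plane() {
  run mkdir -p "$BACKUP"
  for f in kube-apiserver kube-controller-manager kube-scheduler etcd; do
    run mv "$MANIFESTS/$f.yaml" "$BACKUP/"
  done
}

# Destructive: removes the etcd data directory to simulate data loss.
wipe_etcd_data() {
  run rm -rf "$DATA_DIR"
}

stop_control_plane
wipe_etcd_data
```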

Now I want to do a complete restore of the etcd database.

I'm doing it using etcdctl on both control-plane nodes.
It seems to work and I get the output:

2022-07-01 09:05:14.228623 I | mvcc: restore compact to 42333
2022-07-01 09:05:14.233684 I | etcdserver/membership: added member a874c87fd42044f [] to cluster c9be114fc2da2776

kubectl get nodes shows all nodes as Ready. However, when I make a single change, such as scaling a deployment, something goes badly wrong: suddenly the API servers disagree on node health, and the replica change is not applied.

Am I doing this wrong?

Best Answer

  • pnts
    pnts Posts: 33
    Answer ✓

    @oleksazhel Thank you for your willingness to help.

    I think my error was taking a separate etcd snapshot on each node and then restoring each node from its own snapshot.

    I read in the etcd documentation that "all that is needed is a single snapshot “db” file" and "all members should restore using the same snapshot."

    Doing it like this made it work:

    1. Make one snapshot from a single control-plane node.
    2. Stop etcd, kube-apiserver, kube-scheduler and kube-controller-manager on both nodes.
    3. Delete the etcd data-dir on both nodes to simulate data loss.
    4. Restore the etcd data-dir on the first node using etcdctl snapshot restore.
    5. Copy the same snapshot to the second node and restore it there using etcdctl snapshot restore.
    6. Start etcd and verify with etcdctl endpoint health, etcdctl endpoint status and etcdctl member list.
    7. Start kube-apiserver, kube-controller-manager and kube-scheduler.
    8. Restart kubelet.
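
Steps 4 and 5 might look roughly like the sketch below. The member names (cp1, cp2), IP addresses, cluster token and snapshot path are hypothetical; they must match the --name and peer URLs in each node's etcd manifest. The flags are those of etcdctl v3 snapshot restore. The point is that both members restore from the same snapshot file with the same --initial-cluster and --initial-cluster-token. The `run` helper prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: restore every etcd member from the SAME snapshot file.
# Assumptions: member names, IPs, token and snapshot path are placeholders.
set -eu

SNAPSHOT="${SNAPSHOT:-/tmp/etcd-snapshot.db}"
DATA_DIR="${DATA_DIR:-/var/lib/etcd}"
CLUSTER="${CLUSTER:-cp1=https://10.0.0.11:2380,cp2=https://10.0.0.12:2380}"
TOKEN="${TOKEN:-etcd-restore-1}"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: restore_member NAME PEER_URL
restore_member() {
  run env ETCDCTL_API=3 etcdctl snapshot restore "$SNAPSHOT" \
    --name "$1" \
    --initial-cluster "$CLUSTER" \
    --initial-cluster-token "$TOKEN" \
    --initial-advertise-peer-urls "$2" \
    --data-dir "$DATA_DIR"
}

# On the first node:
restore_member cp1 https://10.0.0.11:2380
# On the second node, after copying the same snapshot file there:
restore_member cp2 https://10.0.0.12:2380
```

Each restore_member call is meant to run on its own node; they are shown together here only to make the shared arguments visible.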


  • oleksazhel
    oleksazhel Posts: 57
    edited July 2022

    @pnts Could you provide the output of the following two commands?

    ETCDCTL_API=3 etcdctl -w table \
    --cacert /etc/kubernetes/pki/etcd/ca.crt \
    --cert /etc/kubernetes/pki/etcd/server.crt \
    --key /etc/kubernetes/pki/etcd/server.key \
    member list


    ETCDCTL_API=3 etcdctl -w table \
    --endpoints <CP1_IP_ADDRESS>:2379,<CP2_IP_ADDRESS>:2379 \
    --cacert /etc/kubernetes/pki/etcd/ca.crt \
    --cert /etc/kubernetes/pki/etcd/server.crt \
    --key /etc/kubernetes/pki/etcd/server.key \
    endpoint status

