Etcd restore

pnts · July 2022

Hi,

This question is related to etcd backup and restore.

I've set up a Kubernetes cluster with kubeadm using the stacked etcd topology: two control-plane nodes and two worker nodes. The static pod manifests are in /etc/kubernetes/manifests and the control-plane services run as pods. Only kubelet runs as a systemd service.

I've created a snapshot of the etcd database on each of the two control-plane nodes.

I simulate a data failure:
1. Stop API servers by removing the manifest files.
2. Delete /var/lib/etcd/ on both control-plane nodes.
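
Step 1 can be made reversible by moving the manifests aside instead of deleting them; kubelet stops a static pod once its manifest disappears from the manifest directory. A minimal sketch, assuming the kubeadm default manifest path mentioned above. PARK_DIR and both function names are hypothetical, and this version parks every manifest, so etcd is stopped along with the API server:

```shell
# Sketch only: PARK_DIR and the function names are hypothetical.
MANIFEST_DIR="/etc/kubernetes/manifests"   # kubeadm static pod manifests
PARK_DIR="/root/manifests-parked"          # hypothetical holding directory

# Move all static pod manifests aside so kubelet stops the pods.
stop_control_plane() {
  mkdir -p "$PARK_DIR"
  mv "$MANIFEST_DIR"/*.yaml "$PARK_DIR"/
}

# Move the manifests back so kubelet recreates the pods.
start_control_plane() {
  mv "$PARK_DIR"/*.yaml "$MANIFEST_DIR"/
}
```

Run stop_control_plane on each control-plane node before touching /var/lib/etcd, and start_control_plane once the data directory is back in place.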

Now I want to do a complete restore of the etcd database.

I'm doing it using etcdctl on both control-plane nodes.
It seems to work and I get the output:

2022-07-01 09:05:14.228623 I | mvcc: restore compact to 42333
2022-07-01 09:05:14.233684 I | etcdserver/membership: added member a874c87fd42044f [https://127.0.0.1:2380] to cluster c9be114fc2da2776

kubectl get nodes shows all nodes as Ready. However, as soon as I make a single change, such as scaling a deployment, something goes badly wrong: the API servers suddenly disagree on node health, and the replica change is not applied.

Am I doing this wrong?

pnts · July 2022

@oleksazhel Thank you for your willingness to help.

I think my error was taking separate etcd snapshots, one for each node, and then restoring each node from its own snapshot.

I read in the etcd documentation that "all that is needed is a single snapshot 'db' file" and that "all members should restore using the same snapshot."
https://etcd.io/docs/v3.5/op-guide/recovery/

Doing it like this made it work:

1. Make one snapshot from one control-plane node.
2. Stop etcd, kube-apiserver, kube-scheduler and kube-controller-manager on both nodes.
3. Delete the etcd data-dir on both nodes to simulate data loss.
4. Restore the etcd data-dir on the first node using etcdctl snapshot restore.
5. Copy the snapshot to the second node and restore it there using etcdctl snapshot restore.
6. Start etcd and verify with etcdctl endpoint health, etcdctl endpoint status and etcdctl member list.
7. Start kube-apiserver, kube-controller-manager and kube-scheduler.
8. Restart kubelet.
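
The two restore steps above can be sketched as follows. This is not the exact command line used in this thread. It follows the example in the etcd recovery guide linked above, which passes --name, --initial-cluster and --initial-advertise-peer-urls so that every member is restored with the same cluster membership. The member names, IP addresses, snapshot path and function names are hypothetical placeholders; the real values should match the flags in each node's etcd static pod manifest.

```shell
# Sketch only. Member names, IPs and the snapshot path are hypothetical
# placeholders; take the real values from each node's etcd manifest.
SNAPSHOT="/tmp/etcd-snapshot.db"   # the SAME snapshot file, copied to every node
DATA_DIR="/var/lib/etcd"           # data dir that was deleted in the steps above
CP1_NAME="cp1"; CP1_PEER="https://10.0.0.11:2380"
CP2_NAME="cp2"; CP2_PEER="https://10.0.0.12:2380"

# Build the value for --initial-cluster: name=peer-url pairs, comma separated.
initial_cluster() {
  printf '%s=%s,%s=%s\n' "$CP1_NAME" "$CP1_PEER" "$CP2_NAME" "$CP2_PEER"
}

# Restore the shared snapshot into DATA_DIR for one member.
# $1 = this node's member name, $2 = this node's peer URL.
restore_member() {
  ETCDCTL_API=3 etcdctl snapshot restore "$SNAPSHOT" \
    --name "$1" \
    --initial-cluster "$(initial_cluster)" \
    --initial-advertise-peer-urls "$2" \
    --data-dir "$DATA_DIR"
}
```

On the first node this would be called as restore_member "$CP1_NAME" "$CP1_PEER", and on the second node as restore_member "$CP2_NAME" "$CP2_PEER", in both cases against the same snapshot file.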

oleksazhel · July 2022

@pnts Could you provide the output of

ETCDCTL_API=3 etcdctl -w table \
--cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/server.crt \
--key /etc/kubernetes/pki/etcd/server.key \
member list

and

ETCDCTL_API=3 etcdctl -w table \
--endpoints <CP1_IP_ADDRESS>:2379,<CP2_IP_ADDRESS>:2379 \
--cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/server.crt \
--key /etc/kubernetes/pki/etcd/server.key \
endpoint status
