
Lab 4.1: How to restore an etcd snapshot in a production environment?

I've been going over Lab 4.1 and I understand and can successfully back up the etcd snapshot. However, there isn't any clear information on how to actually restore. I've followed the reference to https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/#restoring-an-etcd-cluster and it is a bit light on the steps to actually do it as well. I also went to https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/recovery.md#restoring-a-cluster, which is better, but it seems to be a generic page for etcd and not etcd running on Kubernetes. I've tried what I found on that page (modified for my cluster):

$ ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --name m1 \
  --initial-cluster m1=http://host1:2380,m2=http://host2:2380,m3=http://host3:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls http://host1:2380
$ ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --name m2 \
  --initial-cluster m1=http://host1:2380,m2=http://host2:2380,m3=http://host3:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls http://host2:2380
$ ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --name m3 \
  --initial-cluster m1=http://host1:2380,m2=http://host2:2380,m3=http://host3:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls http://host3:2380
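
(For completeness, the backup side that does work for me is basically the snapshot save from the lab, roughly like this; the cert paths assume a kubeadm layout:)

$ ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key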

The etcd recovery doc then says to start etcd with the new data directories, which I can't figure out how to do, so I assume I need to edit /etc/kubernetes/manifests/etcd.yaml, but I always end up crashing the etcd pods.

Then I found this somewhere on the web for Kubernetes and tried it. It also mentions that you would want to edit /etc/kubernetes/manifests/etcd.yaml to utilize the new data-dir and cluster token, but again, I end up crashing the etcd pods:

etcdctl snapshot restore etcd.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--name=controlplane \
--data-dir /var/lib/etcd-from-backup \
--initial-cluster=controlplane=https://127.0.0.1:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://127.0.0.1:2380
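
From what I can piece together, "utilize the new data-dir" comes down to a couple of fields in the static pod manifest. My best guess at the edit is below; the /var/lib/etcd-from-backup path just matches the restore command above, and the rest is assumption on my part, not something from the lab:

# Relevant bits of /etc/kubernetes/manifests/etcd.yaml (kubeadm layout assumed).
# The hostPath backing the etcd-data volume has to point at the restored
# directory; the in-container --data-dir and mountPath can stay at /var/lib/etcd:
#
#   volumes:
#   - hostPath:
#       path: /var/lib/etcd-from-backup    # <-- was /var/lib/etcd
#       type: DirectoryOrCreate
#     name: etcd-data
#
# Quick way to eyeball the current values before editing:
grep -n -e 'data-dir' -e 'hostPath' -e 'path:' /etc/kubernetes/manifests/etcd.yaml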

I've got etcd running as a stacked cluster of 3 machines.

So, here are my questions:
1. What are the steps to do an actual restore for an etcd stacked cluster?
2. What is this --initial-cluster-token and why is it needed? I don't see it in the default etcd.yaml in the manifests directory.
3. When doing a restore of etcd, shouldn't the database be shut down first? If yes, how do I shut it down (I must have missed that part in the labs somewhere, so thanks in advance)? If a shutdown is not required, then how does this work in an etcd stacked cluster: if I restore on one node, what do the other nodes do, and wouldn't there be conflicts?
4. Question 3 brings up another one: shouldn't the restore shown above for Kubernetes be repeated on each node?

Thank you,
Jose

Comments

  • Hi @snorkelbuckle,

    What was the result of starting etcd with the new data directories, as instructed there? Did the etcd pods crash after running the etcd --name m1 ... commands?

    Regards,
    -Chris

  • Yes, I crashed each of them... well, their status was CrashLoopBackOff or something like that. It wasn't so bad when I had only crashed one of them, but then I crashed all of them at the same time on one attempt and everything was dead (if it can be broken, I'm the guy who can do it; all the developers pass their "QA approved" code to me for the smoke test). I thought maybe I did something wrong or typoed something, so I've tried it about 4 times, crashing each time. I've seen other sites give similar information, but it seems something is missing in the instructions, or they all copied the instructions from the same place. Doesn't the Kubernetes doc site have an actual step-by-step guide to restoring the database for an etcd stacked cluster? I think the instructions probably work on a basic cluster with non-stacked etcd; I haven't tried that yet. Most of the sites make a hand-waving reference to "...and in a production environment, you would want to shut down etcd before doing the restore and modify etcd.yaml or other configurations to match your restored configuration." Basically, they pass off the meat of what should be done to the ether, forgetting that us newbies need a tad bit more than that.

    Any help you can contribute would be much appreciated. In the meantime, I'll go back to trial and error... something will eventually stick, hopefully.

  • If you've come across a solution, I'd love to hear it, as I've been banging my head against a wall trying to get those steps to work too.

  • A bit late to the party here, but I finally got a restore to work with Docker as the container runtime.

    My steps:
    1. Completely shut down the control plane (while kubelet is still running) by moving all static pod manifests to a safe place away from /etc/kubernetes/manifests; see the command sketch after these steps.
    2. Once they're all gone, stop kubelet: sudo systemctl stop kubelet.service
    3. Make sure your snapshot is in a safe place away from the data dir.
    4. Copy the snapshot to e.g. /tmp/backup
    5. Delete the data dir (likely /var/lib/etcd), i.e. not just the member sub-directory.
    6. Restore the snapshot by running the etcd image standalone. Note all the volume mounts: data, pki and backup:

    docker run --rm -t \
    -v /var/lib:/var/lib \
    -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
    -v /tmp/backup:/tmp/backup \
    k8s.gcr.io/etcd etcdctl \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --data-dir=/var/lib/etcd \
    --name=dmast \
    --initial-advertise-peer-urls=https://192.168.122.31:2380 \
    --initial-cluster=dmast=https://192.168.122.31:2380 \
    snapshot restore /tmp/backup/etcd-snapshot.db
    

    The --initial-... and --name arguments should come straight from your etcd.yaml static pod manifest.
    7. Check that you have a new /var/lib/etcd/member dir
    8. Restore your static pod manifests back to /etc/kubernetes/manifests
    9. Start kubelet again
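
    For reference, steps 1-2 and 8-9 boil down to something like the following; the stash directory is just a name I picked:

    sudo mkdir -p /root/manifests-stash
    sudo mv /etc/kubernetes/manifests/*.yaml /root/manifests-stash/   # kubelet tears down the control plane pods
    sudo systemctl stop kubelet.service

    # ... run the snapshot restore as shown above, then bring everything back:
    sudo mv /root/manifests-stash/*.yaml /etc/kubernetes/manifests/
    sudo systemctl start kubelet.service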

    I tried running the standalone restore container using crictl on a cluster with the CRI-O runtime but wasn't able to make it work. If anyone has had success with that, it'd be great to hear the details.

    Hope it helps,
    /Henrik

  • snorkelbuckle (edited February 2021)

    Hi Henrik,

    Still trying to get the restore to work. I took some time off during December and January, and now I'm back to trying to figure this out and not having any luck. I've reverted to trying to do this on a non-HA etcd. The best I've managed is to get etcd restored without crashing the system, but no pods come back, not even the system pods like calico, coredns, etc. I tried to edit the etcd.yaml according to the very little documentation I can find, but I'm just not getting anywhere. At least I'm able to go back to a good instance of the cluster by simply restoring the etcd.yaml. It just seems like so much can go wrong here; I don't know if the command-line options I specify for the restore are wrong, or if the etcd.yaml is wrong after I restore, or maybe both!
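
    To see what actually made it into the restored store, I've been poking at it with something like this (not sure these are the right checks; on a kubeadm cluster the etcd pod is usually named etcd-<node name>, so etcd-$(hostname) works for me):

    kubectl get pods -A    # is the API server answering, and which system pods exist?
    kubectl -n kube-system exec -it etcd-$(hostname) -- sh -c \
      "ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
       --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
       get /registry/namespaces --prefix --keys-only"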

    Thanks for this info; I was not able to get very far with this method yet. I get blocked by an error message from Docker, and I'm still learning, so I'm not sure how to resolve it. I just get this:

    Unable to find image 'k8s.gcr.io/etcd:latest' locally
    docker: Error response from daemon: manifest for k8s.gcr.io/etcd:latest not found: manifest unknown: Failed to fetch "latest" from request "/v2/etcd/manifests/latest".

    If you have any idea how to resolve this, please share. If I can at least get something to work, maybe I can work out the rest for a stacked etcd cluster.
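
    One thing I plan to try next is pinning the exact image tag my cluster already runs instead of letting Docker look for latest, roughly like this (the 3.4.13-0 tag is only an example; use whatever your own manifest shows):

    grep 'image:' /etc/kubernetes/manifests/etcd.yaml
    #     image: k8s.gcr.io/etcd:3.4.13-0
    sudo docker pull k8s.gcr.io/etcd:3.4.13-0
    # then use that same tag in the docker run command from the post above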

    I feel that doing it the way you suggest (assuming I can get it to work) might not be the cleanest or "preferred" way. I'd like to eventually figure out how to do this through kubectl as much as possible, without resorting to backend methods that might not be portable.

  • rosaiah (edited October 2021)

    Hi,
    While the command below was running, I hit Ctrl+C. After that, all of the pods were deleted, including the static pods. After I restarted the host, the static pods were created again. I have not tried restarting the kubelet.

    k8s-master@master:~$ kubectl -n kube-system exec -it etcd-master -- sh -c "ETCDCTL_API=3 ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key etcdctl --endpoints 10.0.0.2:2379 snapshot restore $HOME/backup/snapshost.db-31-10-21"
    ^C
    k8s-master@master:~$
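
    These are the checks I have been running to see the current state (not sure whether they are the right ones):

    kubectl get pods -A                # does the API server answer, are the static pods back?
    sudo systemctl status kubelet
    ls /etc/kubernetes/manifests       # the static pod manifests should still be here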
    

    Please help me figure out how to get the cluster back.
