Lab 4.1: How to restore an etcd snapshot in a production environment?
I've been going over lab 4.1 and I understand how to, and can successfully, back up the etcd snapshot. However, there isn't any clear information on how to actually restore. I followed the reference to https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/#restoring-an-etcd-cluster and it is a bit light on the actual steps as well. I then went to https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/recovery.md#restoring-a-cluster, which is better, but it seems to be a generic page for etcd rather than for etcd running on Kubernetes. I tried what I found on that page (modified for my cluster):
$ ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --name m1 \
  --initial-cluster m1=http://host1:2380,m2=http://host2:2380,m3=http://host3:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls http://host1:2380
$ ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --name m2 \
  --initial-cluster m1=http://host1:2380,m2=http://host2:2380,m3=http://host3:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls http://host2:2380
$ ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --name m3 \
  --initial-cluster m1=http://host1:2380,m2=http://host2:2380,m3=http://host3:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls http://host3:2380
And then it says to start etcd with the new data directories, which I can't figure out how to do, so I assume I need to edit /etc/kubernetes/manifests/etcd.yaml, but I always end up crashing the etcd pods.
Then I found this somewhere on the web for Kubernetes and tried it. It also says you would want to edit /etc/kubernetes/manifests/etcd.yaml to use the new data-dir and cluster token, but again, I end up crashing the etcd pods:
etcdctl snapshot restore etcd.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --name=controlplane \
  --data-dir /var/lib/etcd-from-backup \
  --initial-cluster=controlplane=https://127.0.0.1:2380 \
  --initial-cluster-token=etcd-cluster-1 \
  --initial-advertise-peer-urls=https://127.0.0.1:2380
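For reference, since both attempts above hinge on pointing the static pod manifest at the new data directory, here is a minimal sketch of the two places in a default kubeadm /etc/kubernetes/manifests/etcd.yaml that normally have to change after a restore like the one above; this is an assumption based on the stock kubeadm layout rather than something from the lab, so take the exact paths from your own manifest:

# Locate the data-dir flag and the hostPath volume in the manifest
sudo grep -n -e 'data-dir' -e 'path: /var/lib/etcd' /etc/kubernetes/manifests/etcd.yaml
# Typical edits when the restore target is /var/lib/etcd-from-backup (as in the command above):
#   --data-dir=/var/lib/etcd      ->  --data-dir=/var/lib/etcd-from-backup
#   hostPath volume "etcd-data":
#     path: /var/lib/etcd         ->  path: /var/lib/etcd-from-backup
# Saving the edited manifest makes kubelet restart the etcd static pod with the new paths.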
I've got etcd running as a stacked cluster of 3 machines.
So, here are my questions:
1. What are the steps to do an actual restore for an etcd **stacked** cluster?
2. What is this --initial-cluster-token and why is it needed? I don't see it in the default etcd.yaml in manifests.
3. If doing a restore of etcd, shouldn't the DB be shut down first? If yes, how do I shut it down? (I must have missed that part in the labs somewhere, so thanks in advance; a sketch of one way to stop a static etcd pod follows these questions.) If a shutdown is not required, then how does this work in an etcd stacked cluster? If I restore on one node, what do the other nodes do? Wouldn't there be conflicts?
4. Question 3 brings up this: shouldn't the restore shown above for Kubernetes be repeated on each node?
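On question 3, a minimal sketch of the usual way a stacked (kubeadm) etcd instance is stopped and started, assuming etcd runs as a static pod whose manifest lives in /etc/kubernetes/manifests; the backup path /root/etcd.yaml.bak is just an example:

# Stop the etcd static pod: kubelet deletes the pod as soon as its manifest
# leaves the static pod directory
sudo mv /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.bak

# ...perform the snapshot restore while etcd is down...

# Start it again: putting the manifest back makes kubelet recreate the pod
sudo mv /root/etcd.yaml.bak /etc/kubernetes/manifests/etcd.yaml

On a three-node stacked cluster each control-plane node has its own etcd.yaml, so the same manipulation would apply per node.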
Thank you,
Jose
Comments
Hi @snorkelbuckle,
What was the result of starting etcd with the new data directories, as instructed here? Did the etcd pods crash after running the etcd --name m1 ... commands?
Regards,
-Chris
Yes, crashed each of them... well, the status was CrashLoop-something-or-other. It wasn't so bad when I had crashed only one of them, but then on one attempt I crashed all of them at the same time and everything was dead (if it can be broken, I'm the guy who can do it; all the developers pass their "QA"-approved code to me for the smoke test). I thought maybe I did something wrong or typoed something, so I've tried it about 4 times, crashing each time. I've seen other sites give similar information, but it seems something is missing in the instructions, or they all copied the instructions from the same place. Doesn't the Kubernetes doc site have an actual step-by-step guide to restoring the database for an etcd stacked cluster? I think the instructions probably work on a basic cluster with non-stacked etcd; I haven't tried that yet. Most of the sites make a **hand-waving reference** to "...and in a production environment, you would want to shut down etcd before doing the restore and modify etcd.yaml or other configurations to match your restored configuration." Basically, they pass off the meat of what should be done to the ether, forgetting that us newbies need a tad bit more than that.
Any help you can contribute would be much appreciated. In the meantime, I'll go back to trial and error... something will eventually stick, hopefully.
If you've come across a solution, I'd love to hear it, as I've been banging my head against a wall trying to get those steps to work too.
A bit late to the party here, but I finally got a restore to work with Docker as the container runtime.
My steps:
1. Completely shut down the control plane (with kubelet still running) by moving all static pod manifests to a safe place away from /etc/kubernetes/manifests.
2. Once they're all gone, stop kubelet.service: sudo systemctl stop kubelet.service
3. Make sure your snapshot is in a safe place away from the data dir.
4. Copy the snapshot to e.g. /tmp/backup.
5. Delete the data dir, likely /var/lib/etcd, i.e. not just the member sub-directory.
6. Restore the snapshot by running the etcd image standalone. Note all the volume mounts: data, pki and backup:
docker run --rm -t \
  -v /var/lib:/var/lib \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  -v /tmp/backup:/tmp/backup \
  k8s.gcr.io/etcd etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --data-dir=/var/lib/etcd \
  --name=dmast \
  --initial-advertise-peer-urls=https://192.168.122.31:2380 \
  --initial-cluster=dmast=https://192.168.122.31:2380 \
  snapshot restore /tmp/backup/etcd-snapshot.db
The --initial-... and --name arguments should come straight from your etcd.yaml static pod manifest.
7. Check that you have a new /var/lib/etcd/member dir.
8. Restore your static pod manifests back to /etc/kubernetes/manifests.
9. Start kubelet again.
I tried running the standalone restoration container using crictl on a cluster with the CRI-O runtime but wasn't able to make it work. If anyone has had success with that, it'd be great to get some details.
Hope it helps,
/Henrik
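For anyone skimming the thread, here is Henrik's sequence consolidated into one hedged shell sketch. It assumes a single-member stacked control plane, Docker as the runtime, a snapshot copied to /tmp/backup/etcd-snapshot.db, and Henrik's example member name and peer URL; the member name, URLs, and etcd image tag should all be taken from your own /etc/kubernetes/manifests/etcd.yaml.

# Steps 1-2: stop the control plane: move the static pod manifests away, then stop kubelet
sudo mkdir -p /root/manifests-backup
sudo mv /etc/kubernetes/manifests/*.yaml /root/manifests-backup/
sudo systemctl stop kubelet.service

# Steps 3-5: keep the snapshot outside the data dir, then remove the old data dir
sudo mkdir -p /tmp/backup
sudo cp /path/to/etcd-snapshot.db /tmp/backup/   # adjust to wherever your snapshot actually lives
sudo rm -rf /var/lib/etcd

# Step 6: restore with a standalone etcd container
# (example tag shown; pin it to the etcd version listed in your etcd.yaml)
sudo docker run --rm -t \
  -v /var/lib:/var/lib \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  -v /tmp/backup:/tmp/backup \
  k8s.gcr.io/etcd:3.4.13-0 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --data-dir=/var/lib/etcd \
  --name=dmast \
  --initial-advertise-peer-urls=https://192.168.122.31:2380 \
  --initial-cluster=dmast=https://192.168.122.31:2380 \
  snapshot restore /tmp/backup/etcd-snapshot.db

# Step 7: confirm the restore produced a fresh member directory
sudo ls /var/lib/etcd/member

# Steps 8-9: put the manifests back and start kubelet; it recreates the static pods
sudo mv /root/manifests-backup/*.yaml /etc/kubernetes/manifests/
sudo systemctl start kubelet.service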
Hi Henrik,
Still trying to get the restore to work. I took some time off during December and January and now I'm back to trying to figure this out, and not having any luck. I've reverted to trying to do this on a non-HA etcd. The best I've managed to do is get etcd restored without crashing the system, but no pods come back, not even the system pods like calico, coredns, etc. I tried to edit the etcd.yaml according to what little documentation I can find, but I'm just not getting anywhere sane. At least I'm able to go back to a good instance of the cluster by simply restoring the etcd.yaml. It just seems so much can be done wrong here; I don't know if the command-line options I specify for the restore are wrong, or if the etcd.yaml is wrong after I restore, or maybe both!
Thanks for this info. I was not able to get far with this method yet; I get blocked by an error message from Docker, and I'm still learning so I'm not sure how to resolve it. I just get this message: **Unable to find image 'k8s.gcr.io/etcd:latest' locally
docker: Error response from daemon: manifest for k8s.gcr.io/etcd:latest not found: manifest unknown: Failed to fetch "latest" from request "/v2/etcd/manifests/latest".** If you have any idea how to resolve this, it would help. If I can at least get something to work, maybe I can work out the rest for a stacked etcd cluster.
I feel that doing it the way you suggest (assuming I can get it to work) might not be the cleanest or the "preferred" way. I'd like to figure out eventually how to do this through kubectl as much as possible, without resorting to backend methods that might not be portable.
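On the image error above: judging by that message, k8s.gcr.io/etcd does not publish a latest tag, so the docker run command needs an explicit tag. A small sketch (assuming a default kubeadm setup) of how to find the tag your cluster already uses:

# Read the exact etcd image (with tag) from the static pod manifest
sudo grep 'image:' /etc/kubernetes/manifests/etcd.yaml
# Or, while the cluster is still reachable:
kubectl -n kube-system get pods -l component=etcd \
  -o jsonpath='{.items[0].spec.containers[0].image}'
# Then use that full reference (for example k8s.gcr.io/etcd:3.4.13-0) in place of
# the untagged k8s.gcr.io/etcd in the docker run command.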
Hi,
While the command below was running, I hit Ctrl+C. After that, all pods were deleted, including the static pods.
After I restarted the host, the static pods were created again. I have not tried restarting kubelet.
k8s-master@master:~$ kubectl -n kube-system exec -it etcd-master -- sh -c "ETCDCTL_API=3 ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key etcdctl --endpoints 10.0.0.2:2379 snapshot restore $HOME/backup/snapshost.db-31-10-21"
^C
k8s-master@master:~$
Please help me figure out how to get the cluster back.
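A hedged note on that situation: as written, etcdctl snapshot restore was not given a --data-dir, so it would have been writing into a default.etcd directory in the working directory of the exec'd shell rather than into the live /var/lib/etcd, which is likely why interrupting it did not damage the running data. Assuming a kubeadm control plane, a few read-only checks to see what state the node is actually in:

# Are the static pod manifests present?
ls /etc/kubernetes/manifests/

# Is kubelet active?
sudo systemctl status kubelet.service

# Does the etcd data directory still have its member data?
sudo ls /var/lib/etcd/member

# Once etcd and the API server are back, the kube-system pods should be listed again
kubectl get pods -n kube-system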