Background
Kubernetes uses etcd as its unified backend store. In an enterprise production environment, etcd can be deployed for high availability (for example, across three master nodes), but situations can still arise that leave the etcd cluster unusable, such as accidental modification of etcd data or machine failure. It is therefore still necessary to back up and restore (BUR) etcd so that the cluster can be rolled back to an earlier state when something goes wrong. This post presents an automated backup scheme and a restore procedure for etcd running as an internal etcd cluster:
Internal etcd cluster: It means you’re running your etcd cluster in the form of containers/pods inside the Kubernetes cluster and it is the responsibility of Kubernetes to manage those pods.
Verification environment
Three master nodes, with etcd deployed on each of the three Kubernetes master nodes:
- 192.168.10.1(master1)
- 192.168.10.2(master2)
- 192.168.10.3(master3)
Backup
Use a Kubernetes CronJob to take a snapshot every hour:
cat <<EOF | kubectl create -f -
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  # activeDeadlineSeconds: 100
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: etcd-backup
            # Built from the same etcd version as /etc/kubernetes/manifests/etcd.yaml
            image: etcd-backup:3.2.24
            env:
            - name: ETCDCTL_API
              value: "3"
            args:
            - /bin/sh
            - /root/etcd_backup.sh
            volumeMounts:
            - mountPath: /etc/kubernetes/pki/etcd
              name: etcd-certs
              readOnly: true
            - mountPath: /backup
              name: backup
          restartPolicy: OnFailure
          hostNetwork: true
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
              type: DirectoryOrCreate
          - name: backup
            hostPath:
              path: /data/etcd_backup
              type: DirectoryOrCreate
EOF
The etcd-backup:3.2.24 image is built from the following Dockerfile:
FROM k8s.gcr.io/etcd:3.2.24
ADD etcd_backup.sh /root/etcd_backup.sh
The etcd_backup.sh script is as follows:
#!/bin/sh
# Take a v3 snapshot of the local etcd member, named with the current timestamp.
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d_%H:%M_%Z).db
Build the image:
docker build -t etcd-backup:3.2.24 .
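Snapshots accumulate under /data/etcd_backup on whichever node the Job is scheduled to, so some retention policy is advisable. A minimal sketch, assuming a 7-day window (the window is an assumption, not part of the original setup), appended to etcd_backup.sh:
# Hypothetical retention policy: drop snapshots older than 7 days.
find /backup -name 'etcd-snapshot-*.db' -mtime +7 -delete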
A cleaner approach would be to put the etcdctl snapshot save command directly into the CronJob, avoiding the custom image entirely. However, because of the /backup/etcd-snapshot-$(date +%Y-%m-%d_%H:%M_%Z).db argument, the timestamp is expanded at the moment the manifest is applied rather than each time the CronJob runs, so every snapshot ends up stamped with the kubectl apply time. I have not found a solution yet; better suggestions are welcome.
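One possible fix, untested here (etcd-backup-inline is a hypothetical name): quote the heredoc delimiter (<<'EOF') so the local shell leaves $(date ...) alone, and run the command through /bin/sh -c so the substitution happens inside the container on every run:
cat <<'EOF' | kubectl create -f -
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etcd-backup-inline
  namespace: kube-system
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: etcd-backup
            # Stock etcd image; no custom image required.
            image: k8s.gcr.io/etcd:3.2.24
            env:
            - name: ETCDCTL_API
              value: "3"
            # /bin/sh -c expands $(date ...) at run time inside the container.
            command: ["/bin/sh", "-c"]
            args:
            - >-
              etcdctl --endpoints=https://127.0.0.1:2379
              --cacert=/etc/kubernetes/pki/etcd/ca.crt
              --cert=/etc/kubernetes/pki/etcd/server.crt
              --key=/etc/kubernetes/pki/etcd/server.key
              snapshot save
              /backup/etcd-snapshot-$(date +%Y-%m-%d_%H:%M_%Z).db
            volumeMounts:
            - mountPath: /etc/kubernetes/pki/etcd
              name: etcd-certs
              readOnly: true
            - mountPath: /backup
              name: backup
          restartPolicy: OnFailure
          hostNetwork: true
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
              type: DirectoryOrCreate
          - name: backup
            hostPath:
              path: /data/etcd_backup
              type: DirectoryOrCreate
EOF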
Restore
1. Stop kube-apiserver and etcd
Run the following on each of the three machines:
mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak
docker ps | grep -E "apiserver|etcd"  # confirm neither is still running
2. Restore the etcd data
Copy the etcd snapshot db from the first master to the second and third master nodes:
cp /data/etcd_backup/etcd-snapshot-2019-10-17_07:13_UTC.db /tmp/
rsync -avz /data/etcd_backup/etcd-snapshot-2019-10-17_07:13_UTC.db 192.168.10.2:/tmp/
rsync -avz /data/etcd_backup/etcd-snapshot-2019-10-17_07:13_UTC.db 192.168.10.3:/tmp/
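Before restoring, it is worth verifying each copy of the snapshot. A quick check using etcdctl snapshot status from the same etcd image (run on each node):
docker run --rm \
-v '/tmp:/backup' \
--env ETCDCTL_API=3 \
'k8s.gcr.io/etcd:3.2.24' \
/bin/sh -c "etcdctl snapshot status \
/backup/etcd-snapshot-2019-10-17_07:13_UTC.db --write-out=table"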
Then run the etcd restore on each of the three master nodes:
# Restore etcd on master-1
mv /var/lib/etcd /var/lib/etcd.bak  # back up the existing etcd data
docker run --rm \
-v '/tmp:/backup' \
-v '/var/lib:/var/lib' \
-v '/etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd' \
--env ETCDCTL_API=3 \
'k8s.gcr.io/etcd:3.2.24' \
/bin/sh -c "etcdctl snapshot restore \
/backup/etcd-snapshot-2019-10-17_07:13_UTC.db \
--endpoints=192.168.10.1:2379 \
--name=192.168.10.1 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--initial-advertise-peer-urls=https://192.168.10.1:2380 \
--initial-cluster=192.168.10.1=https://192.168.10.1:2380,192.168.10.2=https://192.168.10.2:2380,192.168.10.3=https://192.168.10.3:2380 \
--data-dir=/var/lib/etcd \
--skip-hash-check=true"
# Restore etcd on master-2
mv /var/lib/etcd /var/lib/etcd.bak  # back up the existing etcd data
docker run --rm \
-v '/tmp:/backup' \
-v '/var/lib:/var/lib' \
-v '/etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd' \
--env ETCDCTL_API=3 \
'k8s.gcr.io/etcd:3.2.24' \
/bin/sh -c "etcdctl snapshot restore \
/backup/etcd-snapshot-2019-10-17_07:13_UTC.db \
--endpoints=192.168.10.2:2379 \
--name=192.168.10.2 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--initial-advertise-peer-urls=https://192.168.10.2:2380 \
--initial-cluster=192.168.10.1=https://192.168.10.1:2380,192.168.10.2=https://192.168.10.2:2380,192.168.10.3=https://192.168.10.3:2380 \
--data-dir=/var/lib/etcd \
--skip-hash-check=true"
# Restore etcd on master-3
mv /var/lib/etcd /var/lib/etcd.bak  # back up the existing etcd data
docker run --rm \
-v '/tmp:/backup' \
-v '/var/lib:/var/lib' \
-v '/etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd' \
--env ETCDCTL_API=3 \
'k8s.gcr.io/etcd:3.2.24' \
/bin/sh -c "etcdctl snapshot restore \
/backup/etcd-snapshot-2019-10-17_07:13_UTC.db \
--endpoints=192.168.10.3:2379 \
--name=192.168.10.3 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--initial-advertise-peer-urls=https://192.168.10.3:2380 \
--initial-cluster=192.168.10.1=https://192.168.10.1:2380,192.168.10.2=https://192.168.10.2:2380,192.168.10.3=https://192.168.10.3:2380 \
--data-dir=/var/lib/etcd \
--skip-hash-check=true"
3. Bring kube-apiserver and etcd back
Run the following on each of the three machines and wait for the apiserver and etcd to come back up:
mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests
docker ps | grep -E "apiserver|etcd"  # confirm both are running
4. Check that the cluster is healthy
kubectl get pods -n kube-system
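A couple of additional sanity checks can complement the pod listing (plain kubectl commands, nothing specific to this setup):
kubectl get nodes              # all three masters should be Ready
kubectl get componentstatuses  # etcd should report Healthy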
5. Inspect the etcd cluster and verify the data
docker run --rm \
-v '/etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd' \
--env ETCDCTL_API=3 \
'k8s.gcr.io/etcd:3.2.24' \
/bin/sh -c "etcdctl member list \
--endpoints=192.168.10.1:2379 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt"
docker run --rm \
-v '/etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd' \
--env ETCDCTL_API=3 \
'k8s.gcr.io/etcd:3.2.24' \
/bin/sh -c "etcdctl get --prefix --keys-only / \
--endpoints=192.168.10.1:2379 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt"
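Optionally, the health of all three members can be checked in one shot; a sketch following the same pattern (etcdctl endpoint health accepts a comma-separated endpoint list):
docker run --rm \
-v '/etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd' \
--env ETCDCTL_API=3 \
'k8s.gcr.io/etcd:3.2.24' \
/bin/sh -c "etcdctl endpoint health \
--endpoints=192.168.10.1:2379,192.168.10.2:2379,192.168.10.3:2379 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt"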
The etcd restore is still a manual procedure at this point; automating it with ansible is a possible follow-up. This concludes the etcd backup and restore workflow.
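As a starting point for that follow-up, a minimal ansible skeleton (the masters inventory group and the per-node restore.sh helper are assumptions, not part of the current setup):
- hosts: masters
  become: true
  tasks:
  - name: Park the static pod manifests to stop kube-apiserver and etcd
    command: mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak
  - name: Copy the snapshot to each master
    copy:
      src: /data/etcd_backup/etcd-snapshot-2019-10-17_07:13_UTC.db
      dest: /tmp/
  - name: Restore etcd from the snapshot (restore.sh wraps the docker run above)
    script: restore.sh
  - name: Bring kube-apiserver and etcd back
    command: mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests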