Disaster Recovery
Most teams that think they have a Kubernetes disaster recovery plan actually have a backup plan. These are not the same thing. A backup is a snapshot of state. A disaster recovery plan is a tested procedure that takes a destroyed cluster to a running state within a defined time window, with a defined acceptable data loss window. The difference is the word "tested."
I have spoken with teams that had Velero running for a year and had never done a restore. They had backups. They did not have disaster recovery. When I asked what their RTO was, they said "a few hours." When I asked how they knew, the conversation became uncomfortable.
What You Are Actually Trying to Recover
Understanding what needs to be backed up requires understanding what state Kubernetes has. There are three categories:

- Declarative configuration — the Deployments, Services, ConfigMaps, and other manifests that should already live in Git
- Cluster state in etcd — Secrets, CRDs, RBAC bindings, and every other API object, much of which exists nowhere else
- Persistent data — the contents of PersistentVolumes

Recovering Kubernetes means recovering all three categories. Most teams focus on the first and forget that Secrets in etcd are their authentication credentials, that CRDs must be restored before their custom resources, and that PersistentVolumes hold production data that may not be in a managed database.
etcd Backups: Only for Self-Managed Clusters
If you are on EKS, GKE, or AKS, the cloud provider manages etcd. You do not have direct access to the control plane, and the provider handles etcd HA and backups. Your disaster recovery model for the control plane is: the managed service maintains the state; your recovery path is recreating the cluster from infrastructure code and restoring workload configuration.
If you are running kubeadm-based self-managed clusters, etcd backups are your responsibility.
```shell
# etcd snapshot on a kubeadm cluster
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-*.db --write-out=table
```

For managed clusters, the "backup" equivalent is ensuring your cluster configuration is fully expressed in infrastructure code (Terraform, eksctl YAML, Pulumi). If you can run terraform apply and get a cluster back, your control plane recovery path is defined.
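On the self-managed side, the restore path is the counterpart of the snapshot above. A minimal sketch, assuming a single-member kubeadm cluster with the default static-pod layout (the snapshot filename and data directory are illustrative):

```shell
# Stop the etcd static pod so nothing writes during the restore
# (kubelet removes the pod when the manifest disappears)
mv /etc/kubernetes/manifests/etcd.yaml /tmp/

# Restore the snapshot into a fresh data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20250826-020000.db \
  --data-dir /var/lib/etcd-restored

# Point the etcd manifest's hostPath at the restored directory,
# then bring the static pod back
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
```

Multi-member clusters need the `--name`, `--initial-cluster`, and `--initial-advertise-peer-urls` flags on each member; the single-member case above is the one worth memorizing for a drill.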
Velero: What It Actually Does
Velero backs up Kubernetes object definitions and PersistentVolume data. It exports the YAML representations of your Kubernetes resources to an object store (S3, GCS) and snapshots PersistentVolumes using cloud provider snapshot APIs.
```shell
# Install Velero on EKS with S3 backend
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --sa-annotations eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/velero-role
```

```yaml
# Schedule — daily backup of all namespaces, 30-day retention
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    storageLocation: default
    ttl: 720h  # 30 days
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
      - velero
      - monitoring
    includeClusterResources: true  # Include CRDs, ClusterRoles, etc.
    snapshotVolumes: true
    volumeSnapshotLocations:
      - default
```

The critical configuration is includeClusterResources: true. Without it, CRDs are not backed up, and restoring custom resources before their CRDs will fail because the API group does not exist yet.
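A cheap guard against this setting silently regressing is a preflight check in CI before the Schedule manifest is applied. A sketch (the function name and file path are illustrative, and the grep assumes the key appears literally in the manifest):

```shell
# Fail if a Schedule manifest does not set includeClusterResources: true.
# Naive line-based check; a YAML-aware tool would be stricter.
schedule_includes_cluster_resources() {
  grep -Eq '^[[:space:]]*includeClusterResources:[[:space:]]*true' "$1"
}

# Usage in CI:
#   schedule_includes_cluster_resources schedules/daily-backup.yaml || exit 1
```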
Restore Order Matters
Velero restore is not as simple as velero restore create. The order of restore matters:

- CRDs first, so the API groups for custom resources exist
- Namespaces and RBAC
- Secrets and ConfigMaps, so workloads can start with their configuration
- Deployments and other workloads
- PVCs and their volume data

Velero handles most of this ordering itself, but you need to know it when a restore partially fails and you have to intervene:
```shell
# Restore to a new cluster from a specific backup
velero restore create payments-restore \
  --from-backup daily-backup-20250826 \
  --include-namespaces payments,orders \
  --restore-volumes true \
  --wait

# Check restore status
velero restore describe payments-restore
velero restore logs payments-restore
```

Secrets are a restore complication. Velero backs them up encrypted in S3, but if your workloads use IRSA or external-secrets-operator, many Secrets are dynamically generated and do not need to be restored — the pods will re-fetch them from Secrets Manager on startup. Restoring stale Secrets on top of dynamically managed ones can cause authentication failures. Know which Secrets are static vs. dynamic before your restore procedure.
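One way to make that knowledge executable in a restore runbook is a small classifier. A sketch, assuming the external-secrets-operator convention that generated Secrets carry an ExternalSecret ownerReference (the function name is hypothetical):

```shell
# Decide whether a Secret should be restored from backup.
# Secrets owned by an ExternalSecret are dynamically managed and should be
# skipped — pods re-fetch them from the external store on startup.
should_restore_secret() {
  owner_kind=$1  # e.g. from: kubectl get secret NAME -o jsonpath='{.metadata.ownerReferences[0].kind}'
  case "$owner_kind" in
    ExternalSecret) return 1 ;;  # dynamic: do not restore
    *)              return 0 ;;  # static: safe to restore
  esac
}

# Usage: should_restore_secret "$owner_kind" || echo "skipping dynamic Secret"
```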
The Cluster Rebuild Drill
The only way to validate your RTO is to measure it. Once per quarter — or at minimum twice a year — run a controlled rebuild:
1. Take a Velero backup snapshot manually
2. Provision a new cluster in a test account using your Terraform code
3. Restore the Velero backup into the new cluster
4. Verify that critical workloads are healthy
5. Measure the time from step 2 to step 4
Document the actual time. Compare it to your stated RTO. If your RTO is 4 hours and the drill takes 6 hours, you have a gap. Find the bottleneck — usually it is EBS snapshot restore times or ordering issues with CRD-dependent resources.
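The measurement itself is easy to fudge when done by hand; a small wrapper that times each drill step keeps the numbers honest. A sketch (the step names and log file are illustrative):

```shell
# Time a drill step and record the duration for the post-drill report.
drill_step() {
  name=$1; shift
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  echo "$name: $(( end - start ))s (exit $status)" | tee -a drill-timings.log
  return $status
}

# Usage:
#   drill_step "provision-cluster" terraform -chdir=infra apply -auto-approve
#   drill_step "velero-restore"    velero restore create drill --from-backup "$BACKUP" --wait
```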
```shell
# Velero backup that you can actually test restoring
velero backup create dr-drill-$(date +%Y%m%d) \
  --include-namespaces payments,orders,production \
  --include-cluster-resources=true \
  --snapshot-volumes=true \
  --wait

# After new cluster is up, restore
velero restore create dr-drill-restore \
  --from-backup dr-drill-$(date +%Y%m%d) \
  --restore-volumes=true \
  --wait

# Validation script — check critical deployments are healthy
for ns in payments orders production; do
  echo "=== Namespace: $ns ==="
  kubectl get deployments -n $ns -o wide
  kubectl get pods -n $ns | grep -v Running | grep -v Completed
done
```

Backup Validation: The Test You Are Not Running
Running a backup is not the same as having a recoverable backup. Velero can silently fail to snapshot PersistentVolumes if the snapshot location is misconfigured, or back up Secrets as empty if it lacks the right permissions.
```shell
# Check that backups are completing successfully and snapshots are being taken
velero backup get

# Inspect a specific backup for warnings
velero backup describe daily-backup-20250826 --details

# Verify volume snapshots were created
velero backup describe daily-backup-20250826 | grep -A 5 "Volume Snapshots"
```

Add an alert: if the most recent backup is older than 25 hours, page the on-call. This is a five-minute configuration in your alerting system and catches cases where the backup schedule silently stops working.
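The freshness check itself is a few lines of shell if your alerting system can run scripts. A sketch, assuming you can obtain the latest backup's completion time as a Unix epoch (for example by parsing velero backup get -o json; the function name is hypothetical):

```shell
# Return non-zero if the newest backup completed more than 25 hours ago.
backup_is_fresh() {
  backup_epoch=$1
  max_age=$(( 25 * 3600 ))
  now=$(date +%s)
  [ $(( now - backup_epoch )) -le "$max_age" ]
}

# Usage: backup_is_fresh "$latest_backup_epoch" || page_oncall
```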
Key Takeaways
- A backup is not a disaster recovery plan. DR requires a tested procedure with measured RTO and RPO, not just a scheduled Velero job that nobody has ever restored.
- For managed clusters (EKS, GKE, AKS), the control plane backup strategy is: express all cluster configuration in infrastructure code. The provider manages etcd. Your "restore" is terraform apply followed by a Velero restore.
- Velero backups must include includeClusterResources: true to capture CRDs and ClusterRoles. Restoring custom resources without their CRDs fails.
- The restore order matters: CRDs first, then namespaces and RBAC, then Secrets and ConfigMaps, then Deployments, then PVCs. Velero handles most of this, but dynamic Secrets (from external-secrets-operator) should not be restored from backup.
- Run a cluster rebuild drill quarterly. Measure the actual time to recovery and compare it to your stated RTO. The gap between stated and actual is your disaster risk.
- Alert if the most recent successful backup is older than 25 hours. Silent backup failures are common and discovered at the worst possible time.