
Disaster Recovery

Ravinder · 7 min read
Kubernetes · Cloud Native · DevOps · Disaster Recovery · Velero · etcd

Most teams that think they have a Kubernetes disaster recovery plan actually have a backup plan. These are not the same thing. A backup is a snapshot of state. A disaster recovery plan is a tested procedure that takes a destroyed cluster to a running state within a defined time window, with a defined acceptable data loss window. The difference is the word "tested."

I have spoken with teams that had Velero running for a year and had never done a restore. They had backups. They did not have disaster recovery. When I asked what their RTO was, they said "a few hours." When I asked how they knew, the conversation became uncomfortable.

What You Are Actually Trying to Recover

Understanding what needs to be backed up requires understanding what state Kubernetes has:

graph LR
  subgraph Control["Control Plane State"]
    ETCD[etcd — all K8s objects]
    ETCD --> DEP[Deployments]
    ETCD --> SVC[Services]
    ETCD --> SEC[Secrets]
    ETCD --> CM[ConfigMaps]
    ETCD --> RBAC[RBAC rules]
    ETCD --> CRD[CRDs and CRs]
  end
  subgraph Data["Workload State"]
    PV[PersistentVolumes — EBS / GCE PD]
    DB[External Databases — RDS / Cloud SQL]
  end
  subgraph Infra["Infrastructure"]
    TF[Terraform state — cluster itself]
    REG[Container Registry — images]
  end

Recovering Kubernetes means recovering all three categories. Most teams focus on the first and forget that Secrets in etcd are their authentication credentials, that CRDs must be restored before their custom resources, and that PersistentVolumes hold production data that may not be in a managed database.

etcd Backups: Only for Self-Managed Clusters

If you are on EKS, GKE, or AKS, the cloud provider manages etcd. You do not have direct access to the control plane, and the provider handles etcd HA and backups. Your disaster recovery model for the control plane is: the managed service maintains the state; your recovery path is recreating the cluster from infrastructure code and restoring workload configuration.

If you are running kubeadm-based self-managed clusters, etcd backups are your responsibility.

# etcd snapshot on a kubeadm cluster
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
 
# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-*.db --write-out=table
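
Taking snapshots is only half the responsibility; the restore path should be rehearsed too. Below is a minimal sketch of restoring a snapshot on a single kubeadm control plane node — the snapshot filename and restored data directory are illustrative, and the manifest edit is described in comments rather than automated:

# Stop the API server by moving its static pod manifest aside (kubelet tears the pod down)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

# Restore the snapshot into a fresh data directory (paths are illustrative)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20250826-020000.db \
  --data-dir=/var/lib/etcd-restored

# Point the etcd static pod at the restored data directory
# (update the hostPath volume in /etc/kubernetes/manifests/etcd.yaml),
# then bring the API server back.
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/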

For managed clusters, the "backup" equivalent is ensuring your cluster configuration is fully expressed in infrastructure code (Terraform, eksctl YAML, Pulumi). If you can run terraform apply and get a cluster back, your control plane recovery path is defined.
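
What that looks like depends on your tooling. As an illustration only, here is a minimal eksctl cluster definition — the cluster name, region, version, and node group sizing are hypothetical — that makes the control plane reproducible with a single command:

# cluster.yaml — hypothetical eksctl definition; recreate with: eksctl create cluster -f cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: payments-prod      # illustrative name
  region: us-east-1
  version: "1.29"
managedNodeGroups:
  - name: general
    instanceType: m6i.large
    desiredCapacity: 3
    minSize: 3
    maxSize: 6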

Velero: What It Actually Does

Velero backs up Kubernetes object definitions and PersistentVolume data. It exports the YAML representations of your Kubernetes resources to an object store (S3, GCS) and snapshots PersistentVolumes using cloud provider snapshot APIs.

# Install Velero on EKS with S3 backend
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --sa-annotations eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/velero-role

# Schedule — daily backup of all namespaces, 30-day retention
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    storageLocation: default
    ttl: 720h  # 30 days
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
      - velero
      - monitoring
    includeClusterResources: true  # Include CRDs, ClusterRoles, etc.
    snapshotVolumes: true
    volumeSnapshotLocations:
      - default

The critical configuration is includeClusterResources: true. Without it, CRDs are not backed up, and restoring custom resources without their CRDs fails because the API they depend on does not exist yet.
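
A quick way to confirm cluster-scoped resources actually made it into a backup is to look for CRDs in its resource list; the backup name here is illustrative:

# Check that CRDs appear in the backup's resource list
velero backup describe daily-backup-20250826 --details | grep -i customresourcedefinition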

Restore Order Matters

Velero restore is not as simple as velero restore create. The order of restore matters:

sequenceDiagram
  participant OPS as Operator
  participant TF as Terraform
  participant VEL as Velero
  participant CLU as New Cluster
  OPS->>TF: terraform apply (recreate cluster infra)
  TF-->>CLU: Cluster created, addons installed
  OPS->>VEL: velero restore create --from-backup daily-backup
  VEL->>CLU: Restore CRDs first
  VEL->>CLU: Restore Namespaces + RBAC
  VEL->>CLU: Restore Secrets + ConfigMaps
  VEL->>CLU: Restore Deployments + StatefulSets
  VEL->>CLU: Restore PVCs + trigger EBS snapshot restores
  CLU-->>OPS: Verify pods running

# Restore to a new cluster from a specific backup
velero restore create payments-restore \
  --from-backup daily-backup-20250826 \
  --include-namespaces payments,orders \
  --restore-volumes=true \
  --wait
 
# Check restore status
velero restore describe payments-restore
velero restore logs payments-restore

Secrets are a restore complication. Velero backs them up encrypted in S3, but if your workloads use IRSA or external-secrets-operator, many Secrets are dynamically generated and do not need to be restored — the pods will re-fetch them from Secrets Manager on startup. Restoring stale Secrets on top of dynamically managed ones can cause authentication failures. Know which Secrets are static vs. dynamic before your restore procedure.
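
One way to keep dynamically managed Secrets out of the problem entirely is to exclude them from the backup rather than filter them at restore time: Velero skips any resource labeled velero.io/exclude-from-backup=true. A sketch — the selector for ESO-managed Secrets is an assumption about your setup, so adjust it to whatever labels your ExternalSecret templates actually apply:

# Mark dynamically generated Secrets so Velero never backs them up.
# The selector assumes external-secrets-operator labels its target Secrets this way.
kubectl label secrets -n payments \
  -l "reconcile.external-secrets.io/managed=true" \
  velero.io/exclude-from-backup=true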

The Cluster Rebuild Drill

The only way to validate your RTO is to measure it. Once per quarter — or at minimum twice a year — run a controlled rebuild:

  1. Take a Velero backup snapshot manually
  2. Provision a new cluster in a test account using your Terraform code
  3. Restore the Velero backup into the new cluster
  4. Verify that critical workloads are healthy
  5. Measure the time from step 2 to step 4

Document the actual time. Compare it to your stated RTO. If your RTO is 4 hours and the drill takes 6 hours, you have a gap. Find the bottleneck — usually it is EBS snapshot restore times or ordering issues with CRD-dependent resources.

# Velero backup that you can actually test restoring
velero backup create dr-drill-$(date +%Y%m%d) \
  --include-namespaces payments,orders,production \
  --include-cluster-resources=true \
  --snapshot-volumes=true \
  --wait
 
# After new cluster is up, restore
velero restore create dr-drill-restore \
  --from-backup dr-drill-$(date +%Y%m%d) \
  --restore-volumes=true \
  --wait
 
# Validation script — check critical deployments are healthy
for ns in payments orders production; do
  echo "=== Namespace: $ns ==="
  kubectl get deployments -n $ns -o wide
  kubectl get pods -n $ns --no-headers | grep -v Running | grep -v Completed  # any pod not Running/Completed needs attention
done

Backup Validation: The Test You Are Not Running

Running a backup is not the same as having a recoverable backup. Velero can silently fail to snapshot PersistentVolumes if the snapshot location is misconfigured, or back up Secrets as empty if it lacks the right permissions.

# Check that backups are completing successfully and snapshots are being taken
velero backup get
 
# Inspect a specific backup for warnings
velero backup describe daily-backup-20250826 --details
 
# Verify volume snapshots were created
velero backup describe daily-backup-20250826 | grep -A 5 "Volume Snapshots"

Add an alert: if the most recent successful backup is older than 25 hours, page the on-call. This is a five-minute configuration in your alerting system and catches cases where the backup schedule silently stops working.
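
If you scrape Velero's metrics endpoint with Prometheus, one way to express this is an alert on the age of the last successful backup. A sketch, assuming the Prometheus Operator is installed and Velero's metrics are scraped — the rule name is arbitrary, and the schedule label matches the daily-backup example above:

# PrometheusRule sketch — assumes Prometheus Operator and a scrape of Velero's metrics
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-backup-freshness
  namespace: velero
spec:
  groups:
    - name: velero
      rules:
        - alert: VeleroBackupTooOld
          # velero_backup_last_successful_timestamp is exposed by Velero per schedule
          expr: time() - velero_backup_last_successful_timestamp{schedule="daily-backup"} > 25 * 3600
          for: 15m
          labels:
            severity: page
          annotations:
            summary: "Most recent successful Velero backup is older than 25 hours"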

Key Takeaways

  • A backup is not a disaster recovery plan. DR requires a tested procedure with measured RTO and RPO, not just a scheduled Velero job that nobody has ever restored.
  • For managed clusters (EKS, GKE, AKS), the control plane backup strategy is: express all cluster configuration in infrastructure code. The provider manages etcd. Your "restore" is terraform apply followed by a Velero restore.
  • Velero backup must include includeClusterResources: true to capture CRDs and ClusterRoles. Restoring custom resources without their CRDs fails.
  • The restore order matters: CRDs first, then namespaces and RBAC, then Secrets and ConfigMaps, then Deployments, then PVCs. Velero handles most of this, but dynamic Secrets (from external-secrets-operator) should not be restored from backup.
  • Run a cluster rebuild drill quarterly. Measure the actual time to recovery and compare it to your stated RTO. The gap between stated and actual is your disaster risk.
  • Alert if the most recent successful backup is older than 25 hours. Silent backup failures are common and discovered at the worst possible time.