Day-2 Operations
← Part 9
Disaster Recovery
Day-1 Kubernetes is the conference talk version — provisioning a cluster, deploying your first application, watching the pods spin up. Day-2 is everything after that. It is the part nobody puts in a slide deck: the quarterly upgrade windows, the deprecated APIs that silently break after a control plane upgrade, the configuration drift that accumulates until someone makes a cluster change that nobody can explain, and the team structure question that determines whether any of this is sustainable.
Day-2 is where most Kubernetes investments fail. Not because the technology is wrong, but because the operational model was never designed.
The Upgrade Problem
Kubernetes releases three minor versions per year (1.28, 1.29, 1.30...). Each version is supported for approximately 14 months, which means you must upgrade roughly every 4–6 months to stay on a supported version. This is not optional — unsupported clusters do not receive security patches.
The upgrade process has a fixed sequence: control plane first, then node groups, then addons. On managed clusters, the control plane upgrade is typically a single API call. The node group upgrade requires rolling replacement of nodes, which means your workloads must survive pod rescheduling — PodDisruptionBudgets must be set correctly or you will get application downtime during node drains.
# EKS upgrade sequence
# Step 1: Upgrade control plane
aws eks update-cluster-version \
  --name production \
  --kubernetes-version 1.30

# Wait for control plane upgrade to complete
aws eks wait cluster-active --name production

# Step 2: Upgrade managed node groups (rolling node replacement)
aws eks update-nodegroup-version \
  --cluster-name production \
  --nodegroup-name general \
  --kubernetes-version 1.30

# Step 3: Upgrade addons (after node groups)
aws eks update-addon --cluster-name production --addon-name coredns --addon-version v1.11.1-eksbuild.4
aws eks update-addon --cluster-name production --addon-name kube-proxy --addon-version v1.30.0-eksbuild.3
aws eks update-addon --cluster-name production --addon-name vpc-cni --addon-version v1.18.1-eksbuild.3
aws eks update-addon --cluster-name production --addon-name aws-ebs-csi-driver --addon-version v1.30.0-eksbuild.1
Addon compatibility is the most common upgrade failure point. An addon version that works on 1.29 may not work on 1.30. Check the compatibility matrix for every addon before upgrading the control plane. The EKS addon compatibility table is maintained by AWS. For self-managed addons (Nginx ingress, cert-manager, Argo CD), check the project's Kubernetes compatibility notes.
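The node-group step replaces nodes by draining them, and workloads only survive that if they declare how much disruption they tolerate. A minimal PodDisruptionBudget sketch — the app label, namespace, and minAvailable value here are illustrative assumptions, not prescriptions:

```yaml
# PodDisruptionBudget: keep at least 2 replicas of payments-api
# schedulable during voluntary disruptions such as node drains.
# The selector label and minAvailable value are illustrative.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payments-api
```

Note that minAvailable must be lower than the replica count, or the drain can never evict the last pod and the node-group upgrade stalls.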
API Deprecations: The Silent Upgrade Breaker
Every Kubernetes version deprecates some APIs and removes others. The releases that remove APIs are the dangerous ones: when an API is removed and your workloads or Helm charts still reference it, kubectl apply fails against the new API server.
# Detect deprecated API usage before upgrading
# Install kubent (Kubernetes No Trouble)
sh -c "$(curl -sSL https://git.io/install-kubent)"
# Run against your current cluster
kubent
# Example output:
# ....
# >>> Deprecated APIs <<<
# Ingress networking.k8s.io/v1beta1 is deprecated in 1.19 and removed in 1.22
# Found in: production/api-ingress
# Use networking.k8s.io/v1 instead
The standard removals to plan for: extensions/v1beta1 was removed in 1.16, batch/v1beta1 CronJobs in 1.25, autoscaling/v2beta2 in 1.26. Check the official migration guides for each target version before upgrading.
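The Ingress migration flagged in that output is representative: the v1beta1 and v1 schemas differ in more than the apiVersion line. A sketch of the v1 shape, where the host and Service names are placeholders:

```yaml
# networking.k8s.io/v1 Ingress: replaces the removed v1beta1 form.
# Structural changes vs v1beta1: serviceName/servicePort become a
# nested service object, and pathType is now required on each path.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: production
spec:
  rules:
    - host: api.example.com        # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix       # required in v1
            backend:
              service:
                name: api-service  # placeholder Service name
                port:
                  number: 80
```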
Helm chart API versions are often the culprit. Many older Helm charts still use deprecated API versions in their templates. The fix is to upgrade to a chart version that uses the current APIs — which may require running helm upgrade with --force to replace resources that cannot be patched in-place.
Configuration Drift
A cluster that has been running for 12+ months accumulates drift: manual kubectl edit changes, one-off patches, experimental configurations that became permanent, and resources nobody remembers creating. Drift is the gap between what your infrastructure code says the cluster should look like and what is actually there.
GitOps (Argo CD or Flux) is the operational model that keeps drift at zero. Every change to cluster state goes through Git first. The GitOps controller continuously reconciles the live cluster to the declared Git state. Manual changes either get reverted automatically or trigger a visible diff that must be addressed.
# Argo CD Application — declarative cluster state
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/k8s-config
    targetRevision: main
    path: clusters/production/apps
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # Remove resources deleted from Git
      selfHeal: true  # Revert manual changes to match Git
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
selfHeal: true is the setting that enforces zero drift. Any manual change to the cluster is detected within 3 minutes and reverted. If someone runs kubectl edit deployment payments-api directly, Argo CD will revert it on the next sync cycle. This feels restrictive — it is. That is the point.
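selfHeal does not have to fight fields that other controllers legitimately mutate. Argo CD's ignoreDifferences setting exempts specific fields from drift detection, so, for example, an HPA can scale a Deployment without triggering a diff. A sketch of the relevant spec fragment, assuming an Application like the one above:

```yaml
# Fragment of an Argo CD Application spec: ignore spec.replicas on
# Deployments so HPA-driven scaling is not treated as drift.
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
```

Use this sparingly: every ignored field is a field that can drift invisibly.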
Observability: The Signals You Need Day-to-Day
Day-2 operations without observability is guesswork. The minimum viable stack:
# kube-state-metrics — exposes K8s object state as Prometheus metrics
# Critical for: deployment rollout status, pod crash loops, HPA state
# Install via Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-state-metrics prometheus-community/kube-state-metrics \
  --namespace monitoring --create-namespace

# Key alerts to configure:
# KubePodCrashLooping: pod restarted > 5 times in 15 minutes
# KubeDeploymentReplicasMismatch: desired != ready replicas
# KubePersistentVolumeFillingUp: PVC > 85% full
# KubeNodeNotReady: node not ready > 5 minutes
# KubeJobFailed: batch job completion failed
These five alerts cover the majority of production incidents. Everything else can be added incrementally.
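As a concrete example, the crash-loop alert above can be expressed as a PrometheusRule, assuming the Prometheus Operator CRDs are installed. The rule name, namespace, and severity label are illustrative; the threshold mirrors the comment:

```yaml
# PrometheusRule: fires when a container restarts more than
# 5 times in 15 minutes, matching the KubePodCrashLooping intent.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-crash-looping
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-apps
      rules:
        - alert: KubePodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
```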
The Team Model
Running Kubernetes at scale requires a platform team. This is not a suggestion — it is the conclusion of watching teams try to run shared clusters without one.
The platform team's output is not YAML — it is a self-service platform that lets application teams deploy without needing to understand the cluster internals. That means golden path templates for Deployments, documented resource request guidelines, runbooks for common failure modes, and an upgrade communication process that gives application teams advance notice.
The minimum viable platform team is two engineers. Below that, you are spreading cluster operations across engineers who are primarily focused on application development, and the operational quality will degrade until a crisis forces the issue.
The End of the Series: What to Take Away
Nine posts. The through-line: Kubernetes is a sophisticated platform that pays off at scale and penalizes teams that adopt it before they are ready to operate it. The YAML is not the hard part. The upgrade cycles, the networking model, the identity system, the cost allocation, the team structure — those are the hard parts.
Each of these areas — upgrades, deprecations, drift, observability, team structure — has a failure mode that costs engineering weeks to fix after the fact. None of them are optional if you want a cluster that earns its operational complexity.
Key Takeaways
- Kubernetes upgrades are mandatory every 4–6 months. Build upgrade windows into quarterly planning or face emergency upgrades when versions hit EOL.
- API deprecation is the most common upgrade surprise. Run kubent against your cluster before every control plane upgrade to find deprecated API usage in Helm charts and manifests.
- Configuration drift accumulates silently over months. GitOps with selfHeal: true is the only model that reliably prevents it — manual changes get reverted, which forces all changes through the Git review process.
- The minimum observability floor is five alerts: crash loop backoff, replica mismatch, PVC filling up, node not ready, and job failure. Everything else is additive.
- A platform team of at least two dedicated engineers is the minimum viable model for a shared production cluster. Below that, cluster operations degrade until a crisis makes the investment obvious.
- Kubernetes earns its complexity at scale. Every operational investment from cluster design through day-2 operations compounds — but only if you build it deliberately.