Kubernetes Pragmatic

Day-2 Operations

Ravinder · 8 min read
Kubernetes · Cloud Native · DevOps · Platform Engineering · Upgrades · GitOps

Day-1 Kubernetes is the conference talk version — provisioning a cluster, deploying your first application, watching the pods spin up. Day-2 is everything after that. It is the part nobody puts in a slide deck: the quarterly upgrade windows, the deprecated APIs that silently break after a control plane upgrade, the configuration drift that accumulates until someone makes a cluster change that nobody can explain, and the team structure question that determines whether any of this is sustainable.

Day-2 is where most Kubernetes investments fail. Not because the technology is wrong, but because the operational model was never designed.

The Upgrade Problem

Kubernetes releases three minor versions per year (1.28, 1.29, 1.30...). Each version is supported for approximately 14 months, which means you must upgrade roughly every 4–6 months to stay on a supported version. This is not optional — unsupported clusters do not receive security patches.

gantt
  title Kubernetes Upgrade Cadence — Required Pace
  dateFormat YYYY-Q[Q]
  section Planning
  Plan 1.29 upgrade    :2024-Q1, 45d
  Execute 1.29 upgrade :2024-Q2, 30d
  Plan 1.30 upgrade    :2024-Q3, 45d
  Execute 1.30 upgrade :2024-Q4, 30d
  Plan 1.31 upgrade    :2025-Q1, 45d
  Execute 1.31 upgrade :2025-Q2, 30d

The upgrade process has a fixed sequence: control plane first, then node groups, then addons. On managed clusters, the control plane upgrade is typically a single API call. The node group upgrade requires rolling replacement of nodes, which means your workloads must survive pod rescheduling — PodDisruptionBudgets must be set correctly or you will get application downtime during node drains.
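A minimal PodDisruptionBudget sketch, assuming the payments-api Deployment mentioned later in this post runs with at least three replicas; the namespace, selector, and minAvailable value are illustrative and must match your own workload:

# PodDisruptionBudget: keep at least 2 payments-api pods running during
# voluntary disruptions such as node drains (values are illustrative)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: production
spec:
  minAvailable: 2          # node drains pause rather than dip below 2 ready pods
  selector:
    matchLabels:
      app: payments-api    # must match the Deployment's pod template labels

With this in place, a node drain evicts payments-api pods one at a time and waits for replacements to become ready before continuing.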

# EKS upgrade sequence
# Step 1: Upgrade control plane
aws eks update-cluster-version \
  --name production \
  --kubernetes-version 1.30
 
# Wait for control plane upgrade to complete
aws eks wait cluster-active --name production
 
# Step 2: Upgrade managed node groups (rolling node replacement)
aws eks update-nodegroup-version \
  --cluster-name production \
  --nodegroup-name general \
  --kubernetes-version 1.30
 
# Step 3: Upgrade addons (after node groups)
aws eks update-addon --cluster-name production --addon-name coredns --addon-version v1.11.1-eksbuild.4
aws eks update-addon --cluster-name production --addon-name kube-proxy --addon-version v1.30.0-eksbuild.3
aws eks update-addon --cluster-name production --addon-name vpc-cni --addon-version v1.18.1-eksbuild.3
aws eks update-addon --cluster-name production --addon-name aws-ebs-csi-driver --addon-version v1.30.0-eksbuild.1

Addon compatibility is the most common upgrade failure point: an addon version that works on 1.29 may not work on 1.30. Check the compatibility matrix for every addon before upgrading the control plane. AWS maintains the compatibility table for EKS-managed addons; for self-managed addons (NGINX ingress, cert-manager, Argo CD), check each project's Kubernetes compatibility notes.

API Deprecations: The Silent Upgrade Breaker

Every Kubernetes version deprecates some APIs and removes others. The versions that remove APIs are the dangerous upgrades: when an API is removed and your workloads or Helm charts still reference it, kubectl apply will fail against the new API server.

# Detect deprecated API usage before upgrading
# Install kubent (Kubernetes No Trouble)
sh -c "$(curl -sSL https://git.io/install-kubent)"
 
# Run against your current cluster
kubent
 
# Example output:
# ....
# >>> Deprecated APIs <<<
# Ingress networking.k8s.io/v1beta1 is deprecated in 1.19 and removed in 1.22
# Found in: production/api-ingress
# Use networking.k8s.io/v1 instead

The standard removals to plan for: the extensions/v1beta1 workload APIs were removed in 1.16, batch/v1beta1 CronJobs in 1.25, and autoscaling/v2beta2 HorizontalPodAutoscalers in 1.26. Check the official deprecated API migration guide for each target version before upgrading.
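To make the fix concrete, here is what the Ingress flagged in the kubent output above looks like after migrating to networking.k8s.io/v1; the host, Service name, and port are placeholders:

# networking.k8s.io/v1 Ingress (replaces the removed v1beta1 schema):
# the backend moves under service.name/service.port and pathType is required
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: production
spec:
  rules:
    - host: api.example.com          # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix         # required in v1
            backend:
              service:
                name: api            # placeholder Service name
                port:
                  number: 80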

Helm chart API versions are often the culprit: many older charts still reference deprecated APIs in their templates. The fix is to upgrade to a chart version that uses the current APIs, which may require running helm upgrade with --force to replace resources that cannot be patched in place.

Configuration Drift

A cluster that has been running for 12+ months accumulates drift: manual kubectl edit changes, one-off patches, experimental configurations that became permanent, and resources nobody remembers creating. Drift is the gap between what your infrastructure code says the cluster should look like and what is actually there.

graph LR
  GIT[Git — desired state] -->|Should match| CLUSTER[Live cluster — actual state]
  CLUSTER -->|Drift accumulates| DIFF[Unknown manual changes]
  DIFF --> RISK[Upgrade failures, security gaps, mystery configs]

GitOps (Argo CD or Flux) is the operational model that keeps drift at zero. Every change to cluster state goes through Git first. The GitOps controller continuously reconciles the live cluster to the declared Git state. Manual changes either get reverted automatically or trigger a visible diff that must be addressed.

# Argo CD Application — declarative cluster state
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/k8s-config
    targetRevision: main
    path: clusters/production/apps
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true       # Remove resources deleted from Git
      selfHeal: true    # Revert manual changes to match Git
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true

selfHeal: true is the setting that enforces zero drift. Any manual change to the cluster is detected and reverted within Argo CD's default three-minute reconciliation interval. If someone runs kubectl edit deployment payments-api directly, Argo CD will revert it on the next sync cycle. This feels restrictive — it is. That is the point.

Observability: The Signals You Need Day-to-Day

Day-2 operations without observability is guesswork. The minimum viable stack:

# kube-state-metrics — exposes K8s object state as Prometheus metrics
# Critical for: deployment rollout status, pod crash loops, HPA state
# Install via Helm
helm install kube-state-metrics prometheus-community/kube-state-metrics \
  --namespace monitoring --create-namespace
 
# Key alerts to configure:
# KubePodCrashLooping: pod restarted > 5 times in 15 minutes
# KubeDeploymentReplicasMismatch: desired != ready replicas
# KubePersistentVolumeFillingUp: PVC > 85% full
# KubeNodeNotReady: node not ready > 5 minutes
# KubeJobFailed: batch job completion failed

These five alerts cover the majority of production incidents. Everything else can be added incrementally.
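If you run the Prometheus Operator (for example via the kube-prometheus-stack chart), the crash-loop alert from the list above can be expressed as a PrometheusRule along these lines; this is a sketch, and the rule name, threshold, and labels should be tuned to your environment:

# PrometheusRule sketch for the crash-loop alert (assumes Prometheus Operator
# and kube-state-metrics are installed; threshold mirrors the list above)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: day2-baseline-alerts
  namespace: monitoring
spec:
  groups:
    - name: workload-health
      rules:
        - alert: KubePodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"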

The Team Model

Running Kubernetes at scale requires a platform team. This is not a suggestion — it is the conclusion of watching teams try to run shared clusters without one.

graph TD
  PT[Platform Team] --> CL[Cluster upgrades and maintenance]
  PT --> AD[Addon management]
  PT --> OB[Observability stack]
  PT --> DR[Disaster recovery testing]
  PT --> SEC[Security policies and admission control]
  PT --> GT[GitOps infrastructure]
  APT[Application Teams] --> WL[Workload deployments]
  APT --> SCA[Scaling configuration]
  APT --> MON[Service-level monitoring]
  APT --> ON[On-call for their services]
  PT -->|Self-service platform| APT

The platform team's output is not YAML — it is a self-service platform that lets application teams deploy without needing to understand the cluster internals. That means golden path templates for Deployments, documented resource request guidelines, runbooks for common failure modes, and an upgrade communication process that gives application teams advance notice.
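What a golden path template standardizes varies by organization, but the core idea is to pin the decisions application teams should not have to make: resource requests, probes, and consistent labels. A sketch with placeholder values:

# Golden path Deployment defaults (illustrative): requests, probes, and labels
# standardized so node drains, autoscaling, and PDBs behave predictably
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service                   # placeholder
  labels:
    app: my-service
spec:
  replicas: 2                        # at least 2 so a PDB can hold during drains
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: app
          image: registry.example.com/my-service:1.0.0   # placeholder image
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 256Mi
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080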

The minimum viable platform team is two engineers. Below that, you are spreading cluster operations across engineers who are primarily focused on application development, and the operational quality will degrade until a crisis forces the issue.

The End of the Series: What to Take Away

Nine posts. The through-line: Kubernetes is a sophisticated platform that pays off at scale and penalizes teams that adopt it before they are ready to operate it. The YAML is not the hard part. The upgrade cycles, the networking model, the identity system, the cost allocation, the team structure — those are the hard parts.

graph LR
  DECIDE[Is K8s right for you?] -->|Yes| CLUSTER[Design the cluster right]
  CLUSTER --> IDENTITY[Workload identity from day one]
  IDENTITY --> NET[Understand the networking layers]
  NET --> STATE[Choose stateful workloads carefully]
  STATE --> SCALE[Configure autoscaling to converge]
  SCALE --> COST[Build cost visibility]
  COST --> TENANT[Define your tenancy model]
  TENANT --> DR[Test your disaster recovery]
  DR --> DAY2[Invest in day-2 operations]

Each of these steps has a failure mode that costs engineering weeks to fix after the fact. None of them are optional if you want a cluster that earns its operational complexity.

Key Takeaways

  • Kubernetes upgrades are mandatory every 4–6 months. Build upgrade windows into quarterly planning or face emergency upgrades when versions hit EOL.
  • API deprecation is the most common upgrade surprise. Run kubent against your cluster before every control plane upgrade to find deprecated API usage in Helm charts and manifests.
  • Configuration drift accumulates silently over months. GitOps with selfHeal: true is the only model that reliably prevents it — manual changes get reverted, which forces all changes through the Git review process.
  • The minimum observability floor is five alerts: crash loop backoff, replica mismatch, PVC filling up, node not ready, and job failure. Everything else is additive.
  • A platform team of at least two dedicated engineers is the minimum viable model for a shared production cluster. Below that, cluster operations degrade until a crisis makes the investment obvious.
  • Kubernetes earns its complexity at scale. Every operational investment from cluster design through day-2 operations compounds — but only if you build it deliberately.