
The Cluster You Actually Want

Ravinder · 7 min read
Kubernetes · Cloud Native · DevOps · EKS · GKE · Node Pools

The first cluster most teams build is the cluster they regret. It starts as a single node group, gets addon after addon bolted on as requirements surface, and ends up as an undocumented archaeological site. The second cluster — built after that pain — is much better, because now you know what you actually needed versus what you thought you needed.

This post is about skipping straight to the second cluster.

Start With the Control Plane Decision

The managed control plane is the one place you genuinely do not want to own. GKE, EKS, and AKS each give you a control plane where etcd, the API server, scheduler, and controller manager are handled by the cloud provider. The differences that actually matter in practice:

| Feature | GKE | EKS | AKS |
| --- | --- | --- | --- |
| Autopilot mode | Yes (fully managed nodes) | No equivalent | No equivalent |
| Default CNI | Dataplane V2 (Cilium-based) | VPC-CNI | Azure CNI |
| Upgrade automation | Auto-upgrade channels | Managed node groups | Node pool auto-upgrade |
| IAM integration | Workload Identity | IRSA | Azure Workload Identity |

GKE Autopilot is worth serious consideration if you want Google to manage node lifecycle entirely. You lose the ability to run privileged pods and DaemonSets on arbitrary nodes, which is a real constraint for some workloads — but for most API and batch workloads, it eliminates an entire category of toil.

EKS is the choice when you are deep in the AWS ecosystem and need tight IAM and VPC integration. The operational overhead is higher than GKE Autopilot, but the control is also higher.
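
To make the IAM point concrete, here is what IRSA looks like from the workload side. This is a sketch: the ServiceAccount name and role ARN are placeholders, and the IAM role with its OIDC trust policy must exist before the annotation does anything.

# IRSA — binds a Kubernetes ServiceAccount to an AWS IAM role
# Placeholder ARN; create the role and its OIDC trust policy first
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api
  namespace: default
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/payments-api

Pods running under this ServiceAccount receive short-lived AWS credentials through a projected token instead of sharing the node's instance profile.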

Node Pool Design That Ages Well

A single default node pool is a trap. You will eventually need to separate workload classes, and retrofitting node selectors and taints after the fact is painful. The pattern that works:

graph TD
  A[Cluster] --> B[System Pool]
  A --> C[General Workload Pool]
  A --> D[Memory-Optimized Pool]
  A --> E[Spot / Preemptible Pool]
  B -->|Runs| F[CoreDNS, metrics-server, CSI drivers]
  C -->|Runs| G[Standard API services]
  D -->|Runs| H[ML inference, caching]
  E -->|Runs| I[Batch jobs, CI runners]

Define each pool with explicit taints. Any pool that should not receive arbitrary workloads, the system pool especially, gets a NoSchedule taint, and you add the matching toleration only to the workloads that belong there.

# EKS managed node group — system pool
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
managedNodeGroups:
  - name: system
    instanceType: m6i.large
    minSize: 2
    maxSize: 4
    labels:
      role: system
    taints:
      - key: CriticalAddonsOnly
        value: "true"
        effect: NoSchedule
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      # paired tag used by Cluster Autoscaler's auto-discovery filter
      k8s.io/cluster-autoscaler/production: "owned"
 
  - name: general
    instanceType: m6i.2xlarge
    minSize: 2
    maxSize: 20
    labels:
      role: general
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/production: "owned"
 
  - name: spot
    instanceTypes: ["m6i.2xlarge", "m5.2xlarge", "m5d.2xlarge"]
    spot: true
    minSize: 0
    maxSize: 30
    labels:
      role: spot
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/production: "owned"

The spot pool starts at zero. Cluster Autoscaler provisions it on demand. Batch workloads tolerate the spot=true:NoSchedule taint and get scheduled there. API services never land on spot because they do not carry the toleration.
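
What opting in looks like from the workload side: a minimal sketch of a batch Job that targets the spot pool. The job name and image are placeholders.

# A batch Job that opts into the spot pool
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report          # placeholder name
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        role: spot              # target the spot pool's label
      tolerations:
        - key: spot             # tolerate the pool's NoSchedule taint
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: report
          image: registry.example.com/reports:latest   # placeholder image

The nodeSelector and the toleration travel together: the toleration permits scheduling onto the tainted spot nodes, and the label selector keeps the job off the general pool.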

The Addon Stack — What You Actually Need

This is where teams overload themselves. Every addon sounds useful until you are debugging why your cluster upgrade failed because the cert-manager CRD version is incompatible with the new API server.

Tier 1 — Required from day one:

  • metrics-server — HPA will not function without it (a sketch follows this list)
  • cluster-autoscaler — unless you are on GKE Autopilot or Karpenter
  • CoreDNS (already installed, but review its configuration)
  • A CSI driver for your storage backend (EBS CSI, GCE PD CSI, Azure Disk CSI)
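
To make that first bullet concrete, here is a minimal HorizontalPodAutoscaler, with a placeholder Deployment name. Without metrics-server, the HPA controller has no CPU utilization signal and this object sits idle.

# Minimal HPA — requires metrics-server for the CPU metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api          # placeholder target
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70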

Tier 2 — Add when you have a specific need:

  • cert-manager — when you need TLS certificate automation
  • An ingress controller (AWS LB Controller, Nginx, or Traefik) — when you have HTTP workloads
  • external-dns — when you need DNS records managed automatically
  • kube-state-metrics — when you set up Prometheus

Tier 3 — Evaluate carefully, add last:

  • OPA/Gatekeeper or Kyverno — when you have policy requirements
  • Linkerd or Istio — when you need mTLS or traffic management at the mesh level
  • Argo CD or Flux — when you are ready to commit to GitOps
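
The example below expresses a cert-manager install as a HelmChart resource, the helm controller format that ships with k3s and Rancher; if you do not run that controller, the same values apply to a plain helm install.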
# Helm values for cert-manager — minimal, sane defaults
# Install only when you have TLS cert requirements
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: cert-manager
  namespace: kube-system
spec:
  chart: cert-manager
  repo: https://charts.jetstack.io
  version: v1.14.0
  targetNamespace: cert-manager
  valuesContent: |-
    installCRDs: true
    replicaCount: 2
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
      limits:
        memory: 256Mi
    podDisruptionBudget:
      enabled: true
      minAvailable: 1
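
Once cert-manager is running, the first object you will create is an issuer. A minimal sketch using Let's Encrypt's production ACME endpoint; the email is a placeholder, and the HTTP-01 solver assumes an ingress class named nginx exists.

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@example.com          # placeholder
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx      # assumes an nginx ingress class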

Version Skew and Upgrade Windows

Kubernetes releases three minor versions per year and EOLs them after approximately 14 months. The practical implication: you must upgrade roughly every 4–6 months to stay on a supported version, and faster still if you are already a version or two behind.

Build the upgrade cadence into your planning. The teams that do not are the ones that end up in emergency upgrade situations where the only path forward is a disruptive version jump that breaks APIs they are using.

gantt
  title Kubernetes Version Lifecycle Planning
  dateFormat YYYY-MM
  section K8s 1.28
  Active Support :2023-08, 6M
  Maintenance    :2024-02, 8M
  EOL            :milestone, 2024-10, 0M
  section K8s 1.29
  Active Support :2024-01, 6M
  Maintenance    :2024-07, 8M
  EOL            :milestone, 2025-03, 0M
  section K8s 1.30
  Active Support :2024-04, 6M
  Maintenance    :2024-10, 8M
  EOL            :milestone, 2025-06, 0M

Use your cloud provider's upgrade channels where available. GKE's regular channel and EKS's managed node group rolling updates are both safe for production if you have PodDisruptionBudgets set on your workloads.
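
The PodDisruptionBudget is the piece teams most often skip. A minimal sketch, assuming a Deployment whose pods carry the label app: payments-api:

# PDB — keeps voluntary disruptions (node drains, rolling upgrades)
# from taking every replica down at once
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: payments-api        # placeholder label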

The Infrastructure-as-Code Commitment

Resist the temptation to click through the console to create your first cluster. Any configuration not in code will be forgotten. The cluster you actually want is the one you can rebuild from scratch in under 30 minutes.

# Terraform — EKS cluster skeleton
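# Assumes a companion "vpc" module (module.vpc) defined elsewhere in the configuration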
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"
 
  cluster_name    = "production"
  cluster_version = "1.30"
 
  cluster_endpoint_public_access = true
 
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets
 
  eks_managed_node_groups = {
    system = {
      min_size       = 2
      max_size       = 4
      instance_types = ["m6i.large"]
      taints = [{
        key    = "CriticalAddonsOnly"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
 
    general = {
      min_size       = 2
      max_size       = 20
      instance_types = ["m6i.2xlarge"]
    }
  }
 
  cluster_addons = {
    coredns            = { most_recent = true }
    kube-proxy         = { most_recent = true }
    vpc-cni            = { most_recent = true }
    aws-ebs-csi-driver = { most_recent = true }
  }
}

Key Takeaways

  • Choose a managed control plane and do not look back — the time spent operating etcd is time not spent on your product.
  • Design node pools by workload class from day one: system, general, and spot as a minimum. Retrofitting taints and labels after workloads are running is painful.
  • Install addons in tiers. Start with only what is required for HPA and storage, and add the rest when you have a concrete requirement.
  • Kubernetes version upgrades are not optional events. Build an upgrade window into your quarterly planning or you will be forced into emergency upgrades.
  • Every cluster configuration decision must live in version-controlled infrastructure code. If it is not in Terraform or eksctl YAML, it does not exist.
  • The cluster you want is one you can destroy and rebuild in 30 minutes without tribal knowledge.