Autoscaling That Converges

Ravinder · 7 min read
Kubernetes · Cloud Native · DevOps · HPA · Autoscaling · Cluster Autoscaler

Kubernetes autoscaling looks deceptively simple in the documentation: set a CPU target, attach an HPA, watch replicas scale. In practice, autoscaling that actually converges — that responds to real load, releases capacity when load drops, and does not thrash — requires careful calibration across three separate systems that interact in non-obvious ways.

I have seen HPA configurations that scaled up correctly but never scaled down because the cooldown was misconfigured. I have seen Cluster Autoscaler refuse to scale up because a PodDisruptionBudget blocked node draining. I have seen VPA and HPA fight each other over resource requests. All of these are avoidable with the right model.

The Three Layers

graph TB
    subgraph Application["Application Layer"]
        HPA[HPA — scales pod replicas]
        VPA[VPA — adjusts pod resource requests]
    end
    subgraph Cluster["Cluster Layer"]
        CA[Cluster Autoscaler / Karpenter — scales nodes]
    end
    subgraph Infra["Infrastructure"]
        NG[Node Groups / EC2 Spot]
    end
    HPA -->|Pending pods trigger| CA
    VPA -->|Evicts pods to resize| CA
    CA --> NG

HPA and VPA operate on pods. Cluster Autoscaler operates on nodes. They communicate indirectly: when HPA creates pods that cannot be scheduled because no node has capacity, Cluster Autoscaler sees the unschedulable pods and provisions new nodes.

HPA: Getting the Fundamentals Right

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60  # Not 80. See below.
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60  # Add up to 4 pods per minute
        - type: Percent
          value: 100
          periodSeconds: 60  # Or double replicas, whichever is larger
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60  # Remove at most 2 pods per minute

Why a 60% CPU target, not 80%? The HPA reacts to current metrics. If your target is 80% and a traffic spike arrives, pods need time to spin up, and during that window you are running at 100%+ CPU. Setting the target at 60% gives you a buffer: from a 60% baseline, load can grow by roughly two-thirds (40 points of headroom over 60 points of usage) before you hit saturation while new pods are starting.
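
For reference, the replica count HPA computes each reconcile loop follows the formula from the Kubernetes documentation; with the 60% target, ten replicas observed at 90% utilization grow to fifteen in one step:

\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentUtilization}}{\text{targetUtilization}} \right\rceil, \qquad \left\lceil 10 \times \frac{90}{60} \right\rceil = 15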

The behavior block is not optional. Without it, HPA falls back to defaults that permit removing up to 100% of replicas every 15 seconds once the scale-down stabilization window passes, an abrupt capacity drop that reads as oscillation under bursty load. The policy above allows aggressive scale-up (immediately, up to 100% increase) and conservative scale-down (5-minute stabilization window, at most 2 pods per minute). This is the right asymmetry for production web services.
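
For contrast, this is the default behavior the controller applies when the block is omitted, as documented in the Kubernetes HPA reference; note the scale-down policy of 100% of replicas per 15-second period:

# Defaults applied when no behavior block is set (per the Kubernetes docs)
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
    selectPolicy: Max
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15  # can drop to minReplicas in a single step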

HPA requires metrics-server. If metrics-server is not running, every HPA in the cluster sits idle and kubectl get hpa shows <unknown> for current metrics; the status message does not make the root cause obvious. A quick check: kubectl top pods fails outright when the metrics API is unavailable.

VPA: Useful for Resource Right-Sizing, Dangerous in Auto Mode

VPA has four update modes: Off, Initial, Recreate, and Auto (which currently behaves like Recreate). Only Off and Initial are safe to run alongside HPA.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  updatePolicy:
    updateMode: "Off"  # Recommendation only — never evict pods automatically
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 4Gi

In Off mode, VPA generates recommendations but never acts on them. Read them with kubectl get vpa api-service -o json and fold them into your Deployment's resource requests manually. Do this quarterly for any service that has been running long enough to accumulate meaningful metrics.
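
The recommendation lives under status.recommendation. A sketch of the shape you get back; the numbers here are illustrative, not real measurements:

# Excerpt of VPA output (values illustrative)
status:
  recommendation:
    containerRecommendations:
      - containerName: api
        lowerBound:   # going below this risks throttling or OOM kills
          cpu: 150m
          memory: 200Mi
        target:       # the recommended resource requests
          cpu: 410m
          memory: 560Mi
        upperBound:   # requests above this are wasted capacity
          cpu: 1200m
          memory: 1Gi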

Running VPA in Auto mode alongside HPA on the same CPU or memory metrics will cause VPA to evict pods to resize them while HPA is simultaneously scaling the replica count based on the very values VPA just changed. This creates a feedback loop that is unpleasant to debug and actively harmful to availability. The combination is only reasonable when HPA scales on a custom or external metric while VPA owns CPU and memory, as sketched below.
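
A sketch of that safe split, assuming an external metrics adapter already exposes a queue-depth metric; the metric name and values are placeholders:

# HPA on an external metric, leaving CPU/memory requests to VPA
# (queue_depth and its adapter are assumed, not a built-in metric)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 2
  maxReplicas: 40
  metrics:
    - type: External
      external:
        metric:
          name: queue_depth      # placeholder metric from the adapter
        target:
          type: AverageValue
          averageValue: "30"     # ~30 queued items per replica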

Cluster Autoscaler vs Karpenter

graph LR
    subgraph CA["Cluster Autoscaler"]
        CA1[Watches unschedulable pods]
        CA2[Selects matching node group]
        CA3[Scales group up by 1-N nodes]
        CA4[Waits for node group API]
        CA1 --> CA2 --> CA3 --> CA4
    end
    subgraph KP["Karpenter - EKS"]
        KP1[Watches unschedulable pods]
        KP2[Computes optimal instance type]
        KP3[Provisions instance directly via EC2 API]
        KP4[Node joins cluster in ~45s]
        KP1 --> KP2 --> KP3 --> KP4
    end

Cluster Autoscaler is the default and works on all clouds. Its limitation is that it scales pre-defined node groups — you must predict which instance types you need and configure them in advance. Scaling latency is typically 3–5 minutes.
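
Scale-down and node-group selection are tuned through flags on the cluster-autoscaler deployment. A sketch of the commonly adjusted ones; the values shown are defaults or illustrative starting points, not recommendations for every cluster:

# cluster-autoscaler container spec (fragment; pin the image tag
# to your cluster's minor version)
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --expander=least-waste                  # pick the node group that wastes the least capacity
      - --balance-similar-node-groups           # keep per-AZ node groups the same size
      - --scale-down-unneeded-time=10m          # node must stay underutilized this long before removal
      - --scale-down-utilization-threshold=0.5  # "underutilized" = requests below 50% of allocatable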

Karpenter (EKS-native; GKE's closest analogue is node auto-provisioning) provisions nodes directly from the EC2 API without pre-configured node groups. It evaluates pending pods and selects the cheapest instance type that satisfies their requirements. Scaling latency is typically under 60 seconds.

# Karpenter NodePool — replaces static node groups
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    metadata:
      labels:
        role: general
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: general
  limits:
    cpu: 1000
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

The disruption.consolidationPolicy: WhenEmptyOrUnderutilized is Karpenter's equivalent of scale-down. It replaces multiple underutilized nodes with fewer, more cost-efficient ones. This is the feature that makes Karpenter compelling for mixed on-demand/spot fleets.
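
The NodePool above points at an EC2NodeClass named general, which carries the AWS-specific settings. A minimal sketch, assuming AL2023 AMIs and subnets/security groups tagged for discovery; the IAM role name and tag values are placeholders you would replace with your own:

# Minimal EC2NodeClass for the NodePool above (role and tags are placeholders)
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: general
spec:
  amiSelectorTerms:
    - alias: al2023@latest            # latest Amazon Linux 2023 AMI
  role: KarpenterNodeRole-my-cluster  # placeholder IAM role for nodes
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster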

The PodDisruptionBudget Dependency

Neither Cluster Autoscaler nor Karpenter will drain a node if doing so would violate a PodDisruptionBudget. This is a feature — but it becomes a problem when your PDB is configured incorrectly.

# Correct PDB for a 3-replica deployment
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
  namespace: production
spec:
  minAvailable: 2  # Always keep 2 of 3 pods running during disruption
  selector:
    matchLabels:
      app: api-service

If you set minAvailable: 3 on a 3-replica deployment, scale-down can never remove a pod, because doing so would bring available replicas below 3. The node will never drain; Cluster Autoscaler will log that the pod's PDB blocks eviction and skip the node indefinitely.

The rule: minAvailable should be at most replicas - 1. If you have 3 replicas, minAvailable: 2. If you have 10 replicas, minAvailable: 8.
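
Because HPA changes the replica count at runtime, an absolute minAvailable that was correct at one scale has to be revisited whenever the replica floor changes. Expressing the budget as maxUnavailable sidesteps this, since it stays valid at any replica count:

# Equivalent budget that stays correct as HPA changes the replica count
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
  namespace: production
spec:
  maxUnavailable: 1  # always permits draining exactly one pod
  selector:
    matchLabels:
      app: api-service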

Custom Metrics Scaling

CPU and memory are lagging indicators. For queue-based workers, the more accurate signal is queue depth.

# HPA scaling based on SQS queue depth via KEDA
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: queue-worker
  minReplicaCount: 0  # Scale to zero when queue is empty
  maxReplicaCount: 100
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/work-queue
        queueLength: "10"  # 1 replica per 10 messages
        awsRegion: us-east-1
        identityOwner: operator  # Uses IRSA

KEDA (Kubernetes Event-Driven Autoscaler) bridges the gap between HPA's metric limitations and real event sources. It integrates with SQS, Kafka, Redis, Prometheus, and dozens of other sources. For batch and worker workloads, it is often more appropriate than CPU-based HPA.
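
Under the hood, KEDA creates and manages an HPA for each ScaledObject and handles the zero-to-one transition itself, which plain HPA cannot do by default. Triggers share the same shape across sources; as a second example, a Prometheus-based trigger, where the server address and query are placeholders:

# Prometheus trigger for a ScaledObject (address and query are placeholders)
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: sum(rate(http_requests_total{service="api"}[2m]))
      threshold: "100"  # target requests/sec per replica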

Key Takeaways

  • HPA, VPA, and Cluster Autoscaler are independent systems that interact — understand the interaction before combining them. Running VPA in Auto mode alongside HPA causes pod eviction loops.
  • Set HPA CPU targets at 60–70%, not 80–90%. You need headroom for scale-up latency.
  • The behavior block in HPA is essential for production. Configure aggressive scale-up and conservative scale-down with a 5-minute stabilization window minimum.
  • Karpenter outperforms Cluster Autoscaler on EKS for heterogeneous workloads — faster scale-up, instance type flexibility, and cost-driven consolidation.
  • PodDisruptionBudgets must allow at least one pod to be disrupted or node draining will deadlock. Set minAvailable to at most replicas - 1.
  • For queue-based workers, KEDA with event source metrics (SQS queue depth, Kafka consumer lag) is more accurate and responsive than CPU-based HPA.