Autoscaling That Converges
Kubernetes autoscaling looks deceptively simple in the documentation: set a CPU target, attach an HPA, watch replicas scale. In practice, autoscaling that actually converges — that responds to real load, releases capacity when load drops, and does not thrash — requires careful calibration across three separate systems that interact in non-obvious ways.
I have seen HPA configurations that scaled up correctly but never scaled down because the cooldown was misconfigured. I have seen Cluster Autoscaler refuse to scale up because a PodDisruptionBudget blocked node draining. I have seen VPA and HPA fight each other over resource requests. All of these are avoidable with the right model.
The Three Layers
HPA and VPA operate on pods. Cluster Autoscaler operates on nodes. They communicate indirectly: when HPA creates pods that cannot be scheduled because no node has capacity, Cluster Autoscaler sees the unschedulable pods and provisions new nodes.
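You can watch this handoff happen. A quick sketch, assuming the production namespace used throughout this post (the TriggeredScaleUp event reason comes from Cluster Autoscaler and may vary by version):

# Pods the scheduler cannot place: the signal Cluster Autoscaler reacts to
kubectl get pods -n production --field-selector=status.phase=Pending

# Look for FailedScheduling events citing insufficient CPU or memory
kubectl describe pod <pod-name> -n production

# Cluster Autoscaler records an event when it decides to add a node
kubectl get events -n production --field-selector reason=TriggeredScaleUp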
HPA: Getting the Fundamentals Right
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60        # Not 80. See below.
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # Scale up immediately
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60             # Add up to 4 pods per minute
      - type: Percent
        value: 100
        periodSeconds: 60             # Or double replicas, whichever is larger
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60             # Remove at most 2 pods per minute

Why a 60% CPU target and not 80%? The HPA reacts to current metrics. If your target is 80% and a traffic spike arrives, pods need time to spin up, and during that window you are running at 100%+ CPU. Setting the target at 60% gives you a buffer: load can rise roughly 66% above steady state (from 60% to 100% utilization) before you hit saturation while new pods are starting.
The behavior block is not optional. Without it, HPA falls back to defaults that allow removing up to 100% of surplus replicas in a single step, which invites oscillation. The policy above allows aggressive scale-up (immediate, up to a 100% increase per minute) and conservative scale-down (5-minute stabilization window, at most 2 pods per minute). This is the right asymmetry for production web services.
HPA requires metrics-server. If metrics-server is not running, every HPA in the cluster sits idle and kubectl get hpa shows <unknown> for current metrics. This is not obvious from the HPA status message.
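A quick way to verify, assuming the standard metrics-server install in kube-system (the deployment name can vary by distribution):

# Is metrics-server present and ready?
kubectl get deployment metrics-server -n kube-system

# Does the metrics API answer? An error here means HPA has nothing to read
kubectl top pods -n production

# <unknown> in the TARGETS column confirms HPA cannot see metrics
kubectl get hpa -n production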
VPA: Useful for Resource Right-Sizing, Dangerous in Auto Mode
VPA has four update modes: Off, Initial, Recreate, and Auto. Only Off and Initial are safe to run alongside HPA.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  updatePolicy:
    updateMode: "Off"   # Recommendation only — never evict pods automatically
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 4Gi

In Off mode, VPA generates recommendations but never acts on them. Use kubectl get vpa api-service -o json to read the recommendations and apply them to your Deployment resource requests manually. Do this quarterly for any service that has been running long enough to accumulate meaningful metrics.
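To pull just the numbers out, a jsonpath sketch: each entry in containerRecommendations carries a target (the recommended request) plus lowerBound, upperBound, and uncappedTarget, and the status is only populated once the recommender has accumulated history.

# Read the per-container recommendation from the VPA status
kubectl get vpa api-service -n production \
  -o jsonpath='{.status.recommendation.containerRecommendations}'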
Running VPA in Auto mode alongside HPA will cause VPA to evict pods to resize them while HPA is simultaneously scaling the replica count. This creates a feedback loop that is unpleasant to debug and actively harmful to availability.
Cluster Autoscaler vs Karpenter
Cluster Autoscaler is the default and works on all clouds. Its limitation is that it scales pre-defined node groups — you must predict which instance types you need and configure them in advance. Scaling latency is typically 3–5 minutes.
Karpenter (EKS-native; GKE's Autopilot fills a similar role) provisions nodes directly through the EC2 API without pre-configured node groups. It evaluates pending pods and selects the cheapest instance type that satisfies their requirements. Scaling latency is typically under 60 seconds.
# Karpenter NodePool — replaces static node groups
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    metadata:
      labels:
        role: general
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand", "spot"]
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["m", "c", "r"]
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values: ["5"]
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: general
  limits:
    cpu: 1000
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

The disruption.consolidationPolicy: WhenEmptyOrUnderutilized setting is Karpenter's equivalent of scale-down. It replaces multiple underutilized nodes with fewer, more cost-efficient ones. This is the feature that makes Karpenter compelling for mixed on-demand/spot fleets.
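The nodeClassRef above points at an EC2NodeClass, which holds the AWS-specific wiring. A minimal sketch; the IAM role name and the karpenter.sh/discovery tag value are placeholders for whatever your cluster bootstrap created:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: general
spec:
  amiSelectorTerms:
  - alias: al2023@latest              # Track the latest Amazon Linux 2023 AMI
  role: KarpenterNodeRole-my-cluster  # Placeholder IAM role for nodes
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster   # Placeholder discovery tag
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster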
The PodDisruptionBudget Dependency
Neither Cluster Autoscaler nor Karpenter will drain a node if doing so would violate a PodDisruptionBudget. This is a feature — but it becomes a problem when your PDB is configured incorrectly.
# Correct PDB for a 3-replica deployment
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
  namespace: production
spec:
  minAvailable: 2   # Always keep 2 of 3 pods running during disruption
  selector:
    matchLabels:
      app: api-service

If you set minAvailable: 3 on a 3-replica deployment, scale-down can never remove a pod because it would bring available replicas below 3. The node will never drain. Cluster Autoscaler will log that the pod is blocked by a PDB and skip the node indefinitely.
The rule: minAvailable should be at most replicas - 1. If you have 3 replicas, minAvailable: 2. If you have 10 replicas, minAvailable: 8.
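Because HPA moves the replica count constantly, a fixed minAvailable needs rechecking whenever your scaling range changes. A budget expressed as maxUnavailable sidesteps the bookkeeping: it permits the same number of voluntary disruptions at any scale.

# Alternative: a budget that tracks the replica count automatically
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
  namespace: production
spec:
  maxUnavailable: 1   # Always allows exactly one voluntary disruption
  selector:
    matchLabels:
      app: api-service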
Custom Metrics Scaling
CPU and memory are lagging indicators. For queue-based workers, the more accurate signal is queue depth.
# HPA scaling based on SQS queue depth via KEDA
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: queue-worker
  minReplicaCount: 0     # Scale to zero when queue is empty
  maxReplicaCount: 100
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/work-queue
      queueLength: "10"  # 1 replica per 10 messages
      awsRegion: us-east-1
      identityOwner: operator  # Uses IRSA

KEDA (Kubernetes Event-Driven Autoscaler) bridges the gap between HPA's metric limitations and real event sources. It integrates with SQS, Kafka, Redis, Prometheus, and dozens of other sources. For batch and worker workloads, it is often more appropriate than CPU-based HPA.
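The same pattern extends to request-driven services. A sketch of a Prometheus trigger, assuming a Prometheus server reachable at the in-cluster address shown and a hypothetical http_requests_total metric exposed by the service:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: api-service
  minReplicaCount: 3   # A user-facing service should not scale to zero
  maxReplicaCount: 50
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090          # Placeholder address
      query: sum(rate(http_requests_total{app="api-service"}[2m]))  # Hypothetical metric
      threshold: "100"  # Target roughly 100 req/s per replica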
Key Takeaways
- HPA, VPA, and Cluster Autoscaler are independent systems that interact — understand the interaction before combining them. Running VPA in Auto mode alongside HPA causes pod eviction loops.
- Set HPA CPU targets at 60–70%, not 80–90%. You need headroom for scale-up latency.
- The behavior block in HPA is essential for production. Configure aggressive scale-up and conservative scale-down with a 5-minute stabilization window minimum.
- Karpenter outperforms Cluster Autoscaler on EKS for heterogeneous workloads — faster scale-up, instance type flexibility, and cost-driven consolidation.
- PodDisruptionBudgets must allow at least one pod to be disrupted or node draining will deadlock. Set minAvailable to at most replicas - 1.
- For queue-based workers, KEDA with event source metrics (SQS queue depth, Kafka consumer lag) is more accurate and responsive than CPU-based HPA.