
Stateful Workloads

Ravinder · 6 min read
Kubernetes · Cloud Native · DevOps · StatefulSets · Operators · Databases

Running a database on Kubernetes is one of those decisions that sounds reasonable in a design meeting and reveals its true cost eighteen months later when you are on-call at 2 AM debugging a PersistentVolume that got stuck in Terminating state while your stateful pod fails to reschedule because the storage topology affinity does not match any available node.

That is a specific situation. It has happened. More than once, at more than one company.

The question is not whether you can run stateful workloads on Kubernetes. You can. The question is whether the operational complexity is justified given what managed cloud services now offer.

What StatefulSets Actually Give You

A StatefulSet is essentially a Deployment with three additional guarantees:

  1. Stable network identities — pods get predictable DNS names of the form <pod>.<service>.<namespace>.svc.cluster.local, e.g. postgres-0.postgres.data.svc.cluster.local
  2. Stable storage — each pod gets its own PersistentVolumeClaim that survives pod rescheduling
  3. Ordered deployment and scaling — pods are created in order (0, 1, 2...) and removed in reverse order unless you set podManagementPolicy: Parallel; rolling updates always proceed from the highest ordinal down
graph TD
  SS[StatefulSet: postgres] --> P0[postgres-0]
  SS --> P1[postgres-1]
  SS --> P2[postgres-2]
  P0 -->|Bound to| PVC0[PVC: data-postgres-0]
  P1 -->|Bound to| PVC1[PVC: data-postgres-1]
  P2 -->|Bound to| PVC2[PVC: data-postgres-2]
  PVC0 --> PV0[PersistentVolume: gp3 EBS]
  PVC1 --> PV1[PersistentVolume: gp3 EBS]
  PVC2 --> PV2[PersistentVolume: gp3 EBS]
  HS[Headless Service: postgres] --> P0
  HS --> P1
  HS --> P2

The headless Service (clusterIP: None) is what gives each pod its DNS name. Without it, you lose the stable identity guarantee.

apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: data
spec:
  clusterIP: None  # Headless — required for StatefulSet stable DNS
  selector:
    app: postgres
  ports:
    - port: 5432
      name: postgres
 
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: data
spec:
  serviceName: postgres  # Must reference the headless service
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              memory: "2Gi"
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources:
          requests:
            storage: 50Gi

This gives you a running PostgreSQL pod with persistent storage. It does not give you replication, automatic failover, backup management, connection pooling, or schema migration management. All of that is your problem.
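
To see what "your problem" looks like in its simplest form, here is a minimal sketch of DIY backups: a nightly pg_dump CronJob targeting the pod's stable DNS name. It assumes the postgres-secret from above; the postgres-backups claim is hypothetical, and dumps alone give you snapshots only, not point-in-time recovery.

# Minimal DIY backup sketch: nightly pg_dump to a PVC (no PITR)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: data
spec:
  schedule: "0 3 * * *"  # nightly at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: postgres:16  # ships pg_dump; matches the server version
              command:
                - /bin/sh
                - -c
                - pg_dump -h postgres-0.postgres.data.svc.cluster.local -U postgres | gzip > /backup/dump-$(date +%F).sql.gz
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-secret
                      key: password
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: postgres-backups  # hypothetical PVC for dump files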

Operators: The Abstraction That Actually Helps

The Kubernetes operator pattern exists because StatefulSets alone are insufficient for complex stateful applications. An operator encodes operational knowledge — how to bootstrap a cluster, handle failover, rotate credentials, resize volumes — into a controller that runs inside Kubernetes.

The operators worth knowing:

  • CloudNativePG (PostgreSQL) — the most production-ready Postgres operator. Handles replication, failover, connection pooling via PgBouncer, and backup to S3/GCS.
  • Strimzi (Kafka) — mature Kafka operator with topic management, user management, and Kafka Connect support.
  • Redis Operator (Spotahome) — Redis Sentinel setup with automatic failover.
  • VictoriaMetrics Operator — if you are running your own metrics stack.
# CloudNativePG — Postgres cluster with HA and S3 backup
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-ha
  namespace: data
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:16
 
  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "256MB"
 
  storage:
    size: 50Gi
    storageClass: gp3
 
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://my-postgres-backups/production
      s3Credentials:
        inheritFromIAMRole: true  # Uses IRSA — no static keys
 
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "2Gi"

This is substantially better than a raw StatefulSet. CloudNativePG handles primary election, configures streaming replication automatically, and runs scheduled backups. But you still own the operator version, the CRD lifecycle, and the upgrade path between major PostgreSQL versions.
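
That last item is the real work. A common path for major PostgreSQL upgrades with CloudNativePG is a logical import into a new cluster rather than an in-place upgrade. Below is a hedged sketch of the shape this takes: postgres-ha-17 is a hypothetical name, and postgres-ha-rw / postgres-ha-superuser follow CNPG's naming conventions for the existing cluster. Verify the flow against the docs for your operator version.

# Sketch: major version upgrade via logical import into a new cluster
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-ha-17  # hypothetical new cluster on PostgreSQL 17
  namespace: data
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:17
  storage:
    size: 50Gi
    storageClass: gp3
  bootstrap:
    initdb:
      import:
        type: monolith        # copy all databases and roles at once
        databases: ["*"]
        roles: ["*"]
        source:
          externalCluster: postgres-ha
  externalClusters:
    - name: postgres-ha
      connectionParameters:
        host: postgres-ha-rw.data.svc  # CNPG's read-write service
        user: postgres
        dbname: postgres
      password:
        name: postgres-ha-superuser    # CNPG's generated superuser secret
        key: password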

The Honest Cost Comparison

graph LR
  subgraph Kubernetes["Stateful K8s"]
    OPS[Your Team Owns]
    OPS --> A[Operator upgrades]
    OPS --> B[Volume resizing]
    OPS --> C[Backup verification]
    OPS --> D[Major version upgrades]
    OPS --> E[Replication lag monitoring]
    OPS --> F[Connection pool tuning]
  end
  subgraph Managed["Managed Cloud Service - RDS/Cloud SQL"]
    CLOUD[Cloud Provider Owns]
    CLOUD --> G[Multi-AZ failover]
    CLOUD --> H[Automated backups + PITR]
    CLOUD --> I[Minor version patches]
    CLOUD --> J[Storage auto-scaling]
    YOUR[Your Team Owns]
    YOUR --> K[Major version upgrades]
    YOUR --> L[Parameter group tuning]
  end

Aurora PostgreSQL with Multi-AZ costs roughly 20–40% more than a self-managed equivalent on EC2. It buys you automated failover in under 30 seconds, point-in-time recovery, automated minor version patching, and storage auto-scaling. The premium is almost always worth it for your primary operational database.

Where running stateful workloads on Kubernetes makes sense:

  • Caching layers (Redis, Memcached) where data loss is acceptable and simplicity matters (see the sketch after this list)
  • Message queues where you are already on Kubernetes and Strimzi/RabbitMQ operator maturity is sufficient
  • Development and staging databases where cost matters more than HA guarantees
  • Workloads with unusual scheduling requirements — GPU-backed vector databases, specialized hardware affinity
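
For the caching case, "simplicity matters" often means not reaching for a StatefulSet at all. A minimal sketch (names are illustrative): a Redis cache as a plain Deployment with emptyDir, which accepts that a reschedule wipes the data.

# Cache sketch: no StatefulSet needed when data loss is acceptable
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
  namespace: data
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
        - name: redis
          image: redis:7
          args: ["--maxmemory", "512mb", "--maxmemory-policy", "allkeys-lru"]
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          emptyDir: {}  # contents vanish on reschedule, acceptable for a cache

If you later need Sentinel-backed failover, that is the point where the Spotahome operator earns its keep.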

Volume Gotchas That Will Bite You

# StorageClass with WaitForFirstConsumer — required for multi-AZ clusters
# Without this, volumes provision in a random AZ and pods may not schedule there
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer  # Binds volume to node's AZ
reclaimPolicy: Retain  # Don't delete the EBS volume when PVC is deleted

reclaimPolicy: Retain is the defensive default for production stateful data. The default Delete removes the underlying EBS volume the moment the PVC goes away, whether you deleted it deliberately, tore down the whole namespace, or a retention policy cleaned up after a deleted StatefulSet. Set it to Retain and clean up manually when you are certain the data is no longer needed.
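
On recent Kubernetes versions the StatefulSet itself can also declare what happens to its PVCs through persistentVolumeClaimRetentionPolicy (the StatefulSetAutoDeletePVC feature). The default is Retain, and it is worth stating explicitly so nobody flips it to Delete casually. A sketch of the fragment as it would sit in the StatefulSet spec above:

# StatefulSet spec fragment: keep PVCs even if the StatefulSet goes away
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain  # PVCs survive deletion of the StatefulSet
    whenScaled: Retain   # PVCs survive scale-down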

volumeBindingMode: WaitForFirstConsumer prevents the scheduler from placing a pod on a node in us-east-1b when the EBS volume was provisioned in us-east-1a. Without this, pods can get stuck in Pending indefinitely.
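
If your node groups do not span every zone, you can go one step further and restrict where volumes may be provisioned at all. A sketch of an optional addition to the gp3 StorageClass above, with placeholder zone names and the topology key used by the EBS CSI driver:

# Optional StorageClass addition: only provision volumes in zones with nodes
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.ebs.csi.aws.com/zone
        values:
          - us-east-1a
          - us-east-1b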

Key Takeaways

  • StatefulSets provide stable network identity, stable storage, and ordered lifecycle — they do not provide replication, failover, or backup. Those require operators or managed services.
  • Operators (CloudNativePG, Strimzi) encode operational runbooks into Kubernetes controllers and are a meaningful improvement over raw StatefulSets for complex stateful applications.
  • Managed cloud databases (RDS, Cloud SQL, ElastiCache) offload failover, patching, and backup to the cloud provider. The cost premium is usually worth it for primary operational data.
  • Set reclaimPolicy: Retain on StorageClasses for production data. The default Delete will destroy EBS volumes when PVCs are removed.
  • Set volumeBindingMode: WaitForFirstConsumer on multi-AZ clusters to prevent topology mismatch between pod scheduling and volume availability zones.
  • The real cost of stateful Kubernetes is not in the initial setup — it is in the ongoing operator upgrades, volume management, and major version migration work that accumulates over time.