Cost Optimization
A Kafka cluster's cloud bill has four line items that most teams don't examine until the bill arrives: compute (brokers), disk (EBS or equivalent), network (inter-AZ traffic), and managed service fees if you're on Confluent Cloud or MSK. Of these, the two that cause the most surprise are disk and network—because they scale with data volume in ways that aren't obvious until you're paying for them.
This post works through each cost driver, gives you the math to estimate your current spend, and covers the levers that actually move the needle.
Breaking Down Where the Money Goes
How those costs are distributed shifts significantly on managed services. On Confluent Cloud, the compute cost is bundled into CKU pricing, and cross-AZ traffic is typically not billed separately. On MSK, you pay EC2 + EBS + cross-AZ data transfer—the last of which is the most common budget surprise.
Cross-AZ Traffic: The Hidden Bill
AWS charges $0.01/GB for cross-AZ data transfer in each direction. At RF=3 with brokers spread across three AZs, every byte written by a producer crosses AZ boundaries twice during replication.
```
cross_az_gb_per_month = ingress_MB_s × 2 × 86400 × 30 / 1024
cost_per_month = cross_az_gb_per_month × $0.01
```

For 100 MB/s ingress:

```
100 MB/s × 2 × 86400 × 30 / 1024 = 506,250 GB
506,250 × $0.01 = $5,063/month
```

Just for replication traffic. Consumer fetches from followers in different AZs add more. This is why a cluster that looks cheap on compute can have a surprising network bill.
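The same arithmetic as a small reusable sketch; the 2× factor and the $0.01/GB rate are the assumptions above, and `compression_remaining` is the fraction of bytes left on the wire after compression:

```python
def cross_az_cost_per_month(
    ingress_mb_s: float,
    copies_crossing_az: int = 2,         # RF=3 across three AZs: leader -> 2 followers
    price_per_gb: float = 0.01,          # AWS cross-AZ rate, per direction
    compression_remaining: float = 1.0,  # 1.0 = uncompressed, 0.4 = a 60% reduction
) -> float:
    """Monthly replication-only cross-AZ cost in USD (consumer fetches excluded)."""
    gb_per_month = (ingress_mb_s * copies_crossing_az * 86400 * 30
                    * compression_remaining / 1024)
    return gb_per_month * price_per_gb
```

At 100 MB/s uncompressed this reproduces the ~$5,063/month figure; pass `compression_remaining=0.4` to model a 60% size reduction.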
Mitigation 1: Rack-aware replica placement
Configure brokers with broker.rack matching their AZ. Kafka's rack-aware replica assignment tries to place each partition's replicas in different AZs—which is already what you want for availability. For consumers, prefer followers in the same AZ.
```
# server.properties for broker in us-east-1a
broker.rack=us-east-1a
```

```java
// Consumer: fetch from the closest replica (KIP-392 follower fetching)
props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1a");
// Requires broker setting:
// replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
```

Mitigation 2: Compress before replication
Compression reduces the bytes that cross AZ boundaries. At a 60% size reduction with lz4, your cross-AZ traffic cost drops from ~$5,063 to ~$2,025/month.
Retention: The Disk Cost Multiplier
Retention is the biggest single lever for disk cost. From post 1's math: disk cost scales linearly with retention.
```python
def monthly_disk_cost_usd(
    ingress_mb_s: float,
    rf: int,
    retention_hours: int,
    disk_price_per_gb_month: float = 0.10,  # gp3 pricing
    compression_ratio: float = 0.60,  # fraction of original size retained
) -> float:
    total_gb = (ingress_mb_s * rf * retention_hours * 3600
                * compression_ratio / 1024)
    return total_gb * disk_price_per_gb_month

# 100 MB/s, RF=3, varying retention
for hours in [24, 48, 168]:  # 1d, 2d, 7d
    cost = monthly_disk_cost_usd(100, 3, hours)
    print(f"{hours:4}h retention: ${cost:,.0f}/month")
```

```
  24h retention: $1,519/month
  48h retention: $3,038/month
 168h retention: $10,631/month
```

The question to ask every topic owner: what is your actual re-processing window? Most teams say "7 days" without knowing that their downstream consumers replay from S3, not from Kafka. If the consumer fails and replays from S3, Kafka retention beyond 48 hours is paying for safety theater.
Tiered Storage: Decoupling Retention from Broker Disk
Tiered storage moves log segments older than a configurable threshold from broker disk to object storage (S3, GCS). Brokers retain only recent data; historical data is fetched on demand.
```
# server.properties (Confluent Platform or MSK with tiered storage)
remote.log.storage.system.enable=true
# Keep 24h on broker disk
log.local.retention.ms=86400000
# Keep 30 days total (the rest in S3)
log.retention.ms=2592000000
```

```shell
# Topic-level override
kafka-configs --bootstrap-server broker1:9092 \
  --alter --entity-type topics --entity-name orders \
  --add-config remote.storage.enable=true,\
local.retention.ms=86400000,\
retention.ms=2592000000
```

The economics: S3 costs ~$0.023/GB/month, while EBS gp3 costs ~$0.08–0.10/GB/month. Tiered storage replaces 85–90% of your broker disk with S3, and tiered segments are stored once rather than RF times, cutting storage cost by 70–75% or more for long-retention topics.
Cost for the same 100 MB/s, RF=3, 30-day retention (compressed, $0.10/GB-month EBS):
- Without tiered storage: EBS ≈ $45,563/month
- With tiered storage (24h on EBS, 29d stored once in S3): EBS ≈ $1,519 + S3 ≈ $3,377 ≈ $4,895/month
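That comparison can be reproduced with a small helper, under the same assumptions as `monthly_disk_cost_usd` above (60% of bytes remain after compression, gp3 at $0.10/GB-month, S3 at $0.023/GB-month) plus one more: tiered segments are written to S3 once rather than RF times, since the object store provides its own durability.

```python
def tiered_monthly_cost_usd(
    ingress_mb_s: float,
    rf: int,
    local_hours: int,   # retained on broker disk
    total_hours: int,   # total retention, including tiered segments
    ebs_price: float = 0.10,
    s3_price: float = 0.023,
    compression_ratio: float = 0.60,  # fraction of original size retained
) -> float:
    # Hot data: on EBS, replicated RF times
    ebs_gb = ingress_mb_s * rf * local_hours * 3600 * compression_ratio / 1024
    # Tiered data: one copy in S3 (assumption: S3 handles durability)
    s3_gb = (ingress_mb_s * (total_hours - local_hours) * 3600
             * compression_ratio / 1024)
    return ebs_gb * ebs_price + s3_gb * s3_price
```

`tiered_monthly_cost_usd(100, 3, 24, 720)` gives roughly $4,895/month for the 100 MB/s, 30-day example.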
Compression Codec Selection for Cost
Compression affects three cost dimensions: CPU (broker), disk, and network. The tradeoff:
| Codec | Disk reduction | CPU cost | Cross-AZ traffic reduction |
|---|---|---|---|
| none | 0% | zero | 0% |
| lz4 | 40–60% | very low | 40–60% |
| zstd | 50–70% | medium | 50–70% |
| gzip | 45–65% | high | 45–65% |
For cost optimization specifically: zstd at level 3 gives the best ratio for the CPU cost on modern brokers. For latency-sensitive topics where CPU is constrained, stick with lz4.
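To see why zstd usually wins on pure dollars, plug midpoint reduction estimates from the table above into the earlier disk and cross-AZ formulas. A sketch: the reduction figures are assumptions rather than benchmarks, and broker CPU cost is not priced in.

```python
# Midpoints of the reduction ranges in the table above (assumptions)
REDUCTION = {"none": 0.0, "lz4": 0.50, "zstd": 0.60, "gzip": 0.55}

def monthly_savings_usd(ingress_mb_s: float = 100.0,
                        rf: int = 3,
                        retention_hours: int = 48,
                        disk_price: float = 0.10,
                        cross_az_price: float = 0.01) -> dict:
    """Disk + cross-AZ replication dollars saved per month, per codec."""
    raw_disk_gb = ingress_mb_s * rf * retention_hours * 3600 / 1024
    raw_cross_az_gb = ingress_mb_s * 2 * 86400 * 30 / 1024
    uncompressed = raw_disk_gb * disk_price + raw_cross_az_gb * cross_az_price
    return {codec: round(uncompressed * r) for codec, r in REDUCTION.items()}
```

At 100 MB/s with 48h retention this puts zstd roughly $1,000/month ahead of lz4; whether that clears the extra broker CPU depends on your instance pricing.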
```shell
# Topic-level compression setting
kafka-configs --bootstrap-server broker1:9092 \
  --alter --entity-type topics --entity-name orders \
  --add-config compression.type=zstd
```

Setting compression.type at the topic level overrides any producer-level setting: the broker recompresses incoming batches to the topic's codec. This lets you enforce compression even for producers that forgot to set it, at the cost of some broker CPU for the recompression.
Partition Count and Small-Message Overhead
Each partition is a directory of log segment files. High partition counts with low-throughput topics create many small segments, each of which occupies space and requires metadata tracking.
```shell
# List topics by partition count; newer kafka-topics output includes a TopicId
# column, so locate the PartitionCount field by name rather than by position
kafka-topics --bootstrap-server broker1:9092 --describe \
  | grep "PartitionCount" \
  | awk '{for (i = 1; i < NF; i++) if ($i == "PartitionCount:") print $(i + 1), $2}' \
  | sort -rn \
  | head -20
```

This shows partition counts only; cross-reference the list with per-topic BytesInPerSec metrics for throughput. Topics with > 100 partitions and < 1 MB/s throughput are candidates for consolidation. Reducing from 100 to 12 partitions on a low-throughput topic eliminates 88 partition directories per broker and reduces controller metadata overhead.
Monitoring Cost Metrics
```yaml
# Prometheus alerting rules for cost signals
# (metric names vary by exporter; adjust to match yours)
groups:
  - name: kafka-cost
    rules:
      - alert: KafkaDiskUsageHigh
        expr: kafka_log_size_bytes / kafka_disk_capacity_bytes > 0.75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Broker disk above 75% — check retention settings"
      - alert: KafkaCompressionIneffective
        expr: kafka_server_brokertopicmetrics_compressionratio > 0.9
        for: 30m
        annotations:
          summary: "Low compression ratio — verify compression codec config"
```

Key Takeaways
- Cross-AZ traffic at RF=3 costs 2× your ingress rate in billable network bytes—at $0.01/GB, 100 MB/s ingress generates ~$5,000/month in replication traffic alone before any consumer fetches.
- Retention is a cost dial, not a reliability setting—cutting from 7 days to 2 days reduces disk cost by 3.5×; audit each topic's actual consumer replay window before treating 7-day retention as a requirement.
- Tiered storage reduces long-retention storage cost by 70–75%—it replaces EBS ($0.08–0.10/GB/month) with S3 ($0.023/GB/month) for historical data; the break-even is any topic retaining more than 48 hours.
- Topic-level compression enforcement catches misconfigured producers—setting compression.type at the topic level overrides producer settings; use zstd level 3 for the best cost-to-CPU tradeoff on throughput-sensitive topics.
- Consumer rack awareness eliminates cross-AZ fetch traffic—configuring client.rack and replica.selector.class routes consumers to same-AZ followers; this removes the largest variable cost item after replication.
- Low-throughput topics with high partition counts waste metadata resources—consolidate topics below 1 MB/s with partition counts above their consumer parallelism needs; the operational overhead per partition is not free.