Cluster Sizing and Disk Math
Series: Kafka in Production
Part 2 → Partition Strategy
Every Kafka outage I've investigated had a root cause that traces back to one decision made on day one: the team guessed at their cluster size. Not estimated—guessed. They picked three brokers because a blog post said three brokers, and they set retention to seven days because it sounded reasonable. Six months later, disks are at 95%, replication is lagging, and the on-call engineer is deleting topics at 2 AM.
Sizing a Kafka cluster isn't magic. It's arithmetic. Unglamorous, unforgiving arithmetic.
Start With Throughput, Not Broker Count
The instinct is to ask "how many brokers do I need?" That's the wrong first question. Start with your ingress rate in bytes per second, then let the math tell you broker count.
Work through this sequence:
- Raw ingress rate — How many MB/s are producers writing across all topics?
- Replication multiplier — Multiply by your replication factor (almost always 3).
- Retention window — How many seconds of data do you need to keep?
- Disk safety margin — Never fill past 70%. Use a 1.43× multiplier.
```
disk_per_broker = (ingress_rate_MB_s × replication_factor × retention_seconds) / broker_count / 0.70
```

If you flip it around to solve for minimum broker count:
```
min_brokers = ceil(ingress_rate_MB_s × replication_factor × retention_seconds / target_disk_per_broker_MB / 0.70)
```

Run this with real numbers. A team ingesting 200 MB/s with RF=3 and 7-day retention on 2 TB disks:
```
200 MB/s × 3 × 604,800 s = 362,880,000 MB of total data
362,880,000 / 2,000,000 MB per broker / 0.70 ≈ 260 brokers
```

That's not a real answer—it means your retention policy is the actual problem, not your cluster design.
The Retention vs. Broker Tradeoff
Most teams treat retention as a fixed requirement. It isn't. Retention is a cost dial.
Before adding brokers, ask whether consumers actually need 7 days. In most pipelines, downstream systems are S3, a data warehouse, or another persistent store. Kafka is the transport, not the archive. 24–48 hours of retention covers re-processing most failure scenarios. Dropping from 7 days to 2 days cuts disk requirements by 3.5×.
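The tradeoff is easy to sanity-check numerically. A minimal sketch, reusing the 200 MB/s, RF=3 figures from the worked example:

```python
# Cluster-wide disk needed as a function of retention, using the same
# formula as the sizing math: ingress × RF × retention / 0.70 margin.
def cluster_disk_tb(ingress_mb_s: float, rf: int, retention_hours: float) -> float:
    retention_seconds = retention_hours * 3600
    return ingress_mb_s * rf * retention_seconds / 0.70 / 1_000_000  # MB -> TB

seven_day = cluster_disk_tb(200, 3, 7 * 24)
two_day = cluster_disk_tb(200, 3, 2 * 24)
print(f"7-day retention: {seven_day:.0f} TB")   # 518 TB
print(f"2-day retention: {two_day:.0f} TB")     # 148 TB
print(f"Reduction: {seven_day / two_day:.1f}x") # 3.5x
```

The ratio is just 7/2 = 3.5: disk need scales linearly with retention, so the retention dial is the cheapest lever you have.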
Replication Overhead Is Not Just Storage
Teams calculate storage overhead for replication but forget network overhead. With RF=3, every byte written by a producer travels across the network three times total: once from producer to leader, once from leader to follower-1, once from leader to follower-2.
Your inter-broker network must handle:
```
inter_broker_bandwidth = ingress_rate × (replication_factor - 1)
```

At 200 MB/s ingress with RF=3, inter-broker replication alone consumes 400 MB/s of internal bandwidth. This is separate from consumer fetch bandwidth. On AWS, a 10 Gbps instance type gives you roughly 1,250 MB/s. Fill it with one busy topic and you have nothing left.
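A quick budget check makes the squeeze concrete. A sketch, assuming a single consumer group reading each byte once (the fanout of 1 is an assumption, not a rule; the 1,250 MB/s NIC figure is the 10 Gbps number quoted above):

```python
# Rough NIC budget: producer ingress + replication out + consumer fetches.
nic_mb_s = 1250       # ~10 Gbps instance
ingress_mb_s = 200
rf = 3
consumer_fanout = 1   # assumption: one consumer group reads everything once

replication_mb_s = ingress_mb_s * (rf - 1)      # 400 MB/s
consumer_mb_s = ingress_mb_s * consumer_fanout  # 200 MB/s
total_mb_s = ingress_mb_s + replication_mb_s + consumer_mb_s

print(f"Replication: {replication_mb_s} MB/s")
print(f"Total on the wire: {total_mb_s} of {nic_mb_s} MB/s")
print(f"Headroom: {100 * (1 - total_mb_s / nic_mb_s):.0f}%")  # 36%
```

Add a second consumer group and the headroom drops to 20%; a third and you are one rebalance away from saturation.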
```
# Broker config: throttle replication to protect consumer bandwidth.
# The rate throttles are dynamic configs, applied via kafka-configs.sh.
leader.replication.throttled.rate: 50000000    # 50 MB/s
follower.replication.throttled.rate: 50000000  # 50 MB/s
replica.fetch.max.bytes: 10485760
num.replica.fetchers: 4
```

Broker Count: The Partition Constraint
Network and disk give you a floor; partitions give you a ceiling. Each partition has exactly one leader at a time. A broker can lead many partitions, but piling too many partition leaders onto a single broker creates hotspots.
Rough production guidelines:
- Max 2,000–4,000 partitions per broker (including replicas)
- Max 50,000 partitions per cluster if running ZooKeeper (KRaft removes this in practice)
- Keep leaders balanced within ±10% across brokers
Take the maximum across all three constraints, then add one. That extra broker is your rolling-upgrade headroom. Without it, taking any broker offline for patching immediately degrades your cluster below capacity.
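Putting the three floors together, a minimal sketch (the partition counts and the 4,000-per-broker ceiling are illustrative assumptions from the guidelines above):

```python
import math

def min_broker_count(min_disk: int, min_net: int,
                     total_partitions: int, rf: int,
                     max_partitions_per_broker: int = 4000) -> int:
    # Partition floor: every replica of every partition must live on
    # some broker, and each broker should stay under the per-broker cap.
    min_part = math.ceil(total_partitions * rf / max_partitions_per_broker)
    # Max of all three constraints, plus one broker of
    # rolling-upgrade headroom.
    return max(min_disk, min_net, min_part) + 1

# Hypothetical cluster: disk math says 19, network says 1,
# and 12,000 partitions at RF=3 need ceil(36,000 / 4,000) = 9.
print(min_broker_count(19, 1, 12_000, 3))  # -> 20
```

Here disk is the binding constraint; in a cluster with thousands of small topics, the partition floor often wins instead.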
Practical Sizing Example
A realistic e-commerce event pipeline:
- Topics: order-created, inventory-update, user-activity (3 topics)
- Ingress: 50 MB/s combined
- Replication factor: 3
- Retention: 48 hours
- Target disk per broker: 2 TB (70% usable = 1.4 TB)
```python
import math

ingress_mbps = 50
rf = 3
retention_seconds = 48 * 3600
target_disk_mb = 2_000_000 * 0.70  # 70% of 2 TB

total_data_mb = ingress_mbps * rf * retention_seconds
min_brokers_disk = math.ceil(total_data_mb / target_disk_mb)

inter_broker_mbps = ingress_mbps * (rf - 1)
# Assume 500 MB/s usable per broker after OS/consumer share
min_brokers_net = math.ceil(inter_broker_mbps / 500)

min_brokers = max(min_brokers_disk, min_brokers_net) + 1

print(f"Total data: {total_data_mb / 1_000_000:.1f} TB")
print(f"Brokers (disk): {min_brokers_disk}")
print(f"Brokers (net): {min_brokers_net}")
print(f"Recommended: {min_brokers}")
```

Output:
```
Total data: 25.9 TB
Brokers (disk): 19
Brokers (net): 1
Recommended: 20
```

That number surprises people. 20 brokers for 50 MB/s? Yes—because 48-hour retention at RF=3 accumulates fast. This is precisely where tiered storage (covered in post 9) changes the economics dramatically.
Monitoring the Math Post-Launch
Once your cluster is live, watch these metrics—not CPU:
| Metric | Healthy Range | Alert Threshold |
|---|---|---|
| `kafka.log.Log.Size` | < 70% of capacity | > 80% |
| `ReplicaFetcherManager.MaxLag` | < 10,000 msgs | > 100,000 |
| `kafka.server.BrokerTopicMetrics.BytesInPerSec` | Monitor trend | > 80% of NIC |
| Under-replicated partitions | 0 | > 0 for > 30s |
Under-replicated partitions are your most important signal. A non-zero count means a follower fell behind the leader—usually the first sign that disk or network is saturated.
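The "non-zero for more than 30 seconds" rule is easy to encode. A sketch of the alert logic only; the sampling loop and the metric source are assumptions — wire it to whatever exposes your JMX metrics:

```python
from datetime import datetime, timedelta

def urp_alert(samples: list[tuple[datetime, int]],
              window: timedelta = timedelta(seconds=30)) -> bool:
    """Alert when under-replicated partitions stay non-zero for >= window."""
    nonzero_since = None
    for ts, urp_count in samples:
        if urp_count > 0:
            nonzero_since = nonzero_since or ts
            if ts - nonzero_since >= window:
                return True
        else:
            nonzero_since = None  # a momentary blip resets the clock
    return False

t0 = datetime(2024, 1, 1)
blip = [(t0 + timedelta(seconds=s), 1 if s == 10 else 0) for s in range(0, 60, 10)]
sustained = [(t0 + timedelta(seconds=s), 2) for s in range(0, 60, 10)]
print(urp_alert(blip), urp_alert(sustained))  # False True
```

The reset on zero matters: a single leader election briefly shows under-replicated partitions and should not page anyone; a count that stays up should.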
Key Takeaways
- Start with throughput math, not broker count intuition—calculate ingress rate × replication factor × retention seconds, then divide by target disk per broker at 70% capacity.
- Replication multiplies network load, not just storage—inter-broker replication at RF=3 costs 2× your ingress rate in internal bandwidth.
- Retention is a cost dial—cutting retention from 7 days to 2 days reduces disk needs by 3.5×; do this before buying more hardware.
- Broker count is the maximum of three independent constraints: disk math, network math, and partition ceiling—take the max, then add one for rolling-upgrade headroom.
- Under-replicated partitions are your primary health signal—a non-zero count for more than 30 seconds means something is wrong with throughput or availability.
- Tiered storage fundamentally changes the economics—if your sizing math produces a broker count that feels absurd, tiered storage is likely the right lever, not more hardware.