Cluster Sizing and Disk Math
Series: Kafka in Production
Part 2 → Partition Strategy
Every Kafka outage I've investigated had a root cause that traces back to one decision made on day one: the team guessed at their cluster size. Not estimated—guessed. They picked three brokers because a blog post said three brokers, and they set retention to seven days because it sounded reasonable. Six months later, disks are at 95%, replication is lagging, and the on-call engineer is deleting topics at 2 AM.
Sizing a Kafka cluster isn't magic. It's arithmetic. Unglamorous, unforgiving arithmetic.
Start With Throughput, Not Broker Count
The instinct is to ask "how many brokers do I need?" That's the wrong first question. Start with your ingress rate in bytes per second, then let the math tell you broker count.
Work through this sequence:
- Raw ingress rate — How many MB/s are producers writing across all topics?
- Replication multiplier — Multiply by your replication factor (almost always 3).
- Retention window — How many seconds of data do you need to keep?
- Disk safety margin — Never fill past 70%. Use a 1.43× multiplier.
```
disk_per_broker = (ingress_rate_MB_s × replication_factor × retention_seconds) / broker_count / 0.70
```

If you flip it around to solve for minimum broker count:
```
min_brokers = ceil(ingress_rate_MB_s × replication_factor × retention_seconds / target_disk_per_broker_MB / 0.70)
```

Run this with real numbers. A team ingesting 200 MB/s with RF=3 and 7-day retention on 2 TB disks:
```
200 MB/s × 3 × 604,800 s = 362,880,000 MB of total data
362,880,000 / 2,000,000 MB per broker / 0.70 ≈ 260 brokers
```

That's not a real answer—it means your retention policy is the actual problem, not your cluster design.
The Retention vs. Broker Tradeoff
Most teams treat retention as a fixed requirement. It isn't. Retention is a cost dial.
Before adding brokers, ask whether consumers actually need 7 days. In most pipelines, downstream systems are S3, a data warehouse, or another persistent store. Kafka is the transport, not the archive. 24–48 hours of retention covers re-processing most failure scenarios. Dropping from 7 days to 2 days cuts disk requirements by 3.5×.
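The tradeoff is easy to sanity-check numerically. A minimal sketch, reusing the 200 MB/s, RF=3 figures from the worked example:

```python
# Cluster-wide disk needed as a function of retention, using the same
# formula as the sizing math: ingress × RF × retention / 0.70 margin.
def cluster_disk_tb(ingress_mb_s: float, rf: int, retention_hours: float) -> float:
    retention_seconds = retention_hours * 3600
    return ingress_mb_s * rf * retention_seconds / 0.70 / 1_000_000  # MB -> TB

seven_day = cluster_disk_tb(200, 3, 7 * 24)
two_day = cluster_disk_tb(200, 3, 2 * 24)
print(f"7-day retention: {seven_day:.0f} TB")   # 518 TB
print(f"2-day retention: {two_day:.0f} TB")     # 148 TB
print(f"Reduction: {seven_day / two_day:.1f}x") # 3.5x
```

The ratio is just 7/2 = 3.5: disk need scales linearly with retention, so the retention dial is the cheapest lever you have.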
Replication Overhead Is Not Just Storage
Teams calculate storage overhead for replication but forget network overhead. With RF=3, every byte written by a producer travels across the network three times total: once from producer to leader, once from leader to follower-1, once from leader to follower-2.
Your inter-broker network must handle:
```
inter_broker_bandwidth = ingress_rate × (replication_factor - 1)
```

At 200 MB/s ingress with RF=3, inter-broker replication alone consumes 400 MB/s of internal bandwidth. This is separate from consumer fetch bandwidth. On AWS, a 10 Gbps instance type gives you roughly 1,250 MB/s. Fill it with one busy topic and you have nothing left.
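A quick budget check makes the squeeze concrete. A sketch, assuming a single consumer group reading each byte once (the fanout of 1 is an assumption, not a rule; the 1,250 MB/s NIC figure is the 10 Gbps number quoted above):

```python
# Rough NIC budget: producer ingress + replication out + consumer fetches.
nic_mb_s = 1250       # ~10 Gbps instance
ingress_mb_s = 200
rf = 3
consumer_fanout = 1   # assumption: one consumer group reads everything once

replication_mb_s = ingress_mb_s * (rf - 1)      # 400 MB/s
consumer_mb_s = ingress_mb_s * consumer_fanout  # 200 MB/s
total_mb_s = ingress_mb_s + replication_mb_s + consumer_mb_s

print(f"Replication: {replication_mb_s} MB/s")
print(f"Total on the wire: {total_mb_s} of {nic_mb_s} MB/s")
print(f"Headroom: {100 * (1 - total_mb_s / nic_mb_s):.0f}%")  # 36%
```

Add a second consumer group and the headroom drops to 20%; a third and you are one rebalance away from saturation.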
```
# Broker config: throttle replication to protect consumer bandwidth.
# The rate throttles are dynamic configs, applied via kafka-configs.sh.
leader.replication.throttled.rate: 50000000    # 50 MB/s
follower.replication.throttled.rate: 50000000  # 50 MB/s
replica.fetch.max.bytes: 10485760
num.replica.fetchers: 4
```

Broker Count: The Partition Constraint
Network and disk give you a floor; partitions give you a ceiling. Each partition has exactly one leader at a time. A broker can lead many partitions, but piling too many partition leaders onto a single broker creates hotspots.
Rough production guidelines:
- Max 2,000–4,000 partitions per broker (including replicas)
- Max 50,000 partitions per cluster if running ZooKeeper (KRaft removes this in practice)
- Keep leaders balanced within ±10% across brokers
Take the maximum across all three constraints, then add one. That extra broker is your rolling-upgrade headroom. Without it, taking any broker offline for patching immediately degrades your cluster below capacity.
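Putting the three floors together, a minimal sketch (the partition counts and the 4,000-per-broker ceiling are illustrative assumptions from the guidelines above):

```python
import math

def min_broker_count(min_disk: int, min_net: int,
                     total_partitions: int, rf: int,
                     max_partitions_per_broker: int = 4000) -> int:
    # Partition floor: every replica of every partition must live on
    # some broker, and each broker should stay under the per-broker cap.
    min_part = math.ceil(total_partitions * rf / max_partitions_per_broker)
    # Max of all three constraints, plus one broker of
    # rolling-upgrade headroom.
    return max(min_disk, min_net, min_part) + 1

# Hypothetical cluster: disk math says 19, network says 1,
# and 12,000 partitions at RF=3 need ceil(36,000 / 4,000) = 9.
print(min_broker_count(19, 1, 12_000, 3))  # -> 20
```

Here disk is the binding constraint; in a cluster with thousands of small topics, the partition floor often wins instead.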
Practical Sizing Example
A realistic e-commerce event pipeline:
- Topics: order-created, inventory-update, user-activity (3 topics)
- Ingress: 50 MB/s combined
- Replication factor: 3
- Retention: 48 hours
- Target disk per broker: 2 TB (70% usable = 1.4 TB)
```python
import math

ingress_mbps = 50
rf = 3
retention_seconds = 48 * 3600
target_disk_mb = 2_000_000 * 0.70  # 70% of 2 TB

total_data_mb = ingress_mbps * rf * retention_seconds
min_brokers_disk = math.ceil(total_data_mb / target_disk_mb)

inter_broker_mbps = ingress_mbps * (rf - 1)
# Assume 500 MB/s usable per broker after OS/consumer share
min_brokers_net = math.ceil(inter_broker_mbps / 500)

min_brokers = max(min_brokers_disk, min_brokers_net) + 1

print(f"Total data: {total_data_mb / 1_000_000:.1f} TB")
print(f"Brokers (disk): {min_brokers_disk}")
print(f"Brokers (net): {min_brokers_net}")
print(f"Recommended: {min_brokers}")
```

Output:
```
Total data: 25.9 TB
Brokers (disk): 19
Brokers (net): 1
Recommended: 20
```

That number surprises people. 20 brokers for 50 MB/s? Yes—because 48-hour retention at RF=3 accumulates fast. This is precisely where tiered storage (covered in post 9) changes the economics dramatically.
Monitoring the Math Post-Launch
Once your cluster is live, watch these metrics—not CPU:
| Metric | Healthy Range | Alert Threshold |
|---|---|---|
| `kafka.log.Log.Size` | < 70% of capacity | > 80% |
| `ReplicaFetcherManager.MaxLag` | < 10,000 msgs | > 100,000 |
| `kafka.server.BrokerTopicMetrics.BytesInPerSec` | Monitor trend | > 80% of NIC |
| Under-replicated partitions | 0 | > 0 for > 30s |
Under-replicated partitions are your most important signal. A non-zero count means a follower fell behind the leader—usually the first sign that disk or network is saturated.
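The "non-zero for more than 30 seconds" rule is easy to encode. A sketch of the alert logic only; the sampling loop and the metric source are assumptions — wire it to whatever exposes your JMX metrics:

```python
from datetime import datetime, timedelta

def urp_alert(samples: list[tuple[datetime, int]],
              window: timedelta = timedelta(seconds=30)) -> bool:
    """Alert when under-replicated partitions stay non-zero for >= window."""
    nonzero_since = None
    for ts, urp_count in samples:
        if urp_count > 0:
            nonzero_since = nonzero_since or ts
            if ts - nonzero_since >= window:
                return True
        else:
            nonzero_since = None  # a momentary blip resets the clock
    return False

t0 = datetime(2024, 1, 1)
blip = [(t0 + timedelta(seconds=s), 1 if s == 10 else 0) for s in range(0, 60, 10)]
sustained = [(t0 + timedelta(seconds=s), 2) for s in range(0, 60, 10)]
print(urp_alert(blip), urp_alert(sustained))  # False True
```

The reset on zero matters: a single leader election briefly shows under-replicated partitions and should not page anyone; a count that stays up should.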
Key Takeaways
- Start with throughput math, not broker count intuition—calculate ingress rate × replication factor × retention seconds, then divide by target disk per broker at 70% capacity.
- Replication multiplies network load, not just storage—inter-broker replication at RF=3 costs 2× your ingress rate in internal bandwidth.
- Retention is a cost dial—cutting retention from 7 days to 2 days reduces disk needs by 3.5×; do this before buying more hardware.
- Broker count is the maximum of three independent constraints: disk math, network math, and partition ceiling—take the max, then add one for rolling-upgrade headroom.
- Under-replicated partitions are your primary health signal—a non-zero count for more than 30 seconds means something is wrong with throughput or availability.
- Tiered storage fundamentally changes the economics—if your sizing math produces a broker count that feels absurd, tiered storage is likely the right lever, not more hardware.