Upgrades Without Downtime
Upgrading a Kafka cluster is the operation most teams delay the longest. The cluster works, consumers are happy, nobody wants to touch it. Then a security advisory forces a patch, or a new feature in the next major version becomes critical, and the team discovers they're four minor versions behind with no tested upgrade procedure.
Rolling upgrades in Kafka are genuinely safe when done correctly. The protocol is designed for it. The risk comes from skipping steps, not understanding version skew rules, or attempting the KRaft migration without a rollback plan.
Why Rolling Upgrades Work
Kafka's inter-broker protocol is versioned. Each broker advertises the API versions it supports, and every connection negotiates a version both sides understand. When you upgrade broker by broker, the cluster temporarily runs a mix of old and new brokers—and that's fine, as long as you follow the version skew rules.
The version skew rule: brokers in the same cluster must stay on the same major version line. You cannot run Kafka 2.8 and 3.5 in the same cluster; you can run 3.4 and 3.5. For major version jumps, you must upgrade through an intermediate release.
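This rule is easy to encode as a pre-flight guard. A minimal sketch (`check_upgrade_path` is a hypothetical helper, not a Kafka tool; it compares only the major component of plain "major.minor" version strings):

```shell
# Hypothetical pre-flight guard: reject upgrade plans that jump across
# major version lines; such jumps need an intermediate release.
check_upgrade_path() {
  local current="$1" target="$2"
  # ${var%%.*} strips everything after the first dot, leaving the major version
  if [ "${current%%.*}" != "${target%%.*}" ]; then
    echo "refusing $current -> $target: upgrade via an intermediate release" >&2
    return 1
  fi
  echo "ok: $current -> $target stays on one major version line"
}
```

For example, `check_upgrade_path 3.4 3.5` passes, while `check_upgrade_path 2.8 3.5` exits nonzero.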
Pre-Upgrade Checklist
Before touching a single broker:
# 1. Verify no under-replicated partitions
kafka-topics --bootstrap-server broker1:9092 \
--describe --under-replicated-partitions
# Expected output: empty
# 2. Check broker log dirs are healthy
kafka-log-dirs --bootstrap-server broker1:9092 \
--describe --topic-list orders,inventory
# Look for any "error" entries
# 3. Record current consumer group lag baselines
kafka-consumer-groups --bootstrap-server broker1:9092 \
--describe --all-groups > pre-upgrade-lag-baseline.txt
# 4. Verify client versions — old clients talking to new brokers
kafka-broker-api-versions --bootstrap-server broker1:9092

If under-replicated partitions exist before the upgrade, stop. An upgrade takes brokers offline temporarily; if partitions are already under-replicated, taking a broker offline may make some partitions unavailable.
The Rolling Upgrade Procedure
# For each broker (one at a time):
# Step 1: Trigger a controlled leader election away from this broker
kafka-leader-election --bootstrap-server broker2:9092 \
--election-type preferred \
--topic orders --partition 0
# Or for all partitions:
kafka-leader-election --bootstrap-server broker2:9092 \
--election-type preferred --all-topic-partitions
# Step 2: Graceful shutdown (allows leader transfer, log flush)
kafka-server-stop.sh
# Or if using systemd:
systemctl stop kafka
# Step 3: Wait for the broker to fully stop
# Check that the process is gone and log shows "shut down completed"
# Step 4: Update the binary and config
# Replace kafka binary, update server.properties if needed
# Step 5: Start the broker
systemctl start kafka
# Step 6: Wait for the broker to rejoin the ISR
kafka-topics --bootstrap-server broker2:9092 \
--describe --under-replicated-partitions
# Wait until this returns empty before proceeding to the next broker

The pause at Step 6 is not optional. If you proceed before the upgraded broker has fully caught up and joined the ISR, you reduce fault tolerance during the upgrade window. With RF=3 and two brokers simultaneously offline or catching up, you're one failure away from partition unavailability.
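The Step 6 wait is easy to script so it can't be skipped. A minimal sketch (`wait_for_isr` is a hypothetical helper; it assumes the `kafka-topics` CLI is on PATH):

```shell
# Hypothetical helper: poll until the cluster reports no under-replicated
# partitions, then allow the rollout to continue to the next broker.
wait_for_isr() {
  local broker="$1" retries="${2:-60}" interval="${3:-10}"
  local i out
  for ((i = 0; i < retries; i++)); do
    out=$(kafka-topics --bootstrap-server "$broker" \
            --describe --under-replicated-partitions 2>/dev/null)
    if [ -z "$out" ]; then
      echo "ISR healthy on $broker"
      return 0
    fi
    sleep "$interval"
  done
  echo "timed out waiting for ISR recovery on $broker" >&2
  return 1
}
```

Call it between every broker restart, e.g. `wait_for_isr broker2:9092 || exit 1`, so the rollout aborts rather than proceeding with a degraded ISR.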
Version Skew Between Brokers and Clients
Brokers are backward compatible with older clients within a supported window. The general rule: clients should be no more than two major versions behind the brokers.
After upgrading all brokers, upgrade clients in this order:
- Consumers first (they only read; upgrading them is the lowest-risk change).
- Producers second.
- Admin tools last.
Never upgrade producers before consumers when you're also changing message format or schema. Consumers must be able to read the new format before producers start writing it.
Inter-Broker Protocol and Log Message Format
After all brokers are upgraded, you must explicitly bump the inter-broker protocol version and the log message format version. These settings allow the cluster to use new features:
# server.properties — update after ALL brokers are on new version
inter.broker.protocol.version=3.5
log.message.format.version=3.5

This is a two-phase process by design. During the rolling upgrade, the cluster runs the old protocol version even though brokers are on the new binary. After all brokers are upgraded, you bump the protocol version. This is also your rollback boundary: if you need to downgrade, you can as long as you haven't bumped these versions.
# Verify protocol version across all brokers after bump
kafka-configs --bootstrap-server broker1:9092 \
--describe --broker 0 --all | grep "inter.broker.protocol.version"

KRaft Migration: ZooKeeper to Quorum-Based Metadata
KRaft (Kafka without ZooKeeper) has been production-ready for new clusters since Kafka 3.3; migration of existing ZooKeeper clusters matured in later 3.x releases. The migration requires running in a hybrid mode temporarily.
# Phase 1: Generate a cluster ID (only once)
KAFKA_CLUSTER_ID=$(kafka-storage random-uuid)
# Format KRaft controller storage
kafka-storage format \
--config controller.properties \
--cluster-id $KAFKA_CLUSTER_ID
# Phase 2: Start KRaft controllers in observer mode (alongside ZK)
# controller.properties must include:
# process.roles=controller
# controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
# zookeeper.connect=zk1:2181,zk2:2181 (still connected during migration)
# Phase 3: After metadata migration completes, finalize the metadata version
kafka-features --bootstrap-server broker1:9092 \
upgrade --metadata 3.5

The KRaft migration is not reversible once brokers are migrated off ZooKeeper. Run it in a staging environment first. Have a full cluster snapshot before starting.
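During the hybrid phase, the existing ZooKeeper-mode brokers also need migration flags before they will hand metadata over to the KRaft controllers. A sketch of the relevant server.properties entries (names follow the Kafka 3.x ZooKeeper-to-KRaft migration docs; hosts, ports, and listener names are placeholders):

```properties
# server.properties on each existing ZK-mode broker during migration
zookeeper.metadata.migration.enable=true
controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
controller.listener.names=CONTROLLER
```

Roll these out with a normal broker-by-broker restart; the brokers stay on ZooKeeper until the controllers complete the metadata copy.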
Post-Upgrade Validation
# Confirm all brokers on new version
kafka-broker-api-versions --bootstrap-server broker1:9092 | grep "version"
# Confirm no under-replicated partitions
kafka-topics --bootstrap-server broker1:9092 \
--describe --under-replicated-partitions
# Compare consumer group lag against baseline
kafka-consumer-groups --bootstrap-server broker1:9092 \
--describe --all-groups > post-upgrade-lag.txt
diff pre-upgrade-lag-baseline.txt post-upgrade-lag.txt

Key Takeaways
- Never start an upgrade with under-replicated partitions—taking a broker offline reduces the ISR further; resolve all under-replication before touching the first broker.
- The inter-broker protocol version is your rollback boundary—once bumped, you cannot downgrade; keep it at the old version until all brokers are upgraded and validated.
- Wait for ISR recovery between each broker upgrade—proceeding before the ISR is restored after each broker reduces your fault tolerance and turns a one-broker failure into a potential outage.
- Upgrade consumers before producers when changing message format—consumers must be able to read the new format before producers start writing it; reversing this order risks consumer deserialization failures.
- KRaft migration is one-way—run it in staging, take a full cluster snapshot before starting, and verify the controller quorum is healthy before migrating broker processes.
- Version skew between clients and brokers is bounded—clients more than two major versions behind the broker may encounter unsupported API versions; audit client versions before upgrading brokers.