Incident Library
The best Kafka education is the postmortem you write at 4 AM after an outage you didn't see coming. The second best is reading someone else's. This post is a library of ten production incidents—the kind that show up in real clusters, often ones where someone said "we tested this." For each: the failure mode, the signals that appeared before the page, and the response that worked.
This is a reference document. Bookmark it and grep for your symptom.
Incident 1: Disk Full, Cluster Unwritable
What happened: A broker's OS disk filled up because Kafka log segments accumulated in /tmp during a misconfigured log directory migration. New produces to all partitions on that broker returned NotLeaderOrFollowerException; leaders moved to other brokers but the broker itself stopped accepting any traffic.
Signs you missed: Broker BytesInPerSec dropped to zero 20 minutes before the incident. OS disk usage alert was not configured on /tmp—only on the Kafka log dir.
Response:
# 1. Identify the full disk
df -h
# /tmp is at 100%
# 2. Find what's consuming space
du -sh /tmp/* | sort -rh | head -10
# 3. Remove the temp logs safely (verify they're not active log dirs)
rm -rf /tmp/kafka-logs-migration-temp-*
# 4. Verify broker rejoins ISR
kafka-topics --bootstrap-server broker2:9092 \
--describe --under-replicated-partitions

Prevention: Monitor OS disk usage on ALL partitions, not just the Kafka log directory. Set alerts at 70%, not 90%.
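A lightweight way to enforce that: a scheduled check that walks every mounted filesystem, not just the Kafka log directory. A minimal sketch, assuming GNU coreutils df; the 70% threshold and the echo are placeholders for your own alerting hook:
#!/usr/bin/env bash
# Flag ANY mounted filesystem over 70% full, not just the Kafka log dir
THRESHOLD=70
df --output=pcent,target | tail -n +2 | while read -r pcent mount; do
  usage=${pcent%\%}
  if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "disk usage ${usage}% on ${mount}"   # replace with your paging/alerting hook
  fi
done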
Incident 2: Consumer Group Stuck in Rebalance Loop
What happened: A consumer group with 40 members entered a continuous rebalance loop. Every rebalance took 45 seconds; consumers would rejoin, trigger another rebalance, repeat. Net throughput dropped to zero for 4 hours.
Root cause: One consumer had max.poll.interval.ms=300000 (5 minutes) but average processing time was 280 seconds and occasionally exceeded 300 seconds. The coordinator declared it dead, triggering a rebalance. The slow consumer rejoined, was assigned partitions, processed slowly, was declared dead again.
Signs you missed: kafka.consumer.fetch-manager-metrics.records-lag-max was climbing for 2 hours. ConsumerCoordinator logs showed REBALANCE_IN_PROGRESS every 5 minutes.
Response:
# 1. Identify the slow consumer
kafka-consumer-groups --bootstrap-server broker1:9092 \
--describe --group order-processor
# Look for the member with CLIENT-ID that keeps appearing/disappearing
# 2. Immediate mitigation: remove the slow consumer from the group
# (kill the slow instance — it will be reassigned to healthy consumers)
# 3. Fix the config
# Reduce max.poll.records from 500 to 50 for the slow processing path
# OR increase max.poll.interval.ms to 600000

Prevention: max.poll.interval.ms must be set to your P99 processing latency × 2, not a guess. Measure processing time per record before setting this value.
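To make that concrete, here is what the consumer config might look like once the slow path has actually been measured. The numbers are illustrative, derived from the ~280-second P99 in this incident, not recommendations:
# consumer.properties (values derived from a measured P99 of ~280 s per poll loop)
max.poll.records=50            # smaller batches shorten each poll loop
max.poll.interval.ms=600000    # roughly 2x the measured P99 processing time
session.timeout.ms=45000       # heartbeats run on a separate thread; no need to inflate this
heartbeat.interval.ms=15000    # keep at ~1/3 of session.timeout.ms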
Incident 3: Under-Replicated Partitions Cascade
What happened: A network switch failure between AZs caused follower brokers in one AZ to fall behind. Under-replicated partition count climbed from 0 to 1,400 in 3 minutes. With min.insync.replicas=2, producers to affected topics received NotEnoughReplicasException.
Signs you missed: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions was > 0 for 90 seconds before alert fired. Alert threshold was 5 minutes.
Response:
# 1. Identify affected topics and partitions
kafka-topics --bootstrap-server broker1:9092 \
--describe --under-replicated-partitions > /tmp/urp.txt
wc -l /tmp/urp.txt
# 2. Check if it's network or broker
# If broker is running but replicas are lagging — network issue
# If broker process is dead — broker failure
# 3. For network partition: wait for connectivity to restore
# Monitor ReplicaFetcherManager MaxLag
# Do NOT reduce min.insync.replicas under pressure
# 4. If broker is dead: restart it
systemctl start kafka
# Allow ISR recovery before marking incident resolved

Prevention: Alert on under-replicated partitions > 0 for more than 60 seconds, not 5 minutes. Investigate immediately—it's always a sign of something wrong.
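If your metrics pipeline can't evaluate a 60-second window, even a crude poller beats a 5-minute alert. A rough sketch using the same CLI as above; the broker address, interval, and alert hook are placeholders:
# Poll under-replicated partitions every 30 s and page on any non-zero sample
# (the line count approximates the partition count)
while true; do
  urp=$(kafka-topics --bootstrap-server broker1:9092 \
    --describe --under-replicated-partitions | wc -l)
  if [ "$urp" -gt 0 ]; then
    echo "ALERT: ${urp} under-replicated partitions"   # replace with your paging hook
  fi
  sleep 30
done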
Incident 4: Log Compaction Stall
What happened: A compacted topic grew to 400 GB and stopped compacting. The log cleaner thread was processing it, but compaction was slower than new data arriving. log.cleaner.dedupe.buffer.size was set to the default 128 MB, which was too small for the number of unique keys the cleaner had to deduplicate in each pass.
Signs you missed: kafka.log:type=LogCleaner,name=max-clean-time-secs was 3,600+ for days. Nobody was monitoring log cleaner metrics.
Response:
# Check log cleaner status
kafka-log-dirs --bootstrap-server broker1:9092 --describe \
| grep "numLogSegments"
# Increase cleaner buffer (requires broker restart)
# server.properties:
# log.cleaner.dedupe.buffer.size=536870912  # 512 MB

Prevention: Monitor log cleaner buffer utilization (max-buffer-utilization-percent). For compacted topics > 50 GB, size log.cleaner.dedupe.buffer.size to at least 10% of the largest partition's total data size.
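Applying that 10% rule requires knowing the largest partition's on-disk size. One way to get it, assuming jq is installed; the topic name is illustrative:
# Largest on-disk partition size (bytes) for a compacted topic, from kafka-log-dirs JSON output
kafka-log-dirs --bootstrap-server broker1:9092 --describe --topic-list orders-compacted \
  | grep '^{' \
  | jq '[.brokers[].logDirs[].partitions[].size] | max'
# Take ~10% of that value as a starting point for log.cleaner.dedupe.buffer.size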
Incident 5: Producer Hang on send() — Buffer Full
What happened: Producers for a high-volume topic stopped sending. Application threads blocked on producer.send(). buffer-available-bytes was at zero. The topic's consumers had fallen behind, so new data wasn't being fetched, so the broker's socket buffer backed up, so the producer's send buffer filled.
Response:
// Short-term: bound the wait on the send so application threads don't block indefinitely
// (note: send() itself can still block for up to max.block.ms while the buffer is full)
try {
    Future<RecordMetadata> future = producer.send(record);
    future.get(30, TimeUnit.SECONDS);
} catch (TimeoutException | InterruptedException | ExecutionException e) {
    // Drop or dead-letter queue the record — don't block
    dlq.send(record);
}

# Fix the consumer lag first — that's the root cause
# Scale up consumer instances
kubectl scale deployment order-consumer --replicas=12

Prevention: Set max.block.ms to a value your application can tolerate blocking. Monitor buffer-available-bytes and alert before it hits zero.
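A producer configuration consistent with that advice might look like the following. The values are illustrative; tune max.block.ms to what your request path can actually tolerate:
# producer.properties
max.block.ms=10000             # fail fast instead of blocking send() indefinitely on a full buffer
buffer.memory=67108864         # 64 MB send buffer; alert on buffer-available-bytes well before zero
delivery.timeout.ms=120000     # total time to give up on a record, including retries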
Incident 6: Offset Reset Wipes Processing Progress
What happened: An engineer ran kafka-consumer-groups --reset-offsets --to-earliest on the production consumer group, intending to run it on staging. All consumers rewound to the beginning of the topic (6 weeks of data) and began reprocessing. Downstream database received 6 weeks of duplicate inserts.
Response:
# 1. Immediately pause all consumers
kubectl scale deployment order-processor --replicas=0
# 2. Reset offsets back to the correct position
# First: find the offsets before the accidental reset from your monitoring
# Then: reset to a specific datetime
kafka-consumer-groups --bootstrap-server broker1:9092 \
--reset-offsets --group order-processor \
--to-datetime 2025-06-18T10:30:00.000 \
--topic orders --execute
# 3. Deduplicate downstream
# Run deduplication query against the database using event timestamps

Prevention: Require --dry-run first on all offset reset operations. Restrict kafka-consumer-groups --reset-offsets to a privileged role in production. Write the environment name in your terminal prompt.
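The safe version of the original mistake looks like this: run the reset with --dry-run, read the per-partition offsets it prints, confirm the bootstrap server, and only then re-run with --execute. The group and broker here are illustrative:
# 1. Preview only: prints the would-be offsets per partition, commits nothing
kafka-consumer-groups --bootstrap-server staging-broker1:9092 \
  --reset-offsets --group order-processor \
  --to-earliest --topic orders --dry-run
# 2. Only after the preview (and the environment!) have been double-checked,
#    repeat the exact command with --execute instead of --dry-run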
Incident 7: Topic Deletion Leaves Orphaned Consumer Groups
What happened: A topic was deleted via kafka-topics --delete. The consumer groups subscribed to it continued running, received no records, but kept committing offsets to the now-nonexistent topic's __consumer_offsets entries. When the topic was recreated, groups resumed from stale offsets that now pointed past the end of the new (empty) topic—or in the middle.
Response:
# Before deleting a topic, always:
# 1. List all consumer groups
kafka-consumer-groups --bootstrap-server broker1:9092 --list
# 2. Check which groups are consuming the topic
kafka-consumer-groups --bootstrap-server broker1:9092 \
--describe --all-groups | grep "topic-to-delete"
# 3. Stop those consumer groups before deleting the topic
# 4. Delete the topic
kafka-topics --bootstrap-server broker1:9092 --delete --topic topic-to-delete
# 5. Clean up orphaned consumer group offsets
kafka-consumer-groups --bootstrap-server broker1:9092 \
--delete-offsets --group orphaned-group --topic topic-to-delete

Incident 8: MirrorMaker 2 Stops Replicating Silently
What happened: MM2 stopped replicating topic orders because a new topic with the same name plus the MM2 prefix (us-east.orders) was accidentally created in the source cluster, causing MM2's cycle detection to exclude orders from replication. Consumers in the DR region fell 18 hours behind before anyone noticed.
Signs you missed: The MM2 consumer group lag metric for orders stopped updating—which looks the same as "caught up with zero lag."
Response:
# Check MM2 replication status
kafka-consumer-groups --bootstrap-server dr-broker:9092 \
--describe --group mirrormaker2-source-connector
# Identify excluded topics
# Check MM2 connector config for topics.exclude pattern
curl http://mm2-connect:8083/connectors/MirrorSourceConnector/config \
| jq '.["topics.exclude"]'
# Delete the accidentally created topic that triggered cycle detection
kafka-topics --bootstrap-server source-broker:9092 \
--delete --topic us-east.orders

Prevention: Monitor MM2 consumer group lag as a distinct metric from source topic consumer lag. Alert if MM2 lag is non-decreasing for > 10 minutes.
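Detecting "flat lag" with a single gauge is awkward. A crude but effective check is to sample the mirrored topic's end offsets twice and confirm they are advancing in the DR cluster. This sketch assumes kafka-get-offsets (Kafka 3.0+; older releases can use kafka-run-class kafka.tools.GetOffsetShell); the interval and broker address are placeholders:
# If the source topic is taking writes but the mirrored topic's end offsets are frozen,
# replication has stopped no matter what the lag gauge says
kafka-get-offsets --bootstrap-server dr-broker:9092 --topic us-east.orders --time -1 > /tmp/offsets_t0
sleep 300
kafka-get-offsets --bootstrap-server dr-broker:9092 --topic us-east.orders --time -1 > /tmp/offsets_t1
diff -q /tmp/offsets_t0 /tmp/offsets_t1 && echo "ALERT: mirrored offsets unchanged for 5 minutes"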
Incident 9: Broker Restart Loop Due to Corrupt Log Segment
What happened: A broker crashed mid-write during a power event. On restart, it failed to recover a corrupt log segment and entered a crash loop: start, attempt recovery, fail, restart.
Response:
# 1. Check broker logs for the corrupt segment
grep "ERROR" /var/log/kafka/server.log | tail -50
# Look for: "Found a corruption" or "InvalidOffsetException"
# 2. Identify the corrupt topic-partition
# server.log will show which log file failed
# 3. Kafka runs log recovery automatically after an unclean shutdown; speed it up
#    with more recovery threads (server.properties):
# num.recovery.threads.per.data.dir=4
# 4. If the segment is unrecoverable, delete it
# The partition will be re-replicated from a healthy follower
rm /var/kafka-logs/orders-3/00000000000001234567.log
rm /var/kafka-logs/orders-3/00000000000001234567.index
rm /var/kafka-logs/orders-3/00000000000001234567.timeindex
# 5. Restart the broker — it will fetch the partition from a healthy replica
systemctl start kafka

Prevention: Keep unclean.leader.election.enable=false (the default since Kafka 0.11) so an out-of-sync replica can never be elected leader and silently lose committed data. Use RAID or enterprise SSDs for broker log directories.
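It's also worth confirming no topic has quietly overridden that safe default. An empty result from the topic's dynamic config means it inherits the broker default of false; the topic name is illustrative:
# Check for a per-topic override of unclean leader election
kafka-configs --bootstrap-server broker1:9092 --entity-type topics \
  --entity-name orders --describe | grep unclean.leader.election.enable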
Incident 10: JVM GC Pause Triggers Rebalance Storm
What happened: All brokers in a cluster experienced concurrent CMS garbage collection pauses of 8–12 seconds. Consumer heartbeats missed, group coordinators declared consumers dead, and all consumer groups rebalanced simultaneously. The cluster processed zero messages for 90 seconds.
Response:
# 1. Verify GC is the cause
grep "GC pause" /var/log/kafka/server.log | tail -20
# Or check JVM GC logs
# 2. Switch to G1GC if still on CMS
# KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200"
# 3. Check consumer session timeouts
# session.timeout.ms defaults to 45000 since Kafka 3.0; consumers still on the
# older 10000 default should raise it to ride out pauses like these
# The real fix is reducing GC pause duration
# 4. Reduce broker heap if possible — smaller heap = shorter GC
# KAFKA_HEAP_OPTS="-Xmx4G -Xms4G"
# Kafka brokers do not need large heaps; most memory is off-heap (page cache)

Prevention: Kafka brokers should run with G1GC and a heap of 4–8 GB maximum. Larger heaps increase GC pause duration. The OS page cache (off-heap) does most of the work; memory given to the JVM heap is memory taken from the page cache, and performance suffers.
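After switching collectors, verify that pause times actually dropped rather than trusting the config change. A quick way to pull the worst pauses out of a unified GC log, assuming the broker runs JDK 11+ with -Xlog:gc*:file=/var/log/kafka/gc.log:
# Five longest GC pauses since the last broker restart
grep -E "Pause (Young|Full)" /var/log/kafka/gc.log \
  | grep -oE '[0-9]+\.[0-9]+ms' | sort -n | tail -5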
Key Takeaways
- Under-replicated partitions > 0 for more than 60 seconds is always an incident—it predicts every other category of failure; alert on it immediately and investigate before it cascades.
- Consumer lag is the leading indicator for producer backpressure—a blocked producer is almost always downstream of a slow consumer; fix lag before diagnosing the producer.
- max.poll.interval.ms must be measured, not guessed—set it to P99 processing latency times two; a timeout that fires under normal load creates rebalance loops that compound the original slowness.
- Operator errors (offset resets, topic deletions) cause more downtime than software bugs—enforce dry-run, separate production credentials, and environment labels in terminal prompts.
- MirrorMaker 2 silence looks identical to caught-up replication—monitor that MM2 consumer group lag is actively changing, not just zero; a flat-zero that never moves means replication stopped.
- Kafka brokers belong at 4–8 GB JVM heap maximum—larger heaps cause longer GC pauses that trigger rebalances; the OS page cache, not the JVM heap, is what makes Kafka fast.