Kafka in Production

Producer Tuning and Idempotency

Ravinder · 6 min read
Kafka · Streaming · Distributed Systems · Producer · Performance

Most Kafka performance problems I've seen in the field weren't broker problems—they were producer problems. Teams copy a hello-world config, ship it to production, and wonder why throughput is half what they expected or why they're seeing duplicate messages under retry. The producer is where the performance and reliability story actually starts.

This post works through the four knobs that matter most: acknowledgment semantics, batching, compression, and idempotency. Each one has a real tradeoff, and the defaults are almost never right for production.

Acknowledgment Semantics: What acks Actually Means

acks is the single most consequential producer setting. It controls when the broker tells the producer "I got it."

  • acks=0 — Fire and forget. No confirmation. Fastest, no durability.
  • acks=1 — Leader confirms write. Follower replication is async. Leader failure before replication = lost message.
  • acks=all (or -1) — Leader waits for all in-sync replicas (ISR) to confirm. No message loss as long as at least one ISR survives.
sequenceDiagram
    participant P as Producer
    participant L as Leader Broker
    participant F1 as Follower 1
    participant F2 as Follower 2
    P->>L: Produce (acks=all)
    L->>F1: Replicate
    L->>F2: Replicate
    F1-->>L: Ack
    F2-->>L: Ack
    L-->>P: Ack (all ISR confirmed)
    note over L,F2: min.insync.replicas=2 means<br/>at least 2 replicas must confirm

acks=all without min.insync.replicas set at the topic or broker level is a trap. If your ISR shrinks to one broker (the leader itself), acks=all still succeeds—you just confirmed to yourself. Set min.insync.replicas=2 to prevent phantom durability:

# Broker-level default (server.properties)
min.insync.replicas=2
default.replication.factor=3

// Producer config (Java)
Properties props = new Properties();
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);
// With idempotence enabled, ordering is preserved even with retries
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
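
To make the phantom-durability failure concrete, here's a toy Python model of the broker's acceptance check for an acks=all write. This is a deliberate simplification (real brokers track ISR membership per partition and return a NotEnoughReplicas error to the client), but it shows exactly why a shrunk ISR still acks when min.insync.replicas is left at its default of 1:

```python
# Toy model of the broker-side check for an acks=all produce request.
# Simplified for illustration -- not actual broker code.

def accept_write(isr_size: int, min_insync_replicas: int) -> str:
    """Return how the broker responds to an acks=all produce request."""
    if isr_size < min_insync_replicas:
        # Broker rejects the write; the producer will retry.
        return "NOT_ENOUGH_REPLICAS"
    # All *current* ISR members confirm -- even if the ISR is just the leader.
    return "ACK"

# Default min.insync.replicas=1: a lone-leader ISR still acks (phantom durability).
print(accept_write(isr_size=1, min_insync_replicas=1))  # ACK
# With min.insync.replicas=2, the same shrunk ISR is rejected instead:
print(accept_write(isr_size=1, min_insync_replicas=2))  # NOT_ENOUGH_REPLICAS
```

The rejection is the feature: the producer keeps retrying until the ISR recovers, instead of silently accepting a write that only the leader holds.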

Batching: The Throughput Multiplier

The Kafka producer buffers records into batches before sending. Larger batches mean fewer network round trips and better compression ratios—but higher latency. Two settings control this:

  • batch.size — Maximum batch size in bytes. Default is 16 KB. For high-throughput topics, 256 KB or 512 KB is common.
  • linger.ms — How long to wait for a batch to fill before sending anyway. Default is 0 (send immediately). Setting this to 5–20 ms gives the producer time to accumulate records.
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 262144);      // 256 KB
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);            // wait up to 10ms
props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67108864);  // 64 MB total buffer

# Same tuning with confluent-kafka (Python)
from confluent_kafka import Producer
 
producer = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092",
    "acks": "all",
    "batch.size": 262144,
    "linger.ms": 10,
    "compression.type": "lz4",
    "enable.idempotence": True,
    "retries": 2147483647,
    "max.in.flight.requests.per.connection": 5,
})

The interaction between linger.ms and batch.size: whichever threshold is hit first triggers a send. At high throughput batch.size fills before linger.ms expires. At low throughput linger.ms is what triggers the send.

flowchart LR
    A[Record arrives] --> B[Add to batch]
    B --> C{batch.size reached?}
    C -->|Yes| E[Send batch immediately]
    C -->|No| D{linger.ms expired?}
    D -->|Yes| E
    D -->|No| B
    E --> F[Network send to leader]
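
You can estimate which threshold fires first from your record size and produce rate. Here's a back-of-the-envelope calculator (a steady-rate simplification; the real producer batches per partition, so divide your rate accordingly):

```python
# Toy model of the producer's send trigger: a batch ships when either
# batch.size is reached or linger.ms expires, whichever comes first.

def send_trigger(record_bytes: int, rate_per_sec: float,
                 batch_size: int = 262_144, linger_ms: float = 10.0) -> str:
    """Which threshold fires first at a steady produce rate?"""
    bytes_per_ms = record_bytes * rate_per_sec / 1000.0
    ms_to_fill = batch_size / bytes_per_ms
    return "batch.size" if ms_to_fill <= linger_ms else "linger.ms"

# 1 KB records at 50k/sec: a 256 KB batch fills in ~5 ms, before linger expires.
print(send_trigger(1024, 50_000))  # batch.size
# The same records at 1k/sec: linger.ms fires first, on a ~10 KB batch.
print(send_trigger(1024, 1_000))   # linger.ms
```

In the low-rate case the batch that ships is only ~10 KB, which is why batch-size-avg far below batch.size is the signal to watch.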

Compression: Always On, Right Codec

Compression is almost always worth enabling. The CPU cost on producer and broker is low compared to the network and disk savings. At RF=3 with lz4, you can cut wire bytes by 40–70% on typical JSON payloads.

Codec comparison for Kafka workloads:

| Codec | Ratio | CPU Cost | Best For |
| --- | --- | --- | --- |
| none | 1.0× | None | Already-binary data |
| gzip | 3–5× | High | Archival, batch |
| snappy | 2–3× | Low | General use |
| lz4 | 2–4× | Very low | High-throughput, latency-sensitive |
| zstd | 3–6× | Medium | Best ratio at acceptable cost |

lz4 is the default choice for most real-time pipelines. zstd is worth evaluating if you're disk-constrained and can afford slightly more CPU.

props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

One gotcha: compression happens at the batch level, not per-record. A batch with one record compresses poorly. This is another reason linger.ms > 0 matters—more records per batch means better compression ratios.
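
You can see the batch-level effect with a quick experiment. lz4 isn't in Python's standard library, so zlib stands in here as the codec; the shape of the result is the same for any codec that exploits repetition across records:

```python
# Compressing 100 similar JSON records individually vs. as one batch.
# zlib is a stand-in codec (lz4 is not in the standard library).
import json
import zlib

records = [
    json.dumps({"user_id": i, "event": "page_view", "path": "/products/123"}).encode()
    for i in range(100)
]

# One record per "batch": each compression sees almost no redundancy.
per_record = sum(len(zlib.compress(r)) for r in records)
# 100 records per batch: repeated keys and values compress away.
batched = len(zlib.compress(b"".join(records)))

print(per_record, batched)  # batched is several times smaller
```

The repeated field names and near-identical values across records are exactly what the codec needs to see in one window, and it only sees them when the batch holds many records.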

Idempotent Producer: Exactly-Once at the Producer Level

Without idempotency, producer retries can create duplicates. The sequence: producer sends a batch, the broker appends it, the network drops the ack, the producer retries, the broker appends a duplicate. With enable.idempotence=true, the broker tracks a producer ID (PID) and sequence number per partition and deduplicates silently.

// Minimum correct config for idempotent producer
Properties props = new Properties();
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
// These are set automatically when idempotence is enabled,
// but being explicit is clearer:
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);

What idempotency does NOT cover:

  • It's scoped to a single producer session. A restarted producer gets a new PID, so if the application re-sends records after a crash, the broker has no way to recognize them as duplicates.
  • It does not coordinate across multiple producer instances writing to the same topic.
  • It does not span multiple topics (that requires transactions, covered in post 6).
sequenceDiagram
    participant P as Producer (PID=42)
    participant B as Broker
    P->>B: Batch (PID=42, seq=0)
    B->>B: Append, record seq=0
    B--xP: Ack lost (network)
    P->>B: Retry (PID=42, seq=0)
    B->>B: Seq=0 already seen — deduplicate
    B-->>P: Ack (no duplicate written)
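
The dedup logic, and its session-scoped limit, fits in a few lines. This is a toy model (real brokers keep a window of the last five batches per PID/partition, not just the highest sequence number), but it captures both the retry case and the restart case:

```python
# Toy model of broker-side idempotent dedup: the broker tracks the highest
# sequence number seen per (producer id, partition) and drops repeats.
# Simplified -- real brokers keep a window of recent batches per PID.

class Broker:
    def __init__(self):
        self.log = []
        self.last_seq = {}  # (pid, partition) -> highest sequence appended

    def append(self, pid: int, partition: int, seq: int, payload: str) -> str:
        key = (pid, partition)
        if seq <= self.last_seq.get(key, -1):
            return "DUPLICATE"  # already appended; ack without writing again
        self.last_seq[key] = seq
        self.log.append(payload)
        return "APPENDED"

b = Broker()
print(b.append(pid=42, partition=0, seq=0, payload="order-1"))  # APPENDED
# The ack was lost, so the producer retries the same batch:
print(b.append(pid=42, partition=0, seq=0, payload="order-1"))  # DUPLICATE
# After a restart the producer gets a new PID -- dedup no longer applies:
print(b.append(pid=43, partition=0, seq=0, payload="order-1"))  # APPENDED
print(len(b.log))  # 2: the retry was dropped, the post-restart resend was not
```

The last line is the restart caveat from the list above in action: same payload, new PID, second copy in the log.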

Putting It Together: Production Template

public KafkaProducer<String, byte[]> buildProducer(String bootstrapServers) {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
              StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
              ByteArraySerializer.class.getName());
 
    // Durability
    props.put(ProducerConfig.ACKS_CONFIG, "all");
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
 
    // Throughput
    props.put(ProducerConfig.BATCH_SIZE_CONFIG, 262144);
    props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
    props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67108864L);
 
    // Resilience
    props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
    props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000);
    props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30000);
    props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 60000);
 
    return new KafkaProducer<>(props);
}

Monitor these producer-side metrics in production:

| Metric | Healthy | Investigate |
| --- | --- | --- |
| record-error-rate | 0 | > 0 |
| record-retry-rate | < 0.1% | > 1% |
| batch-size-avg | Close to batch.size | Much smaller (linger too low) |
| compression-rate-avg | < 0.6 for lz4 | > 0.9 (compression not working) |
| buffer-available-bytes | > 20% | < 10% (producer backpressure) |
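
A sketch of how these thresholds might translate into an alert check. The metric names match Kafka's producer JMX metrics, but the check function and its sample values are illustrative, not a real monitoring integration:

```python
# Hypothetical alert check over the producer metrics in the table above.
# Thresholds mirror the table; wire this to your actual metrics source.

def check(metrics: dict, batch_size: int = 262_144) -> list:
    alerts = []
    if metrics["record-error-rate"] > 0:
        alerts.append("records failing after all retries")
    if metrics["record-retry-rate"] > 0.01:
        alerts.append("retry rate above 1%")
    if metrics["batch-size-avg"] < 0.25 * batch_size:
        alerts.append("batches far below batch.size -- raise linger.ms?")
    if metrics["compression-rate-avg"] > 0.9:
        alerts.append("compression not effective for this payload")
    return alerts

# Healthy error/retry rates, but tiny batches relative to batch.size:
print(check({"record-error-rate": 0, "record-retry-rate": 0.0,
             "batch-size-avg": 20_000, "compression-rate-avg": 0.5}))
# ['batches far below batch.size -- raise linger.ms?']
```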

Key Takeaways

  • acks=all without min.insync.replicas=2 is a false guarantee—a shrunk ISR lets the leader ack to itself, and you lose the durability you thought you had.
  • linger.ms=10 is almost always better than the default 0—it costs 10ms of latency in exchange for larger batches, better compression, and meaningfully higher throughput.
  • lz4 compression is nearly free CPU-wise and typically saves 40–70% on wire bytes—turn it on by default; the only exception is already-compressed binary payloads.
  • Idempotent producer eliminates retry-induced duplicates within a single producer session—it does not survive process restarts and does not coordinate across multiple producers.
  • retries=MAX_VALUE combined with delivery.timeout.ms is safer than a fixed retry count—you set the deadline, not the number of attempts, and let the producer make as many attempts as fit within it.
  • Batch fill rate is your throughput signal—if batch-size-avg is tiny compared to batch.size, your linger window is too short or your throughput is too low to justify the current batch size.
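
One related constraint worth checking before deploy: the Java producer rejects configs where delivery.timeout.ms is shorter than linger.ms + request.timeout.ms, because the deadline wouldn't leave room for even one full send attempt. A quick sanity check, using the template's values as defaults:

```python
# Pre-deploy sanity check for the producer timeout relationship.
# Defaults match the production template above.

def validate_timeouts(delivery_timeout_ms: int = 120_000,
                      linger_ms: int = 10,
                      request_timeout_ms: int = 30_000) -> bool:
    """True if the delivery deadline leaves room for at least one full attempt."""
    return delivery_timeout_ms >= linger_ms + request_timeout_ms

print(validate_timeouts())                            # True: template values fit
print(validate_timeouts(delivery_timeout_ms=25_000))  # False: no full attempt fits
```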