Event-Driven Architecture

Replay and Reprocessing

Ravinder · 6 min read
Event-Driven Architecture · Distributed Systems · Architecture · Kafka · Event Sourcing

One of the most powerful properties of event-driven systems built on durable brokers is the ability to replay history. A consumer can reset its offset to the beginning of a topic and reprocess every event as if it were seeing them for the first time. This sounds simple. In practice, replaying events safely in a production system—without corrupting live state, firing duplicate notifications, or overwhelming downstream services—requires careful design.

This post covers when replay is the right tool, how to execute it safely, and the projection rebuild story for event-sourced systems.

When Replay Is Appropriate

Bug fix recovery: a consumer had a bug for two weeks and produced incorrect output. Fix the bug, reset the offset, replay the events, produce correct output. This is replay's clearest use case.

New consumer bootstrap: you add a new analytics service that needs to understand the full history of orders. Rather than backfilling from a database dump, you replay from the beginning of the orders.placed topic.

Projection rebuild: in event-sourced systems, the read model (projection) is derived by processing all events in order. When the projection schema changes, you rebuild from scratch by replaying all events.

Schema migration testing: replay events through a new version of a consumer to verify correctness before switching production traffic.

What Makes Replay Hard

The naive assumption is that replay is just "reset the offset and run." The problems emerge in three areas.

Side effects are not idempotent by default. If your consumer sends an email for every OrderShipped event, replaying 50,000 historical events sends 50,000 emails. Notifications, payment initiations, external API calls—all of these must be suppressed during replay or made inherently idempotent.
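One way to make a side effect safe under replay is to record a delivery key before acting, so the second delivery of the same event is a no-op. A minimal sketch, assuming each event carries a unique `event_id` field; the `sent_notifications` table and `send_email` stub are illustrative, not part of any real system:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE IF NOT EXISTS sent_notifications (event_id TEXT PRIMARY KEY)")

def send_email(event: dict) -> None:
    # Stub standing in for a real mail-client call.
    print(f"email for {event['event_id']}")

def notify_once(event: dict) -> bool:
    """Send the notification only if this event_id has never been recorded."""
    try:
        # Inserting the key first makes the primary key the dedup guard.
        db.execute("INSERT INTO sent_notifications (event_id) VALUES (?)",
                   (event["event_id"],))
    except sqlite3.IntegrityError:
        return False  # Seen before: a replayed duplicate, so skip the send.
    db.commit()
    send_email(event)
    return True
```

The first delivery of an event sends the email and returns True; any replay of the same event returns False without touching the mail client.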

Downstream systems see duplicate writes. A consumer that writes to a database during replay may produce conflicting writes with the live consumer running in parallel. You need a strategy for managing the replay consumer's output independently.
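If downstream writes are keyed, versioned upserts rather than blind inserts, a replayed event converges to the same row instead of conflicting with the live consumer. A sketch using SQLite's `ON CONFLICT` clause; the `fulfillment` table, its columns, and the per-event `version` field are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fulfillment (
    order_id TEXT PRIMARY KEY,
    status   TEXT,
    version  INTEGER
)""")

def apply_event(event: dict) -> None:
    # Upsert keyed on order_id; the higher version wins, so an older
    # replayed event cannot clobber a newer write from the live consumer.
    conn.execute("""
        INSERT INTO fulfillment (order_id, status, version)
        VALUES (:order_id, :status, :version)
        ON CONFLICT(order_id) DO UPDATE SET
            status = excluded.status,
            version = excluded.version
        WHERE excluded.version > fulfillment.version
    """, event)
    conn.commit()
```

Applying events out of order leaves the row at the highest version seen, which is exactly the convergence property replay needs.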

Ordering guarantees break under parallelism. Replaying with higher parallelism than the original processing changes the effective processing order within a partition. If your business logic is order-sensitive, replay parallelism must match original parallelism or you must process single-threaded.
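One common compromise is to parallelize across keys while keeping each key's events serial: route every event to a lane chosen by hashing its key, so one worker owns all events for a given order and their relative order survives. A sketch under that assumption (the `order_id` key field is illustrative):

```python
from collections import defaultdict

def partition_by_key(events: list, num_workers: int) -> dict:
    """Assign each event to a worker lane by hashing its key, so all
    events for one key land in one lane and keep their relative order."""
    lanes = defaultdict(list)
    for event in events:
        lanes[hash(event["order_id"]) % num_workers].append(event)
    return lanes
```

Each lane can then be drained by its own single-threaded worker: throughput scales with the number of distinct keys, while per-key ordering matches the original processing.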

Offset Management

In Kafka, consumer group offsets are stored in the __consumer_offsets topic. Resetting them changes where the consumer group starts reading.

# Stop the consumer group first
# Then reset to beginning (swap --execute for --dry-run to preview the change)
kafka-consumer-groups.sh \
  --bootstrap-server kafka:9092 \
  --group inventory-service \
  --topic orders.placed \
  --reset-offsets \
  --to-earliest \
  --execute
 
# Or reset to a specific timestamp
kafka-consumer-groups.sh \
  --bootstrap-server kafka:9092 \
  --group inventory-service \
  --topic orders.placed \
  --reset-offsets \
  --to-datetime 2025-09-15T00:00:00.000 \
  --execute

Never reset offsets on a running consumer group. The group must be stopped before offset changes take effect.

sequenceDiagram
    participant Ops as Operator
    participant CG as Consumer Group
    participant Kafka as Kafka Broker
    participant DB as Output DB
    Ops->>CG: Stop consumer group
    Ops->>Kafka: Reset offset to earliest
    Ops->>DB: Truncate/snapshot replay output table
    Ops->>CG: Start consumer group (replay mode)
    CG->>Kafka: Fetch from offset 0
    Kafka-->>CG: Events [0..N]
    CG->>DB: Write idempotently
    Note over CG: Replay complete at current end offset
    Ops->>CG: Switch to live mode

Replay Mode Pattern

Design consumers with explicit replay awareness. A REPLAY_MODE flag changes behavior:

import logging
import os
from enum import Enum

logger = logging.getLogger(__name__)
 
class ConsumerMode(Enum):
    LIVE = "live"
    REPLAY = "replay"
 
CONSUMER_MODE = ConsumerMode(os.getenv("CONSUMER_MODE", "live"))
 
def handle_order_shipped(event: dict) -> None:
    update_fulfillment_record(event)           # Always run
    
    if CONSUMER_MODE == ConsumerMode.LIVE:
        send_customer_notification(event)      # Live only
        call_warehouse_api(event)              # Live only
    else:
        # Replay: skip external side effects
        logger.info(f"[REPLAY] Skipping notification for {event['orderId']}")

The replay mode writes to a shadow database or a replay-specific table. After replay completes and output is verified, you promote the shadow state to primary.
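Promotion can be an atomic rename rather than a copy, so readers never observe a half-migrated table. A SQLite sketch of the swap; the table names are illustrative, and most relational databases offer an equivalent `ALTER TABLE ... RENAME`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_projection (order_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE orders_projection_shadow (order_id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO orders_projection_shadow VALUES ('o1'), ('o2')")

def promote_shadow(conn: sqlite3.Connection) -> None:
    # Both renames run in one transaction, so readers see either the
    # old table or the fully built shadow, never an intermediate state.
    with conn:
        conn.execute("ALTER TABLE orders_projection RENAME TO orders_projection_old")
        conn.execute("ALTER TABLE orders_projection_shadow RENAME TO orders_projection")
```

Keeping the old table around under a `_old` suffix gives a fast rollback path until the new projection has been verified.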

Projection Rebuild

Event sourcing relies on projections—read models derived from the event stream. When business requirements change, you change the projection logic and rebuild.

graph TD
    subgraph "Event Store"
        E1[OrderPlaced] --> E2[OrderShipped]
        E2 --> E3[OrderDelivered]
        E3 --> E4[OrderReturned]
    end
    subgraph "Rebuild"
        P[Projection Builder] -->|reads all events| E1
        P --> New[(New Projection DB)]
    end
    subgraph "Swap"
        Old[(Old Projection DB)] -->|deprecated| X
        New -->|promoted| Live[Live Read Model]
    end

A safe rebuild sequence:

  1. Deploy new projection builder with new schema
  2. Build into a new table (orders_projection_v2)
  3. Once caught up to current offset, take a snapshot
  4. Switch API reads to the new table (blue-green or feature flag)
  5. Deprecate old table

import json
import logging

from kafka import KafkaConsumer  # kafka-python client

logger = logging.getLogger(__name__)

def rebuild_projection():
    consumer = KafkaConsumer(
        "orders",
        group_id="projection-builder-v2",  # New group ID — doesn't affect live consumer
        auto_offset_reset="earliest",
        enable_auto_commit=False
    )
    
    target_table = "orders_projection_v2"
    db.execute(f"TRUNCATE {target_table}")
    
    for message in consumer:
        event = json.loads(message.value)
        apply_event_to_projection(event, target_table)
        consumer.commit()
        
        # Check if we've caught up to end of topic
        if is_caught_up(consumer, message):
            logger.info("Projection rebuild complete")
            break

Use a separate consumer group for the rebuild so it does not affect the live consumer group's offsets.
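The `is_caught_up` helper above can be implemented by comparing the message's offset to its partition's current end offset. A sketch written against a kafka-python-style `end_offsets` call, which returns the next offset to be produced per partition; here the partition key is a plain tuple so the logic stands alone (the real client uses a `TopicPartition` object):

```python
def is_caught_up(consumer, message) -> bool:
    """True once this message is the last one currently in its partition.

    `consumer.end_offsets` is assumed to mirror kafka-python's API:
    it returns, per partition, the offset of the next record to be
    produced, so the last existing record sits at end_offset - 1.
    """
    tp = (message.topic, message.partition)
    end_offset = consumer.end_offsets([tp])[tp]
    return message.offset >= end_offset - 1
```

Note that the end offset keeps moving while live producers write, so "caught up" means caught up to the moment of the check; the cutover step still needs a brief pause or a final reconciliation pass.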

Throttling Replay

Replaying millions of events at full speed competes with live consumers for broker resources and can overwhelm downstream databases.

import os
import time

REPLAY_RATE_LIMIT = int(os.getenv("REPLAY_EVENTS_PER_SECOND", "500"))

class ThrottledReplayConsumer:
    def __init__(self):
        self.processed = 0
        self.window_start = time.time()

    def process_with_throttle(self, event):
        self.process(event)  # Actual event handling supplied by a subclass
        self.processed += 1
        
        elapsed = time.time() - self.window_start
        expected_elapsed = self.processed / REPLAY_RATE_LIMIT
        
        if expected_elapsed > elapsed:
            time.sleep(expected_elapsed - elapsed)

Start with a conservative rate, monitor broker and database metrics, and increase if headroom exists.

The Retention Window Constraint

Replay is only possible within the broker's retention window. If Kafka retains 7 days and you need to replay 30 days of events, you cannot—the early events are gone.

Design your retention policy based on your replay requirements, not just storage costs. For systems where projection rebuild is a real operational need, 30–90 days of retention is common. Use Kafka's tiered storage or log compaction for cost management while extending retention.
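A back-of-the-envelope sizing check helps when negotiating retention against storage cost: required broker storage is roughly events/day × average event size × retention days × replication factor. A sketch with illustrative numbers:

```python
def retention_bytes(events_per_day: int, avg_event_bytes: int,
                    retention_days: int, replication_factor: int = 3) -> int:
    """Rough broker-side storage needed to keep `retention_days` of history."""
    return events_per_day * avg_event_bytes * retention_days * replication_factor

# Example: 10M events/day at ~1 KiB each, 90 days, replication factor 3.
needed = retention_bytes(10_000_000, 1024, 90, 3)
print(f"{needed / 2**40:.2f} TiB")
```

If the result dwarfs the hot-storage budget, that is the signal to reach for tiered storage or a separate archive rather than shortening retention below the replay window.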

Key Takeaways

  • Replay is a first-class operational tool for bug recovery, new consumer bootstrap, and projection rebuilds—not a last resort.
  • Side effects (notifications, external API calls) must be suppressed or made idempotent during replay; use explicit CONSUMER_MODE flags.
  • Reset consumer group offsets only after stopping the consumer group, and always use a separate consumer group for rebuilds to avoid affecting live offsets.
  • Projection rebuilds should write to a shadow table and promote atomically after catching up, not overwrite the live table in place.
  • Throttle replay consumers to avoid starving live consumers or overwhelming downstream databases.
  • Retention policy must account for replay requirements—events you cannot replay from the broker are gone unless you archive them separately.