Designing Webhooks People Don't Hate

Ravinder·July 15, 2025·10 min read

Architecture · Webhooks · API Design · Reliability

I have debugged other people's webhook implementations more times than I care to count. The failure mode is almost always the same: a webhook system that was designed as a simple HTTP POST — because that is all webhooks are, conceptually — without thinking through the operational consequences of that POST being unreliable, potentially duplicated, occasionally forged, and consumed by a server behind a firewall that the sender cannot reach half the time.

The consequence of getting webhooks wrong is a specific kind of user frustration: they set up the integration, it works in testing, it works for a week in production, and then it silently stops. Events are missed. Orders are not fulfilled. Alerts are not fired. The customer files a support ticket three days later when they notice the downstream system is out of sync. Nobody wins.

Good webhook design is not complicated. It has about eight properties. Most systems implement two or three of them.

The Baseline: What a Webhook Actually Is

A webhook is an HTTP POST that your system sends to a URL provided by the consumer, triggered by an event in your system. The payload describes the event. The consumer's server receives the POST and processes it.

That simplicity is the trap. The simplicity of the mechanism leads teams to underestimate the operational requirements. A REST API has a clear request/response cycle. If the request fails, the caller retries. The caller is you. You control the retry logic, the backoff, the circuit breaker.

With webhooks, you are still the caller, but the server on the other end belongs to someone else. The consumer decides how and when it responds; you own the retries. Your system must handle delivery reliability, because the consumer's server is not under your control and is routinely unreliable.

sequenceDiagram
    participant YourSystem as Your System
    participant Queue as Event Queue
    participant Dispatcher as Webhook Dispatcher
    participant Consumer as Consumer Server
    YourSystem->>Queue: Publish event
    Queue->>Dispatcher: Dequeue event
    Dispatcher->>Consumer: POST /webhook (attempt 1)
    Consumer-->>Dispatcher: 500 Internal Server Error
    Dispatcher->>Dispatcher: Backoff (30s)
    Dispatcher->>Consumer: POST /webhook (attempt 2)
    Consumer-->>Dispatcher: 200 OK
    Dispatcher->>Queue: Acknowledge event

Property One: HMAC Signing

Every webhook request you send must be signed. Without signing, your consumer cannot distinguish a legitimate event from a malicious POST sent by anyone who knows their webhook URL. Webhook URLs are not secret — they appear in browser history, server logs, and support tickets.

The standard approach is HMAC-SHA256. You and the consumer share a secret (generated at webhook registration, not reusable across endpoints). For each request, you compute:

import hashlib
import hmac
import json
import time
 
def sign_payload(secret: str, payload: dict) -> tuple[str, dict]:
    """Serialize and sign a webhook payload; return (body, headers).

    Send the returned body string unchanged: the signature covers these exact bytes.
    """
    body = json.dumps(payload, separators=(',', ':'), sort_keys=True)
    timestamp = str(int(time.time()))

    # Include timestamp in signature to prevent replay attacks
    signed_content = f"{timestamp}.{body}"
    signature = hmac.new(
        secret.encode(),
        signed_content.encode(),
        hashlib.sha256,
    ).hexdigest()

    headers = {
        "X-Webhook-Timestamp": timestamp,
        "X-Webhook-Signature": f"sha256={signature}",
        "Content-Type": "application/json",
    }
    return body, headers
 
def verify_signature(secret: str, body: str, timestamp: str, signature: str) -> bool:
    """Consumer-side verification."""
    # Reject malformed timestamps and events older than 5 minutes (replay protection)
    try:
        age = abs(time.time() - int(timestamp))
    except ValueError:
        return False
    if age > 300:
        return False
 
    signed_content = f"{timestamp}.{body}"
    expected = hmac.new(
        secret.encode(),
        signed_content.encode(),
        hashlib.sha256,
    ).hexdigest()
 
    return hmac.compare_digest(f"sha256={expected}", signature)

Three things to get right:

Include the timestamp in the signed content. This prevents replay attacks where an attacker captures a legitimate signed request and re-sends it later. Reject signatures with timestamps more than 5 minutes old.
Use hmac.compare_digest for comparison. This prevents timing attacks that could leak the expected signature bit by bit.
Use a stable JSON serialization. Sorted keys, no whitespace changes between signing and verification. Send the raw body to the consumer and document that they must verify against the raw bytes, not a parsed and re-serialized version.
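
To make the third point concrete, here is a minimal sketch of what happens when a consumer verifies a re-serialized body instead of the raw bytes. The secret, timestamp, payload, and the helper name `sign` are hypothetical; the signed content follows the `timestamp.body` layout used above.

```python
import hashlib
import hmac
import json

def sign(secret: str, timestamp: str, raw_body: bytes) -> str:
    """Sign "<timestamp>.<raw body>" with HMAC-SHA256 (hypothetical helper)."""
    signed_content = timestamp.encode() + b"." + raw_body
    digest = hmac.new(secret.encode(), signed_content, hashlib.sha256).hexdigest()
    return f"sha256={digest}"

secret = "whsec_example"      # hypothetical per-endpoint secret
timestamp = "1752575025"      # value of the timestamp header, as received
raw_body = b'{"b":1,"a":2}'   # bytes exactly as they arrived on the wire

header_signature = sign(secret, timestamp, raw_body)

# Verifying against the raw bytes succeeds.
raw_ok = hmac.compare_digest(sign(secret, timestamp, raw_body), header_signature)

# Parsing and re-serializing changes the bytes (default separators add spaces),
# so the signature no longer matches even though the JSON is equivalent.
reserialized = json.dumps(json.loads(raw_body)).encode()
reserialized_ok = hmac.compare_digest(
    sign(secret, timestamp, reserialized), header_signature
)
```

In most web frameworks this means reading the request body as bytes before any JSON middleware touches it.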

Property Two: Idempotency

Your dispatcher will retry failed deliveries. This means the consumer will sometimes receive the same event more than once — either because the consumer's server returned a 5xx but actually processed the request before the error, or because the dispatcher crashed mid-delivery.

Every webhook event must have a stable, unique event_id. The consumer uses this to deduplicate.

{
  "event_id": "evt_01J2HK8M3B4N5P6Q7R8S9T",
  "event_type": "order.completed",
  "created_at": "2025-07-15T10:23:45Z",
  "data": {
    "order_id": "ord_8827364",
    "total": 9999,
    "currency": "USD"
  }
}

Document explicitly in your webhook documentation that event_id is the idempotency key and that consumers should deduplicate. Provide a code example. Do not assume consumers will infer this.

Consumer-side deduplication is straightforward with Redis:

import redis
 
r = redis.Redis()
 
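def handle_webhook(event: dict) -> None:
    event_id = event["event_id"]
    key = f"webhook:processed:{event_id}"

    # SET NX with 7-day TTL: only the first delivery of an event_id claims the key
    if not r.set(key, "1", nx=True, ex=604800):
        # Already processed: return 200 without reprocessing
        return

    try:
        process_event(event)  # your application logic
    except Exception:
        # Release the claim so the sender's retry is not treated as a duplicate
        r.delete(key)
        raise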

The 7-day TTL covers the realistic maximum delivery window for your retry schedule.

Property Three: Retry and Backoff

The retry schedule is where most webhook systems either abandon events too early or hammer consumer servers with rapid retries during legitimate outages.

Use exponential backoff with jitter. A sensible schedule:

Attempt 1:  Immediate
Attempt 2:  30 seconds
Attempt 3:  5 minutes
Attempt 4:  30 minutes
Attempt 5:  2 hours
Attempt 6:  8 hours
Attempt 7:  24 hours
→ Dead letter queue

Total window: just over 34 hours, counting each delay from the previous failed attempt. This covers most server outages, deployments, and certificate renewals. It does not cover multi-day infrastructure incidents, which should be handled via your DLQ replay mechanism.
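
The total can be checked with a few lines of arithmetic. This sketch assumes each delay in the schedule is measured from the previous attempt rather than from the first one.

```python
# Delays before attempts 2 through 7, in seconds, from the schedule above.
delays = [30, 5 * 60, 30 * 60, 2 * 3600, 8 * 3600, 24 * 3600]

total_seconds = sum(delays)
total_hours = total_seconds / 3600

print(f"total: {total_seconds} s = {total_hours:.1f} h")  # total: 124530 s = 34.6 h
```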

Jitter prevents thundering herd. If 10,000 webhooks all failed at the same moment (during a consumer outage) and you retry them all at the exact same backoff intervals, you will hammer the consumer server with 10,000 simultaneous retries. Add ±20% random jitter to each backoff interval.

import random
 
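def backoff_seconds(attempt: int) -> float:
    """Delay before the next try; `attempt` is the number of attempts already made."""
    base_delays = [0, 30, 300, 1800, 7200, 28800, 86400]
    # Clamp to the last entry so an out-of-range attempt count cannot raise IndexError
    base = base_delays[min(attempt, len(base_delays) - 1)]
    jitter = base * 0.2 * (random.random() * 2 - 1)  # ±20%
    # Floor of one second, so even the "immediate" first attempt yields briefly
    return max(1, base + jitter)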

Only retry on 5xx status codes and connection failures. A 4xx means the consumer actively rejected the payload — retrying is futile and disrespectful of the consumer's intent. Document this behavior.
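
The rule above fits in a small predicate. This is a sketch, not a complete policy: the name `should_retry` is hypothetical, `None` stands for "no HTTP response arrived", and treating 3xx as final is an assumption the text does not cover.

```python
from typing import Optional

def should_retry(status: Optional[int]) -> bool:
    """Return True if a failed delivery attempt should be retried.

    `status` is the HTTP status code, or None when no response arrived
    (connection refused, DNS failure, timeout).
    """
    if status is None:
        return True               # connection-level failure
    return 500 <= status <= 599   # server errors only; 2xx, 3xx, and 4xx are final
```

Keeping the decision in one function makes it easy to document the behavior and to test it in isolation.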

Property Four: The Dead-Letter Surface

Events that exhaust all retry attempts must not silently disappear. They go to a dead-letter queue (DLQ). But a DLQ that operators cannot see or replay is nearly useless.

Your webhook management UI must expose:

A list of dead-lettered events with the event type, created timestamp, and last failure reason.
The full payload of each dead-lettered event.
A replay button that re-attempts delivery immediately.
A bulk replay capability for replaying all DLQ events for a given endpoint.
Exportable event payloads as JSON, so consumers can process them manually if the endpoint cannot be restored.

flowchart TD
    DLQ[Dead Letter Queue] --> UI[Webhook Dashboard]
    UI --> Inspect[Inspect Payload]
    UI --> Replay[Single Replay]
    UI --> BulkReplay[Bulk Replay]
    UI --> Export[Export JSON]
    Replay --> Dispatcher[Dispatcher]
    BulkReplay --> Dispatcher
    Dispatcher --> Consumer[Consumer Endpoint]
    style DLQ fill:#dc2626,color:#fff
    style UI fill:#4f46e5,color:#fff
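
The operations behind that dashboard can be sketched with an in-memory store. Everything here is illustrative: `DeadLetter`, `DeadLetterStore`, and the `deliver` callback are hypothetical names, and a real system would back this with a durable table or queue.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable

@dataclass
class DeadLetter:
    event_id: str
    event_type: str
    created_at: str
    endpoint_id: str
    last_failure: str
    payload: dict

class DeadLetterStore:
    """In-memory stand-in for a real DLQ (hypothetical)."""

    def __init__(self) -> None:
        self._items: dict[str, DeadLetter] = {}

    def add(self, item: DeadLetter) -> None:
        self._items[item.event_id] = item

    def list_for(self, endpoint_id: str) -> list[DeadLetter]:
        """List dead-lettered events for one endpoint."""
        return [i for i in self._items.values() if i.endpoint_id == endpoint_id]

    def replay(self, event_id: str, deliver: Callable[[DeadLetter], bool]) -> bool:
        """Re-attempt one event; remove it from the DLQ only on success."""
        item = self._items[event_id]
        if deliver(item):
            del self._items[event_id]
            return True
        return False

    def bulk_replay(self, endpoint_id: str, deliver: Callable[[DeadLetter], bool]) -> int:
        """Replay every dead-lettered event for an endpoint; return the success count."""
        return sum(self.replay(i.event_id, deliver) for i in self.list_for(endpoint_id))

    def export_json(self, endpoint_id: str) -> str:
        """Export full payloads so consumers can process them manually."""
        return json.dumps([asdict(i) for i in self.list_for(endpoint_id)], indent=2)
```

Events that fail again on replay stay in the store, so a bulk replay against a still-broken endpoint loses nothing.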

The DLQ replay UI is not a nice-to-have. It is the feature your enterprise customers will ask about in security reviews and will use during incidents. Customers whose server was down for a weekend need to be able to recover all missed events without filing a support ticket.

Property Five: Event Schema Stability

Webhook consumers build integrations that run in production for years. Breaking the payload schema breaks production integrations.

Treat your webhook event schema with the same stability guarantees as your versioned API. Rules:

Additive changes only. New fields can be added to existing event types. Existing fields must not change type, name, or semantics.
New event types are non-breaking. Consumers must silently ignore event types they do not handle. Document this explicitly.
Envelope schema is frozen. The top-level fields (event_id, event_type, created_at, data) must not change shape. Only data evolves.
Breaking changes require a new event version. If you must break a field, introduce order.completed.v2 and sunset order.completed with a 6-month notice period.
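
On the consumer side, these rules translate into a tolerant dispatcher. The sketch below is one possible shape; `HANDLERS`, `dispatch`, and `on_order_completed` are hypothetical names, and the handler only records the order ID for illustration.

```python
from typing import Callable

handled: list[str] = []

def on_order_completed(data: dict) -> None:
    # Read only the fields this integration needs; added fields are ignored.
    handled.append(data["order_id"])

HANDLERS: dict[str, Callable[[dict], None]] = {
    "order.completed": on_order_completed,
}

def dispatch(event: dict) -> bool:
    """Route an event to its handler; return False for unknown event types."""
    handler = HANDLERS.get(event["event_type"])
    if handler is None:
        return False   # unknown type: acknowledge with a 2xx and move on
    handler(event["data"])
    return True
```

A consumer written this way keeps working when a new field appears in `data` or when a new event type starts arriving.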

Property Six: Delivery Guarantees in Documentation

Consumers need to know what guarantees you provide, in plain language, in your documentation. Publish a page titled "Webhook Reliability" that answers these questions:

At-least-once or exactly-once? Be honest. At-least-once with idempotency keys is almost always the right answer. Exactly-once is practically impossible across a network.
Maximum delivery window. How long will you retry before dead-lettering? (34 hours, in my schedule above.)
Maximum event age preserved. If an endpoint is disabled and re-enabled, can historical events be replayed? How far back?
Payload size limits. What is the maximum POST body size? What happens to events that exceed it?
Concurrent delivery. Will you send multiple events in parallel to the same endpoint? What is the maximum concurrency?

The concurrent delivery question matters for consumers who process events that modify shared state. If you send events in parallel without documenting it, consumers who assume sequential delivery will have race conditions.
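
One way a sender might enforce a documented concurrency limit is a per-endpoint semaphore. This is a minimal sketch using asyncio; `EndpointLimiter`, its `max_concurrency` parameter, and the `send` callable are hypothetical names chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable

class EndpointLimiter:
    """Caps the number of in-flight deliveries per endpoint (illustrative)."""

    def __init__(self, max_concurrency: int) -> None:
        self._max = max_concurrency
        self._semaphores: dict[str, asyncio.Semaphore] = {}

    async def deliver(
        self,
        endpoint: str,
        event: dict,
        send: Callable[[str, dict], Awaitable[None]],
    ) -> None:
        # Create the endpoint's semaphore lazily, on first use.
        sem = self._semaphores.get(endpoint)
        if sem is None:
            sem = self._semaphores[endpoint] = asyncio.Semaphore(self._max)
        # At most `max_concurrency` sends run at once for this endpoint.
        async with sem:
            await send(endpoint, event)
```

Setting `max_concurrency` to 1 serializes deliveries to an endpoint, which is the simplest behavior to document for consumers that modify shared state.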

Property Seven: Test Event Capability

Every webhook endpoint configuration UI must have a "Send Test Event" button. It sends a synthetic event with clearly marked test data to the configured URL.

{
  "event_id": "evt_TEST_01J2HK8M3B4N5P6Q7R8S",
  "event_type": "webhook.test",
  "created_at": "2025-07-15T10:23:45Z",
  "test": true,
  "data": {
    "message": "This is a test webhook delivery from ExampleCorp."
  }
}

The test event should go through your real delivery pipeline — real signing, real retry logic, real delivery logs. If you implement a fake code path for test events, the test will not catch configuration problems in the real path.
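
A sketch of that idea: the test button builds a synthetic event in the normal envelope and hands it to the same entry point real events use. The `enqueue` callable stands in for that entry point and is an assumption, as are the function names.

```python
import time
import uuid
from typing import Callable

def build_test_event(sender_name: str) -> dict:
    """Build a synthetic event that uses the normal envelope."""
    return {
        "event_id": f"evt_TEST_{uuid.uuid4().hex}",
        "event_type": "webhook.test",
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "test": True,
        "data": {"message": f"This is a test webhook delivery from {sender_name}."},
    }

def send_test_event(
    enqueue: Callable[[str, dict], None], endpoint_id: str, sender_name: str
) -> str:
    """Queue a test event on the real pipeline and return its event_id."""
    event = build_test_event(sender_name)
    enqueue(endpoint_id, event)  # same queue as production events (assumed entry point)
    return event["event_id"]
```

Because nothing downstream special-cases the test event, signing, retries, and delivery logs behave exactly as they would for a real one.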

Property Eight: Delivery Logs

Expose a delivery log per endpoint: every attempt, the HTTP status code returned, the response body (truncated), and the timestamp. Consumers debugging integration failures need this. Without logs, every support ticket becomes a multi-day email thread.

Retention: 7 days of delivery logs is a minimum. 30 days is significantly better.

Log schema:

{
  "attempt_id": "atm_01J2HK9M3C4N5P6Q7R8S9T",
  "event_id": "evt_01J2HK8M3B4N5P6Q7R8S9T",
  "endpoint_url": "https://consumer.example.com/webhooks",
  "attempted_at": "2025-07-15T10:24:15Z",
  "duration_ms": 342,
  "http_status": 200,
  "response_body": "{\"received\":true}",
  "outcome": "success"
}
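
A record in this shape could be produced as follows. The sketch is illustrative: the 1024-character truncation limit and the "failure" outcome label are assumptions, and `record_attempt` is a hypothetical helper.

```python
from dataclasses import asdict, dataclass
from typing import Optional

MAX_BODY_CHARS = 1024  # hypothetical truncation limit

@dataclass
class DeliveryAttempt:
    attempt_id: str
    event_id: str
    endpoint_url: str
    attempted_at: str
    duration_ms: int
    http_status: Optional[int]  # None when no response was received
    response_body: str
    outcome: str

def record_attempt(
    attempt_id: str,
    event_id: str,
    endpoint_url: str,
    attempted_at: str,
    duration_ms: int,
    http_status: Optional[int],
    response_body: str,
) -> dict:
    """Build one log record, truncating the body and deriving the outcome."""
    ok = http_status is not None and 200 <= http_status < 300
    attempt = DeliveryAttempt(
        attempt_id=attempt_id,
        event_id=event_id,
        endpoint_url=endpoint_url,
        attempted_at=attempted_at,
        duration_ms=duration_ms,
        http_status=http_status,
        response_body=response_body[:MAX_BODY_CHARS],
        outcome="success" if ok else "failure",  # "failure" label is an assumption
    )
    return asdict(attempt)
```

Truncating at write time keeps a misbehaving endpoint that returns large error pages from inflating log storage.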

The Customer-Facing Documentation Checklist

Webhook documentation has a standard set of sections that teams either omit or under-specify:

Getting started: How to register an endpoint. What URL format is required. HTTPS requirements.
Security: How to verify the X-Webhook-Signature header. Code examples in every SDK language.
Event types: Full catalog of event types with example payloads for each.
Event schema: Documented stability guarantees, versioning policy, and the envelope format.
Reliability: At-least-once guarantee, retry schedule, DLQ behavior, and replay instructions.
Idempotency: Explanation of event_id, how to use it for deduplication, and example deduplication code.
Testing: How to use the test event feature and how to use your sandbox environment.
Troubleshooting: How to read the delivery log, common failure codes and their causes, how to re-enable a disabled endpoint.

The security section deserves special attention. Provide copy-paste verification code in Python, Node.js, Ruby, Go, and PHP at minimum. Consumers who get the verification wrong will disable verification entirely, which defeats the purpose.

Key Takeaways

HMAC-SHA256 signing with a per-endpoint secret and a timestamp in the signed content is the minimum security baseline. Without it, anyone who learns a webhook URL can forge events.
Idempotency keys (event_id) must be documented and example deduplication code must be provided; at-least-once delivery is a guarantee, not a bug.
Exponential backoff with ±20% jitter and a 34-hour total retry window covers most real-world consumer outages without overwhelming recovering servers.
Dead-letter queues are only useful when paired with a management UI that lets consumers inspect, replay, and export failed events without filing support tickets.
Webhook event schemas deserve the same stability guarantees as versioned APIs: additive changes only, breaking changes via new event type versions with a deprecation notice.
The delivery log is the debugging tool your consumers will use every time something goes wrong; capture the status and response of every attempt, and retain logs for at least 7 days, ideally 30.