Observability in Depth

Alert Design

Ravinder · 6 min read
Observability · Telemetry · Alerting · SRE · On-Call

Alert fatigue is not a tooling problem. It is a design problem. Every false positive that wakes someone at 3 AM trains the team to ignore alerts. Within six months, the on-call rotation is either dismissing pages reflexively or turning off notifications — at which point your entire observability investment is worthless when it matters most.

Good alert design starts with a single question for every alert: is a human the right responder, and is right now the right time? If the answer is no, the alert should either be a ticket, a dashboard annotation, or simply removed.

The Alert Taxonomy

flowchart TD
    Signal["Anomaly Detected"] --> Q1{User impacted\nor will be?}
    Q1 -- Yes --> Q2{Requires human\naction immediately?}
    Q1 -- No --> NoPager["Log / metric / dashboard\nNo alert"]
    Q2 -- Yes --> Page["Page\n(PagerDuty / OpsGenie)"]
    Q2 -- No --> Q3{Requires action\nwithin hours?}
    Q3 -- Yes --> Ticket["Create ticket\n(Jira / Linear)"]
    Q3 -- No --> Review["Weekly SLO review\nitem"]
    style Page fill:#e74c3c,color:#fff
    style Ticket fill:#f39c12,color:#000
    style NoPager fill:#2ecc71,color:#000

Most alerts belong in the bottom two categories. The discipline of alert design is being honest about which category each alert actually belongs in.

The Most Common Alert Mistakes

Mistake 1: Alerting on symptoms plus causes

If you alert on "high CPU" AND "high error rate" AND "high latency," and all three fire together in an incident, you get three pages for one problem. Alert on user-facing symptoms (error rate, latency). Causative signals (CPU, memory, saturation) are for debugging dashboards, not pager policies.
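A symptom-first version of such an alert might look like the following sketch (the http_request_duration_seconds histogram and the 300 ms threshold are illustrative assumptions, not names from this post):

- alert: CheckoutLatencyHigh
  expr: |
    histogram_quantile(
      0.99,
      sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout-api"}[5m]))
    ) > 0.3
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Checkout p99 latency above 300ms for 10m"
    runbook_url: "https://runbooks.internal/checkout-api/high-latency"

Note what is absent: no CPU, no memory. Those signals belong on the dashboard the runbook links to, so the responder sees one page and a full debugging context, not three pages for one incident.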

Mistake 2: Alert without a runbook

An alert that fires without a documented response is a mystery box at 3 AM. Every production alert must link to a runbook that answers: what does this mean, how do I confirm it, what are my remediation steps?

# Every Prometheus alert should have a runbook_url
annotations:
  summary: "Checkout API error rate elevated: {{ $value | humanizePercentage }}"
  runbook_url: "https://runbooks.internal/checkout-api/high-error-rate"

Mistake 3: for: duration too short

# BAD: fires on a 30-second spike
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[1m]) > 0.05
  for: 1m   # fires after 1 minute of being above threshold
 
# BETTER: confirms signal is sustained
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
  for: 5m

Threshold Alerting: When It Is Appropriate

Pure threshold alerts are appropriate for categorical, non-statistical conditions:

  • TLS certificate expires in < 14 days (deterministic, needs human action)
  • Disk > 90% full (linear growth, time-to-action is clear)
  • Deployment replica count drops below minimum (binary condition)
groups:
  - name: infrastructure_thresholds
    rules:
      - alert: DiskFillingSoon
        expr: |
          predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} filling: projected full in < 4h"
 
      - alert: DeploymentUnderReplicated
        expr: |
          kube_deployment_status_replicas_available
          < kube_deployment_spec_replicas * 0.75
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.deployment }} running at <75% of desired replicas"
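The first bullet, certificate expiry, follows the same pattern. A sketch assuming the probe_ssl_earliest_cert_expiry metric exported by blackbox_exporter:

      - alert: TLSCertExpiringSoon
        expr: |
          probe_ssl_earliest_cert_expiry - time() < 14 * 86400
        for: 1h
        labels:
          severity: ticket
        annotations:
          summary: "TLS cert for {{ $labels.instance }} expires in under 14 days"

Deterministic and human-actionable, but never a page: certificate renewal is a daytime task, which is exactly why it routes to a ticket.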

Burn-Rate Alerting: The Standard for SLO-Based Alerts

For anything user-facing, burn-rate alerts are the correct model (introduced in Post 8). They detect the right condition at the right time:

Burn rate    Time to exhaust 30-day budget    Alert tier
14.4×        ~50 hours                        Page now
6×           ~5 days                          Ticket
3×           ~10 days                         Weekly review
≤1×          30+ days (on track)              No alert
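The job:slo_burn_rate:* series referenced below are assumed to be recording rules: the error ratio over a window, divided by the SLO's error budget. A sketch for a 99.9% availability SLO (the http_requests_total selector is an assumption; substitute your own SLI):

groups:
  - name: slo_burn_recording
    rules:
      - record: job:slo_burn_rate:1h
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum by (job) (rate(http_requests_total[1h]))
          )
          / 0.001   # 0.1% error budget for a 99.9% SLO
      # repeat with [5m], [30m], [6h], and [24h] windows for the other rules

A burn rate of 1 means the service is consuming its budget exactly at the rate that exhausts it on day 30; 14.4 means the budget is gone in roughly 50 hours.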

The full multi-window, multi-severity model covering different consumption rates (critical and warning each pair a long window with a short one; the slow burn is a single-window ticket alert):

groups:
  - name: slo_burn_checkout
    rules:
      # Severity: critical — 2% budget in 1h window
      - alert: SLOBurnCritical
        expr: |
          (
            job:slo_burn_rate:1h{job="checkout-api"} > 14.4
            and
            job:slo_burn_rate:5m{job="checkout-api"} > 14.4
          )
        for: 2m
        labels:
          severity: critical
          team: checkout
        annotations:
          summary: "Checkout SLO critical: {{ $value | humanize }}x burn rate"
          runbook_url: "https://runbooks.internal/slo-critical"
 
      # Severity: warning — 5% budget in 6h window
      - alert: SLOBurnWarning
        expr: |
          (
            job:slo_burn_rate:6h{job="checkout-api"} > 6
            and
            job:slo_burn_rate:30m{job="checkout-api"} > 6
          )
        for: 15m
        labels:
          severity: warning
          team: checkout
        annotations:
          summary: "Checkout SLO warning: {{ $value | humanize }}x burn rate"
 
      # Severity: info — ticket-worthy slow burn
      - alert: SLOBurnSlow
        expr: |
          job:slo_burn_rate:24h{job="checkout-api"} > 3
        for: 1h
        labels:
          severity: info
          team: checkout
        annotations:
          summary: "Checkout SLO: sustained 3x+ burn over 24h — create reliability ticket"

Alertmanager Routing: Getting the Right Alert to the Right Person

Not every alert should hit the same destination. Route by severity and ownership:

# alertmanager.yaml
route:
  group_by: ["alertname", "job"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default-ticket"
 
  routes:
    # Critical: page on-call immediately
    - match:
        severity: critical
      receiver: "pagerduty-checkout"
      continue: false
 
    # Warning: Slack alert, no page
    - match:
        severity: warning
      receiver: "slack-checkout-team"
      continue: false
 
    # Info: create Jira ticket via webhook
    - match:
        severity: info
      receiver: "jira-webhook"
 
receivers:
  - name: "pagerduty-checkout"
    pagerduty_configs:
      - service_key: "${PAGERDUTY_KEY}"
        description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"
 
  - name: "slack-checkout-team"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK}"
        channel: "#checkout-alerts"
        text: "{{ .CommonAnnotations.summary }}\n{{ .CommonAnnotations.runbook_url }}"
 
  - name: "jira-webhook"
    webhook_configs:
      - url: "http://jira-integration:8080/webhook"

Alert Quality Metrics

Track the health of your alert system the same way you track the health of your services:

# Alert noise ratio: share of alerts that auto-resolved within 15m (likely noise).
# Alertmanager does not expose per-alert durations natively; export an
# alert_duration_minutes metric from a webhook receiver, then:
count(alert_duration_minutes < 15)
/
count(alert_duration_minutes)
 
# MTTA: mean time to acknowledge (custom metric fed by the same webhook)
avg(alert_ack_duration_minutes)

Review these weekly. Noise ratio above 20% is a signal that alert conditions need adjustment.

Key Takeaways

  • Alert fatigue is a design problem, not a tooling problem — every alert must pass the test of "does this require a human now?"
  • Alert on user-facing symptoms (error rate, latency); use causative signals (CPU, disk, memory) for dashboards, not pagers.
  • Every production alert must link to a runbook — an alert without documented response is a mystery at 3 AM.
  • Threshold alerts are correct for categorical, deterministic conditions; burn-rate alerts are correct for SLO-based, statistical conditions.
  • Multi-window burn-rate alerts (short + long window) eliminate false positives from brief spikes while detecting sustained degradation.
  • Track alert noise ratio weekly — a ratio above 20% indicates alert conditions that are incorrectly specified and are degrading on-call trust.