Alert Design
Series: Observability in Depth

Alert fatigue is not a tooling problem. It is a design problem. Every false positive that wakes someone at 3 AM trains the team to ignore alerts. Within six months, the on-call rotation is either dismissing pages reflexively or turning off notifications — at which point your entire observability investment is worthless when it matters most.
Good alert design starts with a single question for every alert: is a human the right responder, and is right now the right time? If the answer is no, the alert should become a ticket or a dashboard annotation, or be removed entirely.
The Alert Taxonomy
That question yields four categories: page (a human must act now), ticket (a human must act soon), dashboard annotation (context for later debugging), and delete. Most alerts belong in the bottom two categories. The discipline of alert design is being honest about which category each alert actually belongs in.
The Most Common Alert Mistakes
Mistake 1: Alerting on symptoms plus causes
If you alert on "high CPU" AND "high error rate" AND "high latency," and all three fire together in an incident, you get three pages for one problem. Alert on user-facing symptoms (error rate, latency). Causative signals (CPU, memory, saturation) are for debugging dashboards, not pager policies.
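As a sketch of the symptom-first approach (the metric name http_request_duration_seconds_bucket and the 500ms threshold are illustrative, not prescribed by this post), a single page on p99 latency replaces the CPU and memory pages:

groups:
  - name: symptom_alerts
    rules:
      # Page on what users experience, not on resource pressure.
      - alert: HighCheckoutLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout-api"}[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Checkout p99 latency above 500ms for 5 minutes"
      # CPU, memory, and saturation panels belong on the debugging dashboard instead.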
Mistake 2: Alert without a runbook
An alert that fires without a documented response is a mystery box at 3 AM. Every production alert must link to a runbook that answers: what does this mean, how do I confirm it, what are my remediation steps?
# Every Prometheus alert should have a runbook_url
annotations:
  summary: "Checkout API error rate elevated: {{ $value | humanizePercentage }}"
  runbook_url: "https://runbooks.internal/checkout-api/high-error-rate"

Mistake 3: for: duration too short
# BAD: fires on a 30-second spike
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[1m]) > 0.05
  for: 1m  # fires after 1 minute of being above threshold

# BETTER: confirms the signal is sustained
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
  for: 5m

Threshold Alerting: When It Is Appropriate
Pure threshold alerts are appropriate for categorical, non-statistical conditions:
- TLS certificate expires in < 14 days (deterministic, needs human action)
- Disk > 90% full (linear growth, time-to-action is clear)
- Deployment replica count drops below minimum (binary condition)
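For the first item, the blackbox exporter's probe_ssl_earliest_cert_expiry metric exposes the certificate's expiry timestamp directly. A minimal sketch, assuming endpoints are already being probed:

groups:
  - name: certificate_thresholds
    rules:
      - alert: TLSCertExpiringSoon
        expr: |
          probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: info  # ticket-worthy, not page-worthy
        annotations:
          summary: "TLS cert for {{ $labels.instance }} expires in < 14 days"

The other two conditions translate the same way: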
groups:
  - name: infrastructure_thresholds
    rules:
      - alert: DiskFillingSoon
        expr: |
          predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} filling: projected full in < 4h"
      - alert: DeploymentUnderReplicated
        expr: |
          kube_deployment_status_replicas_available
            < kube_deployment_spec_replicas * 0.75
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.deployment }} running at <75% of desired replicas"

Burn-Rate Alerting: The Standard for SLO-Based Alerts
For anything user-facing, burn-rate alerts are the correct model (introduced in Post 8). They detect the right condition at the right time:
| Burn rate | Time to exhaust 30-day budget | Alert tier |
|---|---|---|
| 14.4× | ~50 hours | Page now |
| 6× | ~5 days | Ticket |
| 3× | ~10 days | Weekly review |
| ≤1× | On track | No alert |
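The 14.4× figure falls out of the budget math: spending 2% of a 30-day budget in one hour requires a rate of 0.02 × 720h / 1h = 14.4. The job:slo_burn_rate:* series referenced below are recording rules; here is a sketch for a 99.9% availability SLO, where the http_requests_total metric and the 0.001 budget are assumptions for illustration:

groups:
  - name: slo_burn_recording
    rules:
      # Burn rate = observed error ratio / error budget (0.1% for a 99.9% SLO)
      - record: job:slo_burn_rate:1h
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum by (job) (rate(http_requests_total[1h]))
          ) / 0.001
      # Define the same rule over 5m, 30m, 6h, and 24h windows
      # to produce the other job:slo_burn_rate:* series.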
The multi-window, multi-burn-rate model, here as three alerts across three severities covering different consumption rates:
groups:
  - name: slo_burn_checkout
    rules:
      # Severity: critical — 2% budget in 1h window
      - alert: SLOBurnCritical
        expr: |
          (
              job:slo_burn_rate:1h{job="checkout-api"} > 14.4
            and
              job:slo_burn_rate:5m{job="checkout-api"} > 14.4
          )
        for: 2m
        labels:
          severity: critical
          team: checkout
        annotations:
          summary: "Checkout SLO critical: {{ $value | humanize }}x burn rate"
          runbook_url: "https://runbooks.internal/slo-critical"

      # Severity: warning — 5% budget in 6h window
      - alert: SLOBurnWarning
        expr: |
          (
              job:slo_burn_rate:6h{job="checkout-api"} > 6
            and
              job:slo_burn_rate:30m{job="checkout-api"} > 6
          )
        for: 15m
        labels:
          severity: warning
          team: checkout
        annotations:
          summary: "Checkout SLO warning: {{ $value | humanize }}x burn rate"

      # Severity: info — ticket-worthy slow burn
      - alert: SLOBurnSlow
        expr: |
          job:slo_burn_rate:24h{job="checkout-api"} > 3
        for: 1h
        labels:
          severity: info
          team: checkout
        annotations:
          summary: "Checkout SLO: sustained 3x+ burn over 24h — create reliability ticket"

Alertmanager Routing: Getting the Right Alert to the Right Person
Not every alert should hit the same destination. Route by severity and ownership:
# alertmanager.yaml
route:
  group_by: ["alertname", "job"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "jira-webhook"  # fallback: anything unmatched becomes a ticket
  routes:
    # Critical: page on-call immediately
    - match:
        severity: critical
      receiver: "pagerduty-checkout"
      continue: false
    # Warning: Slack alert, no page
    - match:
        severity: warning
      receiver: "slack-checkout-team"
      continue: false
    # Info: create Jira ticket via webhook
    - match:
        severity: info
      receiver: "jira-webhook"

receivers:
  - name: "pagerduty-checkout"
    pagerduty_configs:
      - service_key: "${PAGERDUTY_KEY}"
        description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"
  - name: "slack-checkout-team"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK}"
        channel: "#checkout-alerts"
        text: "{{ .CommonAnnotations.summary }}\n{{ .CommonAnnotations.runbook_url }}"
  - name: "jira-webhook"
    webhook_configs:
      - url: "http://jira-integration:8080/webhook"

Alert Quality Metrics
Track the health of your alert system the same way you track the health of your services:
# Alert noise ratio: share of fired alerts that auto-resolved within 15m (likely noise).
# Alertmanager does not expose per-alert duration; this assumes a webhook receiver
# that records each alert's lifetime in a custom histogram, alert_duration_minutes.
sum(increase(alert_duration_minutes_bucket{le="15"}[7d]))
/
sum(increase(alert_duration_minutes_count[7d]))

# MTTA: mean time to acknowledge (custom metric recorded via the same webhook)
avg(alert_ack_duration_minutes)

Review these weekly. A noise ratio above 20% is a signal that alert conditions need adjustment.
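To make the weekly review cheap, the ratio can be precomputed as a recording rule — a sketch, reusing the assumed alert_duration_minutes histogram from above:

groups:
  - name: alert_quality
    rules:
      # Fraction of alerts over the past week that resolved within 15 minutes
      - record: alerts:noise_ratio:7d
        expr: |
          sum(increase(alert_duration_minutes_bucket{le="15"}[7d]))
          /
          sum(increase(alert_duration_minutes_count[7d]))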
Key Takeaways
- Alert fatigue is a design problem, not a tooling problem — every alert must pass the test of "does this require a human now?"
- Alert on user-facing symptoms (error rate, latency); use causative signals (CPU, disk, memory) for dashboards, not pagers.
- Every production alert must link to a runbook — an alert without a documented response is a mystery box at 3 AM.
- Threshold alerts are correct for categorical, deterministic conditions; burn-rate alerts are correct for SLO-based, statistical conditions.
- Multi-window burn-rate alerts (short + long window) eliminate false positives from brief spikes while detecting sustained degradation.
- Track alert noise ratio weekly — a ratio above 20% indicates alert conditions that are incorrectly specified and are degrading on-call trust.