SLOs That Drive Decisions
Most teams have SLOs on paper. Few have SLOs that change behavior. The difference is not the math — it is whether the error budget is connected to real decisions: which features ship this sprint, whether on-call gets a pager notification at 2 AM or an email Monday morning, and whether the business conversation is "we missed our SLO" or "here is what we stopped doing to protect reliability."
This post covers SLO mechanics that drive those decisions, starting from the SLI definition and ending with error budget policies that engineering leadership can act on.
The SLO Stack
The stack runs SLI → SLO → error budget → error budget policy: the SLI defines what you measure, the SLO sets the target, the error budget quantifies the allowed failure, and the policy maps budget levels to actions. Each layer is only as sound as the one beneath it; a misspecified SLI invalidates everything downstream.
Defining SLIs That Reflect User Experience
The cardinal rule: an SLI must measure what the user experiences, not what the infrastructure reports.
Bad SLI: "Server responds with non-5xx status." (A 200 with an empty body is counted as success.)
Good SLI: "Request returns a non-5xx status AND body contains required fields AND response time < 2s."
In Prometheus, a request-based availability SLI:

```promql
# SLI: fraction of requests that were "good"
# Good = non-5xx status AND latency < 2s
# Assumes the duration histogram carries a status label
# and has a 2.0s bucket boundary
sum(rate(http_request_duration_seconds_bucket{
  job="checkout-api",
  status!~"5..",
  le="2.0"
}[5m]))
/
sum(rate(http_request_duration_seconds_count{job="checkout-api"}[5m]))
```

Note that `http_requests_total` is a plain counter with no `le` label, so a combined status-and-latency SLI needs the duration histogram (assuming it carries a status label). The "body contains required fields" clause cannot be derived from status codes at all; it needs explicit instrumentation or a synthetic probe.

For pure latency SLIs, use the histogram directly:
```promql
# Latency SLI: fraction of requests completing within 500ms
sum(rate(http_request_duration_seconds_bucket{
  job="checkout-api",
  le="0.5"
}[5m]))
/
sum(rate(http_request_duration_seconds_count{job="checkout-api"}[5m]))
```

Error Budget Calculation
With a 30-day SLO at 99.9%:
- Total minutes in period: 30 × 24 × 60 = 43,200
- Error budget: (1 - 0.999) × 43,200 = 43.2 minutes
- This is how much failure the service can absorb before the SLO is breached
Track remaining budget as a Prometheus recording rule:

```yaml
# prometheus-slo-rules.yaml
groups:
  - name: slo_checkout
    interval: 60s
    rules:
      # 30-day error rate (sum by(job) keeps the job label
      # so the rule can be filtered on the dashboard below)
      - record: job:slo_error_rate:30d
        expr: |
          1 - (
            sum by(job) (increase(http_requests_total{job="checkout-api", status!~"5.."}[30d]))
            /
            sum by(job) (increase(http_requests_total{job="checkout-api"}[30d]))
          )
      # Remaining error budget (fraction of the 0.001 budget still unspent)
      - record: job:error_budget_remaining:30d
        expr: |
          1 - (job:slo_error_rate:30d / 0.001)
```

```promql
# Dashboard: error budget burn-down
100 * job:error_budget_remaining:30d{job="checkout-api"}
```
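For the same dashboard, it can help to show the budget in human units. A minimal sketch, assuming the recording rule above and the 43.2-minute figure from the arithmetic earlier; the conversion treats the request-based budget as wall-clock time, which only holds for roughly uniform traffic:

```promql
# Remaining budget expressed as minutes of full outage
# (43.2 = total 30-day budget in minutes at 99.9%;
#  assumes roughly uniform traffic, since the SLI is request-based)
43.2 * job:error_budget_remaining:30d{job="checkout-api"}
```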
Burn-Rate Alerts: Multi-Window, Multi-Burn

A threshold alert on error rate is both too noisy (it fires on brief spikes) and too slow (it misses sustained moderate degradation). The Google SRE workbook's approach uses burn rate: how fast you are consuming the error budget relative to how fast the SLO period elapses.
A burn rate of 1 means you are exactly on track to exhaust the budget at the end of the period. A burn rate of 14.4 means you will exhaust the 30-day budget in roughly 50 hours (30 days ÷ 14.4 ≈ 2.1 days), which justifies a page.
```promql
# 1-hour burn rate: how much budget consumed vs. how much time elapsed
(
  1 - sum(rate(http_requests_total{job="checkout-api", status!~"5.."}[1h]))
  / sum(rate(http_requests_total{job="checkout-api"}[1h]))
) / 0.001
```
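The same quantity inverts into a time-to-exhaustion estimate. A quick sketch, assuming the `job:slo_burn_rate:1h` recording rule defined below; it gives the hours needed to burn the entire 30-day budget at the current rate, ignoring budget already spent:

```promql
# Hours to burn the full 30-day (720h) budget at the current 1h rate
720 / job:slo_burn_rate:1h{job="checkout-api"}
```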
Multi-window burn-rate alert (the standard approach):

```yaml
groups:
  - name: slo_burn_checkout
    rules:
      # Page: fast burn — 2% budget in 1h from short AND long window
      - alert: CheckoutSLOCritical
        expr: |
          (
            job:slo_burn_rate:1h{job="checkout-api"} > 14.4
            and
            job:slo_burn_rate:5m{job="checkout-api"} > 14.4
          )
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Checkout SLO: critical burn rate {{ $value }}x"
          description: "At current rate, 30-day error budget exhausted in < 50 hours"
      # Ticket: slow burn — 5% budget in 6h from short AND long window
      - alert: CheckoutSLOWarning
        expr: |
          (
            job:slo_burn_rate:6h{job="checkout-api"} > 6
            and
            job:slo_burn_rate:30m{job="checkout-api"} > 6
          )
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Checkout SLO: elevated burn rate {{ $value }}x"
```

The dual-window condition suppresses false positives: the long window confirms that meaningful budget has actually burned, while the short window confirms the burn is still happening, so the alert also resolves quickly once the problem stops.
Burn-Rate Recording Rules
Pre-compute the burn rates to avoid expensive ad-hoc queries:
```yaml
groups:
  - name: burn_rate_precompute
    interval: 30s
    rules:
      # 0.001 = 1 - 0.999; every job here shares the 99.9% target
      - record: job:slo_burn_rate:5m
        expr: |
          (1 - sum by(job)(rate(http_requests_total{status!~"5.."}[5m]))
          / sum by(job)(rate(http_requests_total[5m])))
          / 0.001
      - record: job:slo_burn_rate:30m
        expr: |
          (1 - sum by(job)(rate(http_requests_total{status!~"5.."}[30m]))
          / sum by(job)(rate(http_requests_total[30m])))
          / 0.001
      - record: job:slo_burn_rate:1h
        expr: |
          (1 - sum by(job)(rate(http_requests_total{status!~"5.."}[1h]))
          / sum by(job)(rate(http_requests_total[1h])))
          / 0.001
      - record: job:slo_burn_rate:6h
        expr: |
          (1 - sum by(job)(rate(http_requests_total{status!~"5.."}[6h]))
          / sum by(job)(rate(http_requests_total[6h])))
          / 0.001
```
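The SRE workbook also describes a third, slow-burn tier: 10% of the budget over 3 days, i.e. a burn rate of 1, confirmed by a 6h short window. A sketch of what that tier could look like here, assuming the same 99.9% target (the group and alert names are illustrative):

```yaml
groups:
  - name: slo_burn_checkout_slow
    rules:
      # 3-day burn rate, same shape as the shorter windows above
      - record: job:slo_burn_rate:3d
        expr: |
          (1 - sum by(job)(rate(http_requests_total{status!~"5.."}[3d]))
          / sum by(job)(rate(http_requests_total[3d])))
          / 0.001
      # Ticket: 10% of budget in 3d (burn rate 1), confirmed by the 6h window
      - alert: CheckoutSLOSlowBurn
        expr: |
          (
            job:slo_burn_rate:3d{job="checkout-api"} > 1
            and
            job:slo_burn_rate:6h{job="checkout-api"} > 1
          )
        for: 1h
        labels:
          severity: ticket
        annotations:
          summary: "Checkout SLO: slow burn rate {{ $value }}x"
```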
Error Budget Policy: Connecting SLOs to Decisions

An SLO without a policy is decoration. Define what happens at each budget level:
| Budget remaining | Action |
|---|---|
| > 50% | Normal feature velocity |
| 25–50% | Reliability work enters the sprint; new feature work requires review |
| 10–25% | Feature freeze; focus on reliability improvements |
| < 10% | Incident mode; no feature work until budget recovers |
| Exhausted | Executive escalation; customer notification if external |
Document this policy in your runbook and make the budget remaining metric visible on your team dashboard. "We have 8.3 minutes of error budget left this month" is a concrete statement that non-technical stakeholders can act on.
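The policy thresholds can also be wired into alerting, so the transitions are noticed when they happen rather than discovered at sprint planning. A minimal sketch using the `job:error_budget_remaining:30d` recording rule from earlier (the alert name and routing are illustrative):

```yaml
groups:
  - name: error_budget_policy
    rules:
      # Policy trigger: below 25% remaining, the feature-freeze tier begins
      - alert: CheckoutErrorBudgetLow
        expr: job:error_budget_remaining:30d{job="checkout-api"} < 0.25
        for: 1h
        labels:
          severity: ticket
        annotations:
          summary: "Checkout error budget below 25%: feature freeze per policy"
```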
Key Takeaways
- SLIs must measure user experience, not infrastructure health — a 200 status with a broken body is not a good request.
- Error budget is the quantified consequence of the SLO target — it makes the abstract ("99.9% uptime") concrete ("43.2 minutes of allowed downtime per month").
- Burn-rate alerts outperform threshold alerts: they catch the sustained moderate degradation that threshold alerts miss, and they ignore the brief spikes that threshold alerts page on.
- Multi-window burn rate (a long window to confirm real budget consumption, a short window to confirm the burn is still active) is the standard approach from the Google SRE workbook.
- Pre-compute burn rates as recording rules — ad-hoc multi-window PromQL over long ranges is query-expensive.
- An error budget policy that specifies actions at each budget threshold is what turns an SLO from a metric into a decision-making tool.