Observability in Depth

SLOs That Drive Decisions

Ravinder · 5 min read
Observability · Telemetry · SLO · SRE · Error Budget

Most teams have SLOs on paper. Few have SLOs that change behavior. The difference is not the math — it is whether the error budget is connected to real decisions: which features ship this sprint, whether on-call gets a pager notification at 2 AM or an email Monday morning, and whether the business conversation is "we missed our SLO" or "here is what we stopped doing to protect reliability."

This post covers SLO mechanics that drive those decisions, starting from the SLI definition and ending with error budget policies that engineering leadership can act on.

The SLO Stack

flowchart TD
    SLI["SLI — What we measure\n(good requests / total requests)"]
    SLO["SLO — Our target\n(99.9% over 30 days)"]
    EB["Error Budget — Allowed failures\n(43.2 min/month of downtime)"]
    EBP["Error Budget Policy — What we do\n(freeze features at 50% burn)"]
    Alert["Burn-Rate Alert — When we act\n(multi-window detection)"]
    SLI --> SLO
    SLO --> EB
    EB --> EBP
    EB --> Alert

Every layer depends on the one above it being correct. A misspecified SLI invalidates everything downstream.

Defining SLIs That Reflect User Experience

The cardinal rule: an SLI must measure what the user experiences, not what the infrastructure reports.

Bad SLI: "Server responds with non-5xx status." (A 200 with an empty body is counted as success.)

Good SLI: "Request returns a non-5xx status AND body contains required fields AND response time < 2s."
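The good/bad classification is just a three-part predicate. A minimal sketch — the field names, required body fields, and helper name are illustrative, not from any particular service:

```python
# Sketch of the "good request" predicate behind the SLI above.
# Field names and REQUIRED_FIELDS are hypothetical examples.

REQUIRED_FIELDS = {"order_id", "total"}

def is_good_request(status: int, body: dict, latency_s: float) -> bool:
    """A request counts as good only if all three conditions hold."""
    return (
        status < 500                          # non-5xx status
        and REQUIRED_FIELDS.issubset(body)    # body contains required fields
        and latency_s < 2.0                   # fast enough for the user
    )

# A 200 with an empty body is NOT a good request:
print(is_good_request(200, {}, 0.1))                              # False
print(is_good_request(200, {"order_id": 1, "total": 9.99}, 0.3))  # True
```

The point of encoding it this way is that the same predicate can be applied at the load balancer, in the service, or in a synthetic probe, and all three will agree on what "good" means.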

In Prometheus, a request-based availability SLI:

# SLI: fraction of requests that were "good"
# Good = non-5xx status AND latency < 2s
# Note: a plain counter like http_requests_total has no le label;
# combining status and latency requires a duration histogram that
# carries a status label, so we query the histogram directly.
 
sum(rate(http_request_duration_seconds_bucket{
  job="checkout-api",
  status!~"5..",
  le="2.0"
}[5m]))
/
sum(rate(http_request_duration_seconds_count{job="checkout-api"}[5m]))

For latency SLIs, use histograms properly:

# Latency SLI: fraction of requests completing within 500ms
sum(rate(http_request_duration_seconds_bucket{
  job="checkout-api",
  le="0.5"
}[5m]))
/
sum(rate(http_request_duration_seconds_count{job="checkout-api"}[5m]))
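The bucket arithmetic is easy to sanity-check offline. A sketch with made-up cumulative counts — remember that Prometheus histogram buckets are cumulative, so the le="0.5" count already includes every faster request:

```python
# Hypothetical cumulative histogram bucket counts for one window.
# Keys are the le (upper bound) labels; values are cumulative counts.
buckets = {"0.1": 820, "0.5": 970, "2.0": 995, "+Inf": 1000}

# Fraction of requests completing within 500ms:
# bucket(le="0.5") divided by the total count (the +Inf bucket).
latency_sli = buckets["0.5"] / buckets["+Inf"]
print(latency_sli)  # 0.97
```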

Error Budget Calculation

With a 30-day SLO at 99.9%:

  • Total minutes in period: 30 × 24 × 60 = 43,200
  • Error budget: (1 - 0.999) × 43,200 = 43.2 minutes
  • This is the allowed failure budget before the SLO is breached
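The same arithmetic in a few lines, using the window and target from above:

```python
# Error budget for a 99.9% SLO over a 30-day window
slo_target = 0.999
window_minutes = 30 * 24 * 60           # 43,200 minutes in the period

error_budget_minutes = (1 - slo_target) * window_minutes
print(window_minutes)                   # 43200
print(round(error_budget_minutes, 1))   # 43.2
```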

Track remaining budget as a Prometheus recording rule:

# prometheus-slo-rules.yaml
groups:
  - name: slo_checkout
    interval: 60s
    rules:
      # 30-day error rate
      - record: job:slo_error_rate:30d
        expr: |
          1 - (
            sum(increase(http_requests_total{job="checkout-api", status!~"5.."}[30d]))
            /
            sum(increase(http_requests_total{job="checkout-api"}[30d]))
          )
 
      # Remaining error budget as a fraction (1 = full budget, 0 = exhausted)
      - record: job:error_budget_remaining:30d
        expr: |
          1 - (job:slo_error_rate:30d / 0.001)
# Dashboard: error budget burn-down
100 * job:error_budget_remaining:30d{job="checkout-api"}
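The recording rule's arithmetic, checked with a hypothetical measured error rate:

```python
# Remaining budget fraction = 1 - (measured error rate / allowed error rate)
allowed_error_rate = 0.001      # 1 - 0.999 SLO target
measured_error_rate = 0.0004    # hypothetical 30-day error rate

budget_remaining = 1 - measured_error_rate / allowed_error_rate
print(round(100 * budget_remaining, 1))  # 60.0 — 60% of the budget left
```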

Burn-Rate Alerts: Multi-Window, Multi-Burn

A threshold alert on error rate is too noisy (fires for brief spikes) and too slow (misses sustained moderate degradation). Google's SRE workbook approach uses burn-rate — how fast you are consuming the error budget relative to how fast the SLO period elapses.

A burn rate of 1 means you are exactly on track to exhaust the budget at the end of the period. A burn rate of 14.4 means you will exhaust the 30-day budget in 50 hours (30 days / 14.4 = ~2 days), which justifies a page.

# 1-hour burn rate: how much budget consumed vs. how much time elapsed
(
  1 - sum(rate(http_requests_total{job="checkout-api", status!~"5.."}[1h]))
      / sum(rate(http_requests_total{job="checkout-api"}[1h]))
) / 0.001
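The 14.4 threshold follows from simple division. A quick check, assuming the 30-day window used throughout:

```python
# Time to exhaust the budget = window length / burn rate
window_hours = 30 * 24           # 720 hours in the SLO window
burn_rate = 14.4

hours_to_exhaustion = window_hours / burn_rate
print(hours_to_exhaustion)       # 50.0 — roughly two days, worth a page

# Budget consumed in the first hour at this burn rate:
print(round(100 * burn_rate / window_hours, 1))  # 2.0 (%)
```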

Multi-window burn-rate alert (the standard approach):

groups:
  - name: slo_burn_checkout
    rules:
      # Page: fast burn — 2% budget in 1h from short AND long window
      - alert: CheckoutSLOCritical
        expr: |
          (
            job:slo_burn_rate:1h{job="checkout-api"} > 14.4
            and
            job:slo_burn_rate:5m{job="checkout-api"} > 14.4
          )
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Checkout SLO: critical burn rate {{ $value }}x"
          description: "At current rate, 30-day error budget exhausted in < 50 hours"
 
      # Ticket: slow burn — 5% budget in 6h from short AND long window
      - alert: CheckoutSLOWarning
        expr: |
          (
            job:slo_burn_rate:6h{job="checkout-api"} > 6
            and
            job:slo_burn_rate:30m{job="checkout-api"} > 6
          )
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Checkout SLO: elevated burn rate {{ $value }}x"

The dual-window condition (short AND long) suppresses false positives: the long window confirms the degradation is sustained rather than a brief spike, while the short window ensures the alert resets quickly once the problem stops, instead of paging on stale data.
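The dual-window gate reduces to a one-line boolean. A sketch of the page decision, with the threshold taken from the critical alert above and hypothetical burn-rate inputs:

```python
# Dual-window burn-rate gate: page only when BOTH windows exceed the threshold.

def should_page(burn_1h: float, burn_5m: float, threshold: float = 14.4) -> bool:
    """Long window (1h) confirms the burn is sustained;
    short window (5m) confirms it is still happening."""
    return burn_1h > threshold and burn_5m > threshold

print(should_page(burn_1h=20.0, burn_5m=18.0))  # True  — sustained fast burn
print(should_page(burn_1h=3.0,  burn_5m=40.0))  # False — brief spike only
print(should_page(burn_1h=20.0, burn_5m=1.0))   # False — already recovered
```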

Burn-Rate Recording Rules

Pre-compute the burn rates to avoid expensive ad-hoc queries:

groups:
  - name: burn_rate_precompute
    interval: 30s
    rules:
      - record: job:slo_burn_rate:5m
        expr: |
          (1 - sum by(job)(rate(http_requests_total{status!~"5.."}[5m]))
               / sum by(job)(rate(http_requests_total[5m])))
          / 0.001
 
      - record: job:slo_burn_rate:30m
        expr: |
          (1 - sum by(job)(rate(http_requests_total{status!~"5.."}[30m]))
               / sum by(job)(rate(http_requests_total[30m])))
          / 0.001
 
      - record: job:slo_burn_rate:1h
        expr: |
          (1 - sum by(job)(rate(http_requests_total{status!~"5.."}[1h]))
               / sum by(job)(rate(http_requests_total[1h])))
          / 0.001
 
      - record: job:slo_burn_rate:6h
        expr: |
          (1 - sum by(job)(rate(http_requests_total{status!~"5.."}[6h]))
               / sum by(job)(rate(http_requests_total[6h])))
          / 0.001

Error Budget Policy: Connecting SLOs to Decisions

An SLO without a policy is decoration. Define what happens at each budget level:

| Budget remaining | Action |
| --- | --- |
| > 50% | Normal feature velocity |
| 25–50% | Reliability work enters the sprint; new feature additions reviewed |
| 10–25% | Feature freeze; focus on reliability improvements |
| < 10% | Incident mode; no feature work until budget recovers |
| Exhausted | Executive escalation; customer notification if external |

Document this policy in your runbook and make the budget remaining metric visible on your team dashboard. "We have 8.3 minutes of error budget left this month" is a concrete statement that non-technical stakeholders can act on.
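The policy table maps directly to code, which makes it easy to surface the current action on a dashboard or in a bot. A sketch with the thresholds from the table above; the action strings are illustrative:

```python
# Map remaining error budget (fraction, 0.0 to 1.0) to the policy action.
# Thresholds mirror the policy table; action strings are illustrative.

def budget_policy(remaining: float) -> str:
    if remaining <= 0:
        return "executive escalation"
    if remaining < 0.10:
        return "incident mode: no feature work"
    if remaining < 0.25:
        return "feature freeze"
    if remaining < 0.50:
        return "reliability work enters sprint"
    return "normal feature velocity"

print(budget_policy(0.83))  # normal feature velocity
print(budget_policy(0.30))  # reliability work enters sprint
print(budget_policy(0.08))  # incident mode: no feature work
```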

Key Takeaways

  • SLIs must measure user experience, not infrastructure health — a 200 status with a broken body is not a good request.
  • Error budget is the quantified consequence of the SLO target — it makes the abstract ("99.9% uptime") concrete ("43.2 minutes of allowed downtime per month").
  • Burn-rate alerts outperform threshold alerts because they detect sustained moderate degradation that brief spikes miss, with far fewer false positives.
  • Multi-window burn-rate (short window confirms signal, long window confirms it is sustained) is the standard approach from the Google SRE workbook.
  • Pre-compute burn rates as recording rules — ad-hoc multi-window PromQL over long ranges is query-expensive.
  • An error budget policy that specifies actions at each budget threshold is what turns an SLO from a metric into a decision-making tool.