The Error-Budget Conversation, with Numbers
The product manager walks into the sprint planning meeting with a list of features. The engineering lead walks in knowing the service's error budget is 70% consumed with 12 days left in the month. Those two people need to have a conversation, and in most engineering organizations, they don't have the vocabulary or the data to have it well.
"We need to focus on reliability" versus "we need to ship features" is an unresolvable argument when it's framed that way. It's a values debate, and values debates produce heat but not decisions. Error budget framing converts it into an engineering tradeoff with actual numbers. That's the argument for the approach — not that it's elegant, but that it makes the right conversations possible.
SLI, SLO, SLA: The Hierarchy You Need to Internalize
These three terms get used interchangeably in conversations where they shouldn't be.
Service Level Indicator (SLI): A specific quantitative measurement of service behavior. Examples:
- Request success rate: (successful requests) / (total requests)
- p99 request latency: the latency value below which 99% of requests complete
- Data freshness: time since last successful data pipeline run
An SLI is a raw number — a measurement. It has no target embedded in it.
Service Level Objective (SLO): A target for an SLI, measured over a window. Examples:
- "99.5% of requests succeed over a rolling 30-day window"
- "p99 latency is under 500ms for 95% of 5-minute windows"
An SLO is an internal engineering commitment. It's not shown to customers. It's the threshold that determines whether you have budget to spend or not.
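To make the SLI/SLO distinction concrete, here's a minimal sketch in Python; the request data shape is invented for illustration:

```python
def success_rate_sli(outcomes: list[bool]) -> float:
    """SLI: the raw measurement, the fraction of requests that succeeded."""
    return sum(outcomes) / len(outcomes) if outcomes else 1.0

def slo_met(sli_value: float, target: float) -> bool:
    """SLO: the same measurement compared against an internal target."""
    return sli_value >= target

# 1,000 requests, 4 failures -> SLI = 0.996
outcomes = [True] * 996 + [False] * 4
sli = success_rate_sli(outcomes)
print(sli)                  # 0.996
print(slo_met(sli, 0.995))  # True: the 99.5% SLO is met
```

The SLI exists whether or not anyone has picked a target; the SLO is just a comparison against that number.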
Service Level Agreement (SLA): A contractual commitment to a customer, with consequences for violation. Examples:
- "99% availability per calendar month, with service credits for violations"
SLAs are typically more lenient than SLOs. You run your SLO at 99.5% so that you have buffer before you breach your 99% SLA. The gap between the two is intentional.
The Error Budget Math
If your SLO is 99.5% success rate over 30 days, your error budget is 0.5% of requests. Over 30 days, if your service handles 1 million requests, that's:
Total error budget = 1,000,000 × 0.005 = 5,000 failed requests

That's the pool of failures you're allowed before you breach your SLO. Some of those failures are expected (transient network issues, rare client errors). Some are incidents. The budget is shared across all causes.
Converting this to time for an availability SLO:
99.5% availability over 30 days
= 30 days × 24 hours × 60 minutes × (1 - 0.995)
= 43,200 minutes × 0.005
= 216 minutes of allowable downtime per month

Or roughly 3.6 hours. If a single incident took 47 minutes (like the payment outage I described earlier), that's 47/216 ≈ 21.8% of your monthly error budget consumed in one event.
```python
from dataclasses import dataclass

@dataclass
class ErrorBudget:
    slo_target: float     # e.g. 0.995 for 99.5%
    window_days: int      # e.g. 30
    total_requests: int   # expected volume in the window

    @property
    def budget_fraction(self) -> float:
        return 1.0 - self.slo_target

    @property
    def allowed_failures(self) -> int:
        return int(self.total_requests * self.budget_fraction)

    @property
    def allowed_downtime_minutes(self) -> float:
        return self.window_days * 24 * 60 * self.budget_fraction

    def consumption_pct(self, actual_failures: int) -> float:
        return (actual_failures / self.allowed_failures) * 100

    def burn_rate(self, actual_failures: int, elapsed_days: float) -> float:
        """How fast we're burning budget relative to the expected rate."""
        if elapsed_days == 0:
            return 0.0
        expected_fraction_elapsed = elapsed_days / self.window_days
        actual_fraction_consumed = actual_failures / self.allowed_failures
        return actual_fraction_consumed / expected_fraction_elapsed

# Example: 99.5% SLO, 30-day window, 1M requests/month
budget = ErrorBudget(slo_target=0.995, window_days=30, total_requests=1_000_000)
print(f"Allowed failures: {budget.allowed_failures}")                  # 5000
print(f"Allowed downtime: {budget.allowed_downtime_minutes:.0f} min")  # 216 min

# After 10 days (1/3 of window), if we've had 3000 failures:
# Expected by this point: 5000 * (10/30) ≈ 1667 failures
# Actual: 3000 failures → burn rate = (3000/5000) / (10/30) = 1.8x
burn = budget.burn_rate(actual_failures=3000, elapsed_days=10)
print(f"Burn rate: {burn:.1f}x")  # 1.8x — burning faster than expected
```

Burn Rate: The Early Warning Signal
Burn rate is the ratio of how fast you're consuming error budget relative to how fast you're expected to consume it. A burn rate of 1.0 means you're on track to exactly hit (but not breach) your SLO at the end of the window. A burn rate of 2.0 means you'll exhaust your budget halfway through the window.
Burn rate is more actionable than raw consumption percentage because it gives you time to respond. If you're 15 days into a 30-day window and you've consumed 80% of your budget, you know you have a problem — but you might not be able to determine from that number alone whether you need to act today or whether it was a one-time incident that's now resolved.
Burn rate over the last 1 hour versus the last 6 hours versus the last 72 hours tells you whether the problem is acute (high 1-hour burn, low 72-hour burn — recent incident) or chronic (roughly equal across windows — persistent degradation).
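One way to put the multi-window comparison into practice is a small classifier. This is a sketch in the spirit of multi-window burn-rate alerting; the thresholds here are illustrative assumptions, not a standard:

```python
def window_burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Burn rate for one window: observed error rate / budgeted error rate."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo_target)

def classify_burn(burn_1h: float, burn_6h: float, burn_72h: float) -> str:
    """Distinguish acute incidents from chronic degradation.
    Thresholds are illustrative, not a standard."""
    if burn_1h >= 10 and burn_72h < 2:
        return "acute"      # spike in the last hour, long-term trend still OK
    if min(burn_1h, burn_6h, burn_72h) >= 2:
        return "chronic"    # elevated across every window
    if max(burn_1h, burn_6h, burn_72h) > 1:
        return "elevated"   # burning faster than the 1.0x reference somewhere
    return "healthy"

# A 99.5% SLO budgets a 0.5% error rate; a 6% error rate burns at ~12x.
print(window_burn_rate(bad=60, total=1_000, slo_target=0.995))  # ~12
print(classify_burn(burn_1h=12.0, burn_6h=4.0, burn_72h=1.1))   # acute
```

The point of the three windows is that a single number can't distinguish "bad hour" from "bad month"; the ratio pattern can.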
Plotted over the window, cumulative budget consumption against a straight "burn at rate 1.0" reference line makes this visible at a glance. Picture a consumption curve that tracks the reference through day 10, climbs steeply around days 11-15 (a significant incident), then flattens, but is still projected to cross 100% before day 30. That's a team that needs to investigate what happened on days 11-15 and whether it's fixed.
The Policy: What Changes When Budget is Burned
The error budget is useful only if it's connected to a policy that changes engineering behavior. Without a policy, it's just a dashboard metric that people nod at.
Our policy has three zones:
Green (0–50% consumed): Normal development velocity. Teams ship features. No reliability tax on the sprint.
Yellow (50–80% consumed): Reliability review required before new features are estimated into sprints. Each feature proposal includes a reliability impact assessment: "does this change touch the critical path? Does it add new failure modes?" Features that increase error surface require reliability work paired in the same sprint.
Red (80–100% consumed): Feature freeze for the owning team. All sprint capacity goes to reliability work. No exceptions without engineering director approval. The team runs a focused reliability sprint: identify the sources of budget consumption, implement fixes, validate that the fixes reduce burn rate.
Breached (100%+ consumed): Full incident review. Engineering manager notified. SLA impact assessment. Customer communications if SLA breach is likely. Post-incident review within 48 hours.
The policy needs to be written down, shared with product management, and applied consistently. The first time you enforce a feature freeze, there will be pushback. The pushback is manageable if the policy was agreed to in advance. It's a crisis if you're announcing the policy in the middle of a budget crunch.
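The zone thresholds translate directly into code. A minimal lookup sketch, mirroring the policy wording above (the names are mine, not an official schema):

```python
def policy_zone(consumption_pct: float) -> str:
    """Map current error-budget consumption to its policy zone."""
    if consumption_pct > 100:
        return "breached"
    if consumption_pct >= 80:
        return "red"
    if consumption_pct >= 50:
        return "yellow"
    return "green"

# What each zone requires, per the policy above.
ZONE_ACTIONS = {
    "green": "normal feature velocity",
    "yellow": "reliability review before new features enter a sprint",
    "red": "feature freeze; reliability sprint; director approval for exceptions",
    "breached": "full incident review; assess SLA impact; notify stakeholders",
}

print(policy_zone(68.0))                # yellow
print(ZONE_ACTIONS[policy_zone(68.0)])  # reliability review before new features enter a sprint
```

Encoding the zones this way keeps dashboards, alerts, and reports using the exact thresholds leadership signed off on, rather than three slightly different copies.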
```yaml
# Error budget policy — engineering/reliability-policy.yaml
# Reviewed and signed by engineering leadership and product leadership
error_budget_policy:
  version: "2025-Q4"
  approved_by:
    - "VP Engineering"
    - "VP Product"
  zones:
    green:
      threshold: "0-50% consumed"
      action: "Normal feature velocity"
      review_required: false
    yellow:
      threshold: "50-80% consumed"
      action: "Reliability review before new features"
      review_required: true
      reviewer: "engineering-lead"
    red:
      threshold: "80-100% consumed"
      action: "Feature freeze; reliability sprint"
      review_required: true
      reviewer: "engineering-director"
      exceptions: "Requires engineering-director written approval"
    breached:
      threshold: ">100% consumed"
      action: "Full incident review; customer communication if SLA affected"
      required_notifications:
        - "engineering-director"
        - "customer-success"
        - "product-vp"
```

Reporting That Drives Decisions
The error budget report should be a single artifact, updated weekly, consumed by both engineering and product leadership. Our report has five sections:
1. Current status (the headline): Which zone are we in? Current consumption percentage. Days remaining in the window.
2. Top consumers: What caused the most budget consumption this week? Ideally broken down by incident, not by time period.
3. Burn rate trend: 1-hour, 6-hour, 72-hour burn rates. Is the burn accelerating or decelerating?
4. Reliability work in progress: What is the team actively working on to reduce burn rate? Expected impact and timeline.
5. Forecast: At current burn rate, where will we be at end of window? What needs to be true for us to stay in green zone?
The forecast section is where the product conversation happens. If the forecast shows us hitting red zone in 10 days, product management needs to know now, not in 10 days. The report creates the context for that conversation.
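The forecast itself is simple arithmetic. A sketch, assuming the recent burn rate holds for the rest of the window:

```python
def project_end_of_window(current_pct: float, burn_rate: float,
                          days_remaining: float, window_days: float) -> float:
    """Project budget consumption at window end.

    At burn rate 1.0 a service consumes (100 / window_days) percent of its
    budget per day, so the remaining days add
    burn_rate * (days_remaining / window_days) * 100 percentage points.
    """
    return current_pct + burn_rate * (days_remaining / window_days) * 100

# 68% consumed, 14 days left in a 30-day window, 72h burn rate of 1.4x:
projected = project_end_of_window(68.0, 1.4, 14, 30)
print(f"{projected:.0f}%")  # 133% -> breach unless the burn rate comes down
```

It's a linear extrapolation, which is why the report pairs it with a confidence field: a projection dominated by one resolved incident deserves less weight than one driven by chronic burn.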
```typescript
// Weekly error budget report interface
interface ErrorBudgetReport {
  service: string;
  slo: number; // e.g. 0.995
  windowDays: number;
  reportDate: string;
  current: {
    consumptionPct: number;
    zone: "green" | "yellow" | "red" | "breached";
    remainingBudgetMinutes: number;
    daysRemaining: number;
  };
  burnRates: {
    last1h: number; // relative to expected rate
    last6h: number;
    last72h: number;
  };
  topConsumers: Array<{
    incidentId: string;
    budgetConsumedPct: number;
    description: string;
    resolved: boolean;
  }>;
  reliabilityWork: Array<{
    title: string;
    owner: string;
    expectedBudgetImpactPct: number;
    eta: string;
  }>;
  forecast: {
    projectedConsumptionAtWindowEnd: number;
    projectedZone: "green" | "yellow" | "red" | "breached";
    confidence: "high" | "medium" | "low";
    notes: string;
  };
}
```

The Conversation to Have with Product
Armed with this data, here's how the conversation actually goes.
"We're at 68% error budget consumption with 14 days left. Our burn rate over the last 72 hours is 1.4x expected, which means if we do nothing we'll hit red zone in about 6 days. We have two reliability fixes in progress: one addressing the database timeout issue that caused 15% of this month's budget, and one reducing the tail latency on the search service. If both land by end of week, we project ending the month around 75-80% — yellow zone, within our SLO."
"Given that, we recommend not picking up the new checkout flow feature this sprint. We can slot it next sprint if the reliability fixes land as expected. The alternative is to pick it up now and accept that if we hit another incident this month, we'll breach our SLO."
This is a different conversation than "we need to focus on reliability." It's a tradeoff with numbers attached: here's the current state, here's the risk, here's the recommendation, here's the alternative. Product managers who work with engineering teams that communicate this way become advocates for reliability investment, because they understand what they're getting from it.
Setting the Right SLO
One trap to avoid: SLOs that are set by negotiation rather than by measurement. "Let's target 99.9%" sounds responsible. If your service has historically operated at 99.6%, a 99.9% SLO means your error budget is 10x smaller than the actual failure rate — you'll be in budget crunch permanently.
SLOs should be set by looking at:
- Current measured performance (what's the actual baseline?)
- Customer pain threshold (what failure rate causes real customer harm?)
- Cost of improvement (what does it take to move from 99.6% to 99.9%?)
The right SLO is typically slightly more aggressive than current measured performance — ambitious enough to drive improvement, not so aggressive that it's permanently breached. If your service is at 99.6% today, a 99.7% SLO creates a small, achievable target. A 99.95% SLO is aspirational theater.
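One way to anchor the target in data is to derive it from recent measured performance. This is a sketch; the percentile and the "stretch" fraction are judgment calls I'm assuming, not a standard:

```python
def suggest_slo(daily_success_rates: list[float], stretch: float = 0.25) -> float:
    """Suggest an SLO anchored to measurement rather than negotiation.

    Baseline: a conservative (roughly 25th-percentile) recent day.
    Suggestion: close a fraction of the gap between baseline and 100%,
    so the target is ambitious but not permanently breached.
    """
    baseline = sorted(daily_success_rates)[len(daily_success_rates) // 4]
    return baseline + stretch * (1.0 - baseline)

# A service running at ~99.6%: the suggestion lands near 99.7%,
# not at an aspirational 99.95%.
history = [0.996] * 30
print(f"{suggest_slo(history):.4f}")  # 0.9970
```

The exact formula matters less than the discipline: the proposal starts from what the service actually does, and the stretch is an explicit, tunable parameter you can defend in the SLO review.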
Key Takeaways
- SLI, SLO, and SLA are distinct: SLI is a measurement, SLO is an internal target for that measurement, SLA is a customer contract — run your SLO more aggressively than your SLA to create buffer.
- Error budget math is straightforward: allowed failures = (1 - SLO target) × request volume, and allowed downtime = (1 - SLO target) × window length — for a 99.5% SLO on 1M monthly requests, that's 5,000 allowed failures, or roughly 216 minutes of allowed downtime.
- Burn rate (actual consumption rate ÷ expected consumption rate) is more actionable than raw percentage consumed — comparing 1-hour, 6-hour, and 72-hour burn rates distinguishes acute incidents from chronic degradation.
- Error budgets require a written, pre-agreed policy that ties consumption zones to engineering behavior (feature freeze, reliability sprint, etc.) — without the policy the metric is just a dashboard nobody acts on.
- Weekly error budget reports that include a forecast ("at current burn rate, where will we be at month end?") make the product-engineering tradeoff conversation concrete and solvable rather than a values debate.
- SLOs should be set based on current measured performance and customer pain thresholds, not aspirational targets — an SLO that's permanently breached drives anxiety, not improvement.