Cloud Cost Engineering

Cost as an SLO

Ravinder · 6 min read
Cloud Cost, FinOps, AWS, SLO, Budget Alerts, Engineering Culture

SLOs work because they make reliability concrete: you have an error budget, a burn rate, and an alert threshold. When the burn rate is too high, engineering acts. The same model applies to cost — a monthly budget is an error budget for spend. When burn rate is too high, you are on track to exhaust it before the month ends. The only question is whether you find out in hour 48 or hour 720.

The Budget Burn Rate Model

Error budget thinking applied to cost:

Monthly budget:   $50,000
Daily allowance:  $50,000 / 30 = $1,667/day
Burn rate 1.0:    Exactly on pace — budget exhausted at end of month
Burn rate 1.5:    50% over pace — budget exhausted at day 20
Burn rate 2.0:    Double pace — budget exhausted at day 15

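Alert at burn rate 1.5 (warning) and 2.0 (critical). These thresholds play the same role as the slow-burn and fast-burn alerts used for SLO error budgets: the lower one leaves time to investigate, the higher one demands immediate action.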
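
The exhaustion days listed above follow from dividing the month length by the burn rate. A minimal sketch of that arithmetic, assuming the burn rate stays constant for the whole month (the function name `exhaustion_day` is illustrative, not part of any library):

```python
def exhaustion_day(burn_rate: float, days_in_month: int = 30) -> float:
    """Day of the month on which the budget runs out at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return days_in_month / burn_rate


for rate in (1.0, 1.5, 2.0):
    print(f"burn rate {rate}: budget exhausted at day {exhaustion_day(rate):.0f}")
```

In practice the burn rate moves during the month, so the result is a projection to re-evaluate daily rather than a fixed date.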

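import calendar
from datetime import datetime, timezone
import boto3

def calculate_burn_rate(monthly_budget: float) -> dict:
    """
    Calculate current cost burn rate against monthly budget.
    Uses AWS Cost Explorer for month-to-date actuals.
    """
    ce = boto3.client("ce", region_name="us-east-1")
    now    = datetime.now(timezone.utc)
    start  = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    # Cost Explorer treats "End" as exclusive, so the query below covers only
    # completed days. Count those days rather than including today.
    days_elapsed = now.day - 1
    if days_elapsed == 0:
        raise ValueError("No completed days in this month yet; try again tomorrow.")
    days_in_month = calendar.monthrange(now.year, now.month)[1]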
 
    resp = ce.get_cost_and_usage(
        TimePeriod={
            "Start": start.strftime("%Y-%m-%d"),
            "End":   now.strftime("%Y-%m-%d"),
        },
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        Filter={"Not": {"Dimensions": {
            "Key": "RECORD_TYPE",
            "Values": ["Credit", "Refund", "Tax"],
        }}},
    )
 
    mtd_cost = float(resp["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])
    expected_mtd = monthly_budget * (days_elapsed / days_in_month)
    burn_rate    = mtd_cost / expected_mtd if expected_mtd > 0 else 0
    projected    = mtd_cost * (days_in_month / days_elapsed)
 
    return {
        "days_elapsed":    days_elapsed,
        "mtd_cost":        round(mtd_cost, 2),
        "expected_mtd":    round(expected_mtd, 2),
        "burn_rate":       round(burn_rate, 3),
        "projected_total": round(projected, 2),
        "monthly_budget":  monthly_budget,
        "budget_surplus":  round(monthly_budget - projected, 2),
    }
 
def burn_rate_status(burn_rate: float) -> str:
    if burn_rate >= 2.0: return "CRITICAL"
    if burn_rate >= 1.5: return "WARNING"
    if burn_rate >= 1.1: return "ELEVATED"
    return "OK"
 
if __name__ == "__main__":
    result = calculate_burn_rate(50000)
    status = burn_rate_status(result["burn_rate"])
    print(f"Status:    {status}")
    print(f"Burn rate: {result['burn_rate']}x")
    print(f"MTD cost:  ${result['mtd_cost']:,.2f} (expected ${result['expected_mtd']:,.2f})")
    print(f"Projected: ${result['projected_total']:,.2f} vs budget ${result['monthly_budget']:,.2f}")

AWS Budgets: The Native Alert Layer

# Terraform — multi-threshold cost budget with SNS alerts
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-cost-budget"
  budget_type  = "COST"
  limit_amount = "50000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"
 
  # Warning: actual spend has passed 75% of the budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 75
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_sns_topic_arns  = [aws_sns_topic.cost_alerts.arn]
  }
 
  # Alert on forecast exceeding budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_sns_topic_arns  = [aws_sns_topic.cost_alerts.arn]
  }
 
  # Critical — actual spend exceeds budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_sns_topic_arns  = [aws_sns_topic.cost_alerts_critical.arn]
  }
}
 
resource "aws_budgets_budget" "per_team" {
  for_each = var.team_budgets   # map of team -> monthly budget amount
 
  name         = "team-${each.key}-monthly"
  budget_type  = "COST"
  limit_amount = tostring(each.value)
  limit_unit   = "USD"
  time_unit    = "MONTHLY"
 
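  cost_filter {
    name   = "TagKeyValue"
    # AWS Budgets expects user-defined tags as "user:<key>$<value>".
    # Writing "$${" would escape the interpolation, so build the string with format().
    values = [format("user:team$%s", each.key)]
  }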
 
  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 80
    threshold_type            = "PERCENTAGE"
    notification_type         = "FORECASTED"
    subscriber_email_addresses = ["finops@example.com"]
  }
}

Cost SLO Definition Template

Treat cost targets as SLOs in your service documentation:

# service-cost-slo.yaml
service: payment-api
owner: payments-team
budget_period: monthly
 
slos:
  - name: compute-cost-per-request
    description: "Compute spend per successful API request"
    metric: "aws_cost_usd{team='payments', service='EC2'} / sum(api_requests_total{status!~'5..'})"
    target: 0.00080  # $0.0008 per request
    window: 30d
    alert_threshold: 0.00100  # 25% over target triggers review
 
  - name: monthly-budget-burn-rate
    description: "Projected end-of-month spend vs monthly budget"
    metric: "projected_monthly_cost / monthly_budget"
    target: 1.00     # <= 100% of budget
    window: 7d
    alert_threshold: 1.20   # alert when projected spend is 20% over budget
 
  - name: unallocated-spend-ratio
    description: "Fraction of spend with no team tag"
    metric: "cost{team='untagged'} / total_cost"
    target: 0.05     # < 5% untagged
    window: 7d
    alert_threshold: 0.10
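
One way to act on a file like this is to compare each SLO's measured value against its `target` and `alert_threshold`. The sketch below assumes the YAML has already been loaded into plain dictionaries and that lower values are better for every SLO. `evaluate_slo`, the status labels, and the measured values are hypothetical, chosen only for illustration.

```python
def evaluate_slo(slo: dict, measured: float) -> str:
    """Classify a measured value against an SLO's target and alert threshold.

    Assumes lower is better, which holds for all three SLOs in the template.
    """
    if measured > slo["alert_threshold"]:
        return "ALERT"
    if measured > slo["target"]:
        return "OVER_TARGET"
    return "OK"


# Entries mirror the template, reduced to the fields the check needs.
slos = [
    {"name": "compute-cost-per-request", "target": 0.00080, "alert_threshold": 0.00100},
    {"name": "monthly-budget-burn-rate", "target": 1.00, "alert_threshold": 1.20},
    {"name": "unallocated-spend-ratio", "target": 0.05, "alert_threshold": 0.10},
]

# Hypothetical measurements, for illustration only.
measured = {
    "compute-cost-per-request": 0.00090,
    "monthly-budget-burn-rate": 1.25,
    "unallocated-spend-ratio": 0.03,
}

for slo in slos:
    print(f"{slo['name']}: {evaluate_slo(slo, measured[slo['name']])}")
```

Keeping a middle state between the target and the alert threshold gives teams a visible signal that a review is due before anything pages.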

The Cost Incident Workflow

sequenceDiagram
    participant Budget as AWS Budget Alert
    participant SNS as SNS Topic
    participant Lambda as Triage Lambda
    participant Slack as Slack #finops-alerts
    participant OC as On-Call Engineer
    participant CUR as CUR / Athena

    Budget->>SNS: Threshold breached
    SNS->>Lambda: Trigger triage
    Lambda->>CUR: Top services + resources (last 24h)
    Lambda->>Slack: Alert with triage summary
    Slack->>OC: Notification with context
    OC->>CUR: Investigate resource IDs
    OC->>Slack: Post findings + action taken
    Note over OC,Slack: Close loop in < 4 hours for CRITICAL

Lambda triage function that attaches context to budget alerts:

import json
from datetime import datetime, timedelta, timezone

import boto3

def lambda_handler(event, context):
    """Triggered by SNS budget alert. Fetches top cost drivers and posts to Slack."""
    ce = boto3.client("ce", region_name="us-east-1")
    ssm = boto3.client("ssm")

    # Cost Explorer's End date is exclusive, so this window is yesterday (UTC).
    end   = datetime.now(timezone.utc).date()
    start = end - timedelta(days=1)
 
    top_services = ce.get_cost_and_usage(
        TimePeriod={"Start": str(start), "End": str(end)},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
 
    lines = ["*Budget Alert — Top Services (Last 24h)*", "```"]
    for group in sorted(
        top_services["ResultsByTime"][0]["Groups"],
        key=lambda x: float(x["Metrics"]["UnblendedCost"]["Amount"]),
        reverse=True
    )[:8]:
        svc  = group["Keys"][0]
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        lines.append(f"{svc:40s}  ${cost:>9,.2f}")
    lines.append("```")
    lines.append("Run `make cost-triage` for resource-level breakdown.")
 
    message = "\n".join(lines)
    webhook = ssm.get_parameter(Name="/finops/slack-webhook", WithDecryption=True)["Parameter"]["Value"]
 
    import urllib.request
    payload = json.dumps({"text": message}).encode()
    req = urllib.request.Request(webhook, data=payload,
                                  headers={"Content-Type": "application/json"})
    # Bound the call so a slow Slack response cannot stall the function.
    urllib.request.urlopen(req, timeout=10)
    return {"status": "ok"}
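
The handler above does not read the incoming event. If the alert text itself should appear in the Slack message, it can be pulled from the SNS envelope that Lambda receives. A minimal sketch, assuming the standard SNS-to-Lambda event shape (`Records[].Sns.Subject` and `Records[].Sns.Message`); `extract_alert_text` is a hypothetical helper and the sample event is made up for illustration:

```python
def extract_alert_text(event: dict) -> list[str]:
    """Return one 'subject: message' line per SNS record in a Lambda event."""
    lines = []
    for record in event.get("Records", []):
        sns = record.get("Sns", {})
        subject = sns.get("Subject") or "(no subject)"
        message = (sns.get("Message") or "").strip()
        lines.append(f"{subject}: {message}")
    return lines


# Made-up event for illustration; real budget notifications carry longer text.
sample_event = {
    "Records": [
        {"Sns": {"Subject": "Budget alert", "Message": "monthly-cost-budget exceeded 75%\n"}}
    ]
}
print(extract_alert_text(sample_event))
```

The returned lines could be prepended to the triage summary so the on-call engineer sees which budget and threshold fired.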

Cost SLO Dashboard Components

flowchart TD
    subgraph Dashboard["Cost SLO Dashboard"]
        BR[Burn Rate Gauge - 1.0x target]
        MTD[MTD vs Expected Spend]
        PROJ[30-day Projection vs Budget]
        TPR[Team Budget Status - RAG]
        TOP[Top Cost Movers - 7d]
        EFF[Cost per Request Trend]
    end
    subgraph Alerts["Alert Channels"]
        WARN[Slack warning at 1.5x burn]
        CRIT[PagerDuty critical at 2.0x burn]
        FORE[Budget forecast alert at 100%]
    end
    Dashboard --> Alerts
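
The alert channels in the diagram imply a routing rule keyed on burn rate. A sketch of that rule, using the 1.5x and 2.0x thresholds from the diagram; the channel names returned here are placeholders, and the forecast alert is left to AWS Budgets because it does not depend on burn rate:

```python
from typing import Optional


def alert_channel(burn_rate: float) -> Optional[str]:
    """Pick the alert channel for a burn rate, or None when no alert is due."""
    if burn_rate >= 2.0:
        return "pagerduty"  # critical: page the on-call engineer
    if burn_rate >= 1.5:
        return "slack"      # warning: post to the alerts channel
    return None


for rate in (1.2, 1.6, 2.3):
    print(rate, alert_channel(rate))
```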

Key Takeaways

  • Burn rate is the correct framing for budget alerts — a threshold on total spend gives you a single alert when it is too late; burn rate gives you time to act.
  • Per-unit cost metrics (cost per request, cost per active user) separate efficiency from growth: raw dollar increases are expected as you scale, but deteriorating unit economics are not.
  • AWS Budgets forecasted-spend alerts fire before you overspend, not after, so always add a FORECASTED threshold alongside ACTUAL thresholds.
  • A cost incident workflow with a 4-hour response SLA for critical burns treats overspend with the same seriousness as a P1 availability incident — this cultural signal is more important than the specific alert threshold chosen.
  • The cost SLO YAML in your service repository makes cost targets a first-class engineering artifact, visible in code review and tracked in the same system as reliability SLOs.
  • This series closes where it started — the bill is a feature. When cost has burn rates, alerts, incidents, and SLOs, it behaves like every other reliability concern your engineering organization already knows how to manage.