Cost as an SLO
Ravinder · 6 min read
Tags: Cloud Cost, FinOps, AWS, SLO, Budget Alerts, Engineering Culture
Series: Cloud Cost Engineering (final part). Previous: Part 9, Showback vs Chargeback
SLOs work because they make reliability concrete: you have an error budget, a burn rate, and an alert threshold. When the burn rate is too high, engineering acts. The same model applies to cost — a monthly budget is an error budget for spend. When burn rate is too high, you are on track to exhaust it before the month ends. The only question is whether you find out in hour 48 or hour 720.
The Budget Burn Rate Model
Error-budget thinking applied to cost:
- Monthly budget: $50,000
- Daily allowance: $50,000 / 30 ≈ $1,667/day
- Burn rate 1.0: exactly on pace, budget exhausted at end of month
- Burn rate 1.5: 50% over pace, budget exhausted at day 20
- Burn rate 2.0: double pace, budget exhausted at day 15

Alert at burn rate 1.5 (warning) and 2.0 (critical). The two tiers parallel the warning and critical levels of SLO burn-rate alerting: the lower one leaves time to investigate, the higher one calls for immediate action.
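The exhaustion days listed above follow from dividing the month length by the burn rate. A minimal sketch of that arithmetic, assuming a constant burn rate and a 30-day month (the function names `exhaustion_day` and `daily_allowance` are illustrative, not from any library):

```python
def daily_allowance(monthly_budget: float, days_in_month: int = 30) -> float:
    """Spend per day that lands exactly on budget at month end."""
    return monthly_budget / days_in_month


def exhaustion_day(burn_rate: float, days_in_month: int = 30) -> float:
    """Day of the month on which the budget runs out at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return days_in_month / burn_rate


if __name__ == "__main__":
    print(f"daily allowance: ${daily_allowance(50_000):,.0f}")
    for rate in (1.0, 1.5, 2.0):
        print(f"burn rate {rate:.1f}x: budget exhausted on day {exhaustion_day(rate):.0f}")
```

Real spend is rarely constant, so treat the result as a projection rather than a deadline.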
```python
import calendar
from datetime import datetime, timedelta, timezone

import boto3


def calculate_burn_rate(monthly_budget: float) -> dict:
    """
    Calculate current cost burn rate against monthly budget.
    Uses AWS Cost Explorer for month-to-date actuals.
    """
    # The Cost Explorer API endpoint lives in us-east-1.
    ce = boto3.client("ce", region_name="us-east-1")

    now = datetime.now(timezone.utc)
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    # Counts today as a full day. Today's data is partial and Cost Explorer
    # lags, so the burn rate is slightly understated early in the day.
    days_elapsed = now.day
    days_in_month = calendar.monthrange(now.year, now.month)[1]

    resp = ce.get_cost_and_usage(
        TimePeriod={
            "Start": start.strftime("%Y-%m-%d"),
            # End is exclusive: use tomorrow so today is included and
            # Start < End still holds on the 1st of the month.
            "End": (now + timedelta(days=1)).strftime("%Y-%m-%d"),
        },
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        # Exclude credits, refunds, and tax so the figure reflects usage.
        Filter={"Not": {"Dimensions": {
            "Key": "RECORD_TYPE",
            "Values": ["Credit", "Refund", "Tax"],
        }}},
    )

    mtd_cost = float(resp["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])
    expected_mtd = monthly_budget * (days_elapsed / days_in_month)
    burn_rate = mtd_cost / expected_mtd if expected_mtd > 0 else 0.0
    # Linear projection: assumes the rest of the month matches the pace so far.
    projected = mtd_cost * (days_in_month / days_elapsed)

    return {
        "days_elapsed": days_elapsed,
        "mtd_cost": round(mtd_cost, 2),
        "expected_mtd": round(expected_mtd, 2),
        "burn_rate": round(burn_rate, 3),
        "projected_total": round(projected, 2),
        "monthly_budget": monthly_budget,
        "budget_surplus": round(monthly_budget - projected, 2),
    }


def burn_rate_status(burn_rate: float) -> str:
    if burn_rate >= 2.0:
        return "CRITICAL"
    if burn_rate >= 1.5:
        return "WARNING"
    if burn_rate >= 1.1:
        return "ELEVATED"
    return "OK"


if __name__ == "__main__":
    result = calculate_burn_rate(50000)
    status = burn_rate_status(result["burn_rate"])
    print(f"Status: {status}")
    print(f"Burn rate: {result['burn_rate']}x")
    print(f"MTD cost: ${result['mtd_cost']:,.2f} (expected ${result['expected_mtd']:,.2f})")
    print(f"Projected: ${result['projected_total']:,.2f} vs budget ${result['monthly_budget']:,.2f}")
```

AWS Budgets: The Native Alert Layer
```hcl
# Terraform: multi-threshold cost budget with SNS alerts.
# Assumes aws_sns_topic.cost_alerts and aws_sns_topic.cost_alerts_critical are
# defined elsewhere, with topic policies that let budgets.amazonaws.com publish.
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-cost-budget"
  budget_type  = "COST"
  limit_amount = "50000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Warning: actual spend has reached 75% of the budget
  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 75
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
  }

  # Alert on forecast exceeding budget
  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 100
    threshold_type            = "PERCENTAGE"
    notification_type         = "FORECASTED"
    subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
  }

  # Critical: actual spend exceeds budget
  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 100
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts_critical.arn]
  }
}

resource "aws_budgets_budget" "per_team" {
  for_each = var.team_budgets # map of team -> monthly budget amount

  name         = "team-${each.key}-monthly"
  budget_type  = "COST"
  limit_amount = tostring(each.value)
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name = "TagKeyValue"
    # The filter value is "<tag key>$<tag value>". Writing "$${...}" in a
    # Terraform string escapes the interpolation, so build it with format().
    # The "user:" prefix for user-defined cost allocation tags is an
    # assumption; verify it against the tag keys shown in your account.
    values = [format("user:team$%s", each.key)]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["finops@example.com"]
  }
}
```

Cost SLO Definition Template
Treat cost targets as SLOs in your service documentation:
```yaml
# service-cost-slo.yaml
service: payment-api
owner: payments-team
budget_period: monthly
slos:
  - name: compute-cost-per-request
    description: "Compute spend per successful API request"
    metric: "aws_cost_usd{team='payments', service='EC2'} / sum(api_requests_total{status!~'5..'})"
    target: 0.00080            # $0.0008 per request
    window: 30d
    alert_threshold: 0.00100   # 25% over target triggers review
  - name: monthly-budget-burn-rate
    description: "Projected end-of-month spend vs monthly budget"
    metric: "projected_monthly_cost / monthly_budget"
    target: 1.00               # <= 100% of budget
    window: 7d
    alert_threshold: 1.20      # warn when projection is 20% over budget
  - name: unallocated-spend-ratio
    description: "Fraction of spend with no team tag"
    metric: "cost{team='untagged'} / total_cost"
    target: 0.05               # < 5% untagged
    window: 7d
    alert_threshold: 0.10
```

The Cost Incident Workflow
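An incident starts when an observed value crosses an SLO's `alert_threshold`. As a minimal sketch of that check, assuming each SLO entry carries the `target` and `alert_threshold` fields from the template and that lower values are always better, the following classifies a measurement. The `CostSlo` class, the `evaluate` and `cost_per_request` helpers, and the sample figures are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class CostSlo:
    """One entry from the `slos` list (only the fields needed for the check)."""
    name: str
    target: float
    alert_threshold: float


def cost_per_request(compute_cost_usd: float, successful_requests: int) -> float:
    """Unit cost: compute spend divided by successful requests in the window."""
    if successful_requests <= 0:
        raise ValueError("successful_requests must be positive")
    return compute_cost_usd / successful_requests


def evaluate(slo: CostSlo, observed: float) -> str:
    """Classify an observed value against the SLO (lower is better)."""
    if observed >= slo.alert_threshold:
        return "ALERT"
    if observed > slo.target:
        return "OVER_TARGET"
    return "OK"


if __name__ == "__main__":
    slo = CostSlo("compute-cost-per-request", target=0.00080, alert_threshold=0.00100)
    # Hypothetical 30-day window: $4,500 of compute for 5M successful requests.
    observed = cost_per_request(compute_cost_usd=4_500.0, successful_requests=5_000_000)
    print(f"{slo.name}: ${observed:.5f}/request -> {evaluate(slo, observed)}")
```

Using a per-request figure rather than raw dollars keeps traffic growth from registering as a cost regression.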
```mermaid
sequenceDiagram
    participant Budget as AWS Budget Alert
    participant SNS as SNS Topic
    participant Lambda as Triage Lambda
    participant Slack as Slack #finops-alerts
    participant OC as On-Call Engineer
    participant CUR as CUR / Athena
    Budget->>SNS: Threshold breached
    SNS->>Lambda: Trigger triage
    Lambda->>CUR: Top services + resources (last 24h)
    Lambda->>Slack: Alert with triage summary
    Slack->>OC: Notification with context
    OC->>CUR: Investigate resource IDs
    OC->>Slack: Post findings + action taken
    Note over OC,Slack: Close loop in < 4 hours for CRITICAL
```
A Lambda triage function that attaches context to budget alerts:

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

import boto3


def lambda_handler(event, context):
    """Triggered by SNS budget alert. Fetches top cost drivers and posts to Slack."""
    ce = boto3.client("ce", region_name="us-east-1")
    ssm = boto3.client("ssm")

    # Cost Explorer's End date is exclusive, so this window is the previous
    # full UTC day rather than a rolling 24 hours.
    end = datetime.now(timezone.utc).date()
    start = end - timedelta(days=1)

    top_services = ce.get_cost_and_usage(
        TimePeriod={"Start": str(start), "End": str(end)},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )

    # Build a Slack code block listing the eight most expensive services.
    lines = ["*Budget Alert: Top Services (previous UTC day)*", "```"]
    for group in sorted(
        top_services["ResultsByTime"][0]["Groups"],
        key=lambda x: float(x["Metrics"]["UnblendedCost"]["Amount"]),
        reverse=True,
    )[:8]:
        svc = group["Keys"][0]
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        lines.append(f"{svc:40s} ${cost:>9,.2f}")
    lines.append("```")
    lines.append("Run `make cost-triage` for resource-level breakdown.")
    message = "\n".join(lines)

    # The webhook URL is stored as a SecureString parameter.
    webhook = ssm.get_parameter(
        Name="/finops/slack-webhook", WithDecryption=True
    )["Parameter"]["Value"]

    payload = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        webhook, data=payload, headers={"Content-Type": "application/json"}
    )
    # A timeout keeps a slow webhook from consuming the Lambda's full duration.
    urllib.request.urlopen(req, timeout=10)
    return {"status": "ok"}
```

Cost SLO Dashboard Components
```mermaid
flowchart TD
    subgraph Dashboard["Cost SLO Dashboard"]
        BR[Burn Rate Gauge - 1.0x target]
        MTD[MTD vs Expected Spend]
        PROJ[30-day Projection vs Budget]
        TPR[Team Budget Status - RAG]
        TOP[Top Cost Movers - 7d]
        EFF[Cost per Request Trend]
    end
    subgraph Alerts["Alert Channels"]
        WARN[Slack warning at 1.5x burn]
        CRIT[PagerDuty critical at 2.0x burn]
        FORE[Budget forecast alert at 100%]
    end
    Dashboard --> Alerts
```
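The routing between the dashboard and the alert channels can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed implementation: the channel names are plain strings standing in for real integrations, and the RAG cut-offs (amber above 80% of budget, red above 100%) are hypothetical values chosen to match the forecast thresholds used earlier.

```python
from typing import Optional


def alert_channel(burn_rate: float) -> Optional[str]:
    """Pick the alert channel for a burn rate, or None when no alert is due."""
    if burn_rate >= 2.0:
        return "pagerduty"  # critical
    if burn_rate >= 1.5:
        return "slack"      # warning
    return None


def rag_status(projected_spend: float, budget: float) -> str:
    """Red/amber/green status for a team from projected spend vs budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = projected_spend / budget
    if ratio > 1.0:
        return "RED"
    if ratio > 0.8:  # assumed amber cut-off
        return "AMBER"
    return "GREEN"


if __name__ == "__main__":
    # Hypothetical teams: (projected month-end spend, monthly budget) in USD.
    teams = {
        "payments": (9_000.0, 10_000.0),
        "search": (13_000.0, 12_000.0),
        "platform": (4_000.0, 8_000.0),
    }
    for team, (projected, budget) in teams.items():
        print(f"{team:10s} {rag_status(projected, budget)}")
    print("burn 1.7x ->", alert_channel(1.7))
```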
Key Takeaways
- Burn rate is the correct framing for budget alerts — a threshold on total spend gives you a single alert when it is too late; burn rate gives you time to act.
- Per-unit cost metrics (cost per request, cost per active user) are the only way to separate efficiency from growth; raw dollar increases are expected as you scale, deteriorating unit economics are not.
- AWS Budgets forecasted-spend alerts fire before you overspend, not after, so always add a `FORECASTED` threshold alongside `ACTUAL` thresholds.
- A cost incident workflow with a 4-hour response SLA for critical burns treats overspend with the same seriousness as a P1 availability incident; this cultural signal is more important than the specific alert threshold chosen.
- The cost SLO YAML in your service repository makes cost targets a first-class engineering artifact, visible in code review and tracked in the same system as reliability SLOs.
- This series closes where it started — the bill is a feature. When cost has burn rates, alerts, incidents, and SLOs, it behaves like every other reliability concern your engineering organization already knows how to manage.