Cost and the ROI Conversation
Part 9 of the Observability in Depth series (previous: Alert Design).
Every observability series should end with the conversation teams hate having: how much does all this cost, and is it worth it? The typical engineering answer — "we need it, it's not negotiable" — is correct but unhelpful. Finance hears that as "we don't know what we're spending or why."
The better answer is a cost model with line items, a set of controllable levers, and a benefits frame that translates technical value into business outcomes. That is what this post builds.
Where the Money Goes
Observability spend clusters into five categories. Knowing which is largest in your environment determines which levers matter most:
Log ingest dominates because logs are high-volume and teams rarely apply sampling. The good news is that log cost is the most controllable — sampling and retention tiering (covered in Post 2) regularly reduce it by 60–80%.
Building a Cost Model Per Signal
You cannot reduce what you cannot measure. Start by attributing cost to each signal type.
Metrics cost (Grafana Cloud / Prometheus):
Grafana Cloud charges per active series per month. Query your series count by job to attribute cost:
# Series count per service — multiply by your per-series rate
topk(20,
  count by (job) ({__name__=~".+"})
)

At $8/million series-month (approximate Grafana Cloud pricing as of mid-2025), a service with 500k series costs $4/month. Multiply by your service count.
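The per-service arithmetic is simple enough to script. A minimal sketch in Python, treating the $8/million-series-month figure above as an assumed example rate rather than a quoted price:

```python
# Estimate monthly metrics cost from active-series counts.
# RATE is the approximate figure quoted above, not a contract price.
RATE_PER_MILLION_SERIES = 8.00  # USD per million series per month (assumed)

def metrics_cost_by_job(series_counts):
    """Map each job to its estimated monthly cost in USD."""
    return {job: n / 1_000_000 * RATE_PER_MILLION_SERIES
            for job, n in series_counts.items()}

costs = metrics_cost_by_job({"checkout": 500_000, "search": 1_200_000})
print(costs)  # {'checkout': 4.0, 'search': 9.6}
```

Feed it the `topk` output from the query above and you have the metrics line item for the finance conversation.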
Log cost (Loki / Datadog / Splunk):
Log platforms charge on ingest GB. Query your ingest rate by service:
# Loki: ingest bytes by service label over last 7 days
sum by (service) (
  bytes_over_time({env="production"}[7d])
)

For Datadog, the same data is in the Usage & Cost dashboard. Export it as CSV for the finance conversation.
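A CSV export rolls up into a per-service cost line with a few lines of scripting. A sketch, where the column names (`service`, `ingested_bytes`) and the $0.50/GB rate are assumptions to adapt to whatever your platform's export actually contains:

```python
import csv
import io

# Roll a usage export up into estimated monthly cost per service.
# Column names and rate are assumptions -- match your platform's export.
COST_PER_GB = 0.50  # USD, example rate

def cost_by_service(csv_text):
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        gb = int(row["ingested_bytes"]) / 1e9
        totals[row["service"]] = totals.get(row["service"], 0.0) + gb * COST_PER_GB
    return totals

sample = "service,ingested_bytes\ncheckout,200000000000\nsearch,50000000000\n"
print(cost_by_service(sample))  # {'checkout': 100.0, 'search': 25.0}
```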
Trace cost (Tempo / Datadog APM / Honeycomb):
Trace platforms typically charge on spans stored per month or GB ingested. Compute your span volume:
# OTel Collector spans accepted per service
sum by (service_name) (
  rate(otelcol_receiver_accepted_spans_total[1h])
) * 3600 * 24 * 30  # monthly projection

The Cost-Per-Signal Reduction Playbook
| Signal | Lever | Typical reduction | Risk |
|---|---|---|---|
| Logs | Sampling (trace-aware) | 60–80% | Lose debug context if misconfigured |
| Logs | Retention tiering | 20–40% | Low — older logs just move to cold |
| Metrics | Drop high-cardinality labels | 40–70% | Lose label dimension in queries |
| Metrics | Recording rules + raw drop | 30–50% | Lose raw-series resolution for ad-hoc queries |
| Traces | Tail sampling (OTel) | 80–95% | Lose tail of distribution if policy wrong |
| Traces | Head sampling | 80–95% | Decision made before outcome is known; errored traces may be dropped |
The safest starting point is always tail sampling for traces and retention tiering for logs — tail sampling keeps full fidelity for the traces it retains, and tiering keeps older logs queryable in cheaper storage while reducing spend.
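Applying the table's typical reductions to your current per-signal spend gives a quick prioritization. A sketch using midpoints of the ranges above; the spend figures in the example call are placeholders:

```python
# Rank the playbook's levers by expected monthly savings for your spend mix.
# Reduction fractions are midpoints of the "typical reduction" ranges above.
LEVERS = [
    ("logs",    "trace-aware sampling",         0.70),
    ("logs",    "retention tiering",            0.30),
    ("metrics", "drop high-cardinality labels", 0.55),
    ("traces",  "tail sampling",                0.875),
]

def rank_levers(monthly_spend):
    """Return (lever, estimated USD savings/month) sorted by impact."""
    savings = [(f"{signal}: {name}", monthly_spend.get(signal, 0) * fraction)
               for signal, name, fraction in LEVERS]
    return sorted(savings, key=lambda s: -s[1])

for lever, usd in rank_levers({"logs": 9_000, "metrics": 4_000, "traces": 2_000}):
    print(f"{lever}: ${usd:,.0f}/month saved")
```

Note the estimates are independent, not cumulative — applying two levers to the same signal does not add linearly, since the second lever operates on the already-reduced volume.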
Implementing Cost Attribution by Team
If you run a multi-team platform, cost attribution drives accountability. Add a team label to every resource and aggregate in Prometheus:
# OTel Collector: enrich spans with team label from k8s namespace
processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
  transform/team_label:
    trace_statements:
      - context: resource
        statements:
          - set(attributes["team"], ExtractPatterns(attributes["k8s.namespace.name"], "^(?P<team>[a-z]+)-.*")["team"])

Then build a Grafana dashboard showing cost per team:
# Estimated monthly log cost by team ($0.50/GB, example rate)
# Assumes a team label is present on log streams (e.g. via the enrichment above)
sum by (team) (
  bytes_over_time({env="production"}[30d])
) / 1e9 * 0.50

Show this dashboard in monthly engineering leadership reviews. Teams that see their cost line item consistently find optimizations.
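The namespace-to-team regex in the transform processor is worth sanity-checking offline before shipping it. A quick Python equivalent of the same pattern:

```python
import re

# Offline check of the namespace -> team pattern from the collector config:
# the team is the lowercase prefix of the namespace before the first hyphen.
TEAM_PATTERN = re.compile(r"^(?P<team>[a-z]+)-.*")

def team_from_namespace(namespace):
    match = TEAM_PATTERN.match(namespace)
    return match.group("team") if match else None

print(team_from_namespace("payments-prod"))  # payments
print(team_from_namespace("kube-system"))    # kube
print(team_from_namespace("default"))        # None -- no hyphen, no match
```

Note the `kube-system` case: system namespaces will match too, so exclude them upstream or maintain an explicit override map for namespaces that don't follow the team-prefix convention.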
The ROI Frame for Finance
Finance responds to one of three frames: cost avoidance, revenue protection, or productivity. For observability, all three apply.
Frame 1: Cost avoidance via faster MTTR
Average incident cost = (revenue at risk per hour) × (MTTR hours)
                      + (engineer cost per incident)

Before observability investment:
  MTTR = 4h average
  Revenue at risk = $50,000/h
  Engineer cost = $2,000/incident
  Average incident cost = $202,000

After investment (MTTR drops to 45 minutes):
  Average incident cost = $37,500 + $500 = $38,000

Savings per incident = $164,000
With 8 major incidents/year:
  Annual savings = $1,312,000
  Observability spend = $180,000/year
  ROI ≈ 629%

This frame is most effective because it uses numbers that finance already tracks — P1 incident costs and MTTR appear in SLA reports and post-mortems.
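The arithmetic above is worth keeping as a small script so the inputs (MTTR, revenue at risk, incident count) are easy to revisit each quarter. All figures below are the example values from the frame:

```python
# Incident-cost arithmetic for the MTTR cost-avoidance frame.
def incident_cost(mttr_hours, revenue_at_risk_per_hour, engineer_cost):
    return mttr_hours * revenue_at_risk_per_hour + engineer_cost

before = incident_cost(4.0, 50_000, 2_000)   # $202,000
after = incident_cost(0.75, 50_000, 500)     # $38,000
annual_savings = (before - after) * 8        # 8 major incidents/year
spend = 180_000                              # annual observability spend
roi_pct = (annual_savings - spend) / spend * 100

print(f"annual savings: ${annual_savings:,.0f}")
print(f"ROI: {roi_pct:.0f}%")
```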
Frame 2: Revenue protection via SLO compliance
If your SLO is tied to customer contracts, SLO breaches have direct revenue implications. Quantify it:
SLA penalty per 0.1% availability miss = $X (from contracts)
Current SLO attainment without observability = estimated from historical data
Current SLO attainment with observability = measured
Delta × SLA penalty rate = observability value

Frame 3: Engineering productivity
Time spent debugging without observability data is a measurable productivity tax:
Engineers × hours/week debugging production issues (self-reported)
× hourly fully-loaded engineer cost
= annual debugging cost
With full observability, debugging time typically drops 40–60%.

Grafana Dashboard: The Observability Bill
Build a single "observability cost" dashboard visible to engineering leadership:
{
"title": "Observability Cost Overview",
"panels": [
{
"title": "Monthly Log Ingest by Service (GB)",
"type": "bargauge",
"targets": [{"expr": "sum by (service) (bytes_over_time({env='production'}[30d]) / 1e9)"}]
},
{
"title": "Active Metric Series by Job",
"type": "bargauge",
"targets": [{"expr": "topk(15, count by (job)({__name__=~'.+'}))"}]
},
{
"title": "Span Volume by Service (M/month)",
"type": "bargauge",
"targets": [{"expr": "sum by(service_name)(rate(otelcol_exporter_sent_spans_total[30d])) * 2592000 / 1e6"}]
},
{
"title": "Estimated Monthly Total Cost",
"type": "stat",
"targets": [{"expr": "scalar(observability_estimated_monthly_cost_usd)"}]
}
]
}

What Good Looks Like
Benchmarks for a typical 50-engineer organization:
| Metric | Concerning | Healthy |
|---|---|---|
| Observability as % of infra spend | > 15% | 5–10% |
| Log cost per million req | > $2.00 | $0.20–$0.60 |
| Mean trace sampling ratio | < 1% (too low) | 5–20% |
| MTTR for P1 incidents | > 2h | < 30m |
| Alert noise ratio | > 30% | < 10% |
Key Takeaways
- Logs dominate observability spend (typically 40%) and are the most controllable — sampling and retention tiering return the highest cost-reduction per engineering hour invested.
- Cost attribution by team makes spend visible and creates organic pressure for efficiency; a team that sees their $8k/month log bill finds optimizations that a central platform team would miss.
- The ROI frame finance responds to is MTTR-based cost avoidance — quantify the average incident cost before and after, multiply by annual incident count, compare to observability spend.
- Tail sampling for traces and retention tiering for logs are the lowest-risk, highest-impact reduction levers — tail sampling keeps full detail for the traces it retains, and tiering keeps older logs available in cheap storage.
- An "observability bill" dashboard visible to engineering leadership is more effective at driving cost discipline than quarterly budget reviews.
- Observability spend at 5–10% of total infrastructure cost is the healthy range; above 15% indicates instrumentation sprawl, unchecked cardinality, or sampling policy gaps that this series has equipped you to fix.