Cost and the ROI Conversation
Part 9 of the Observability in Depth series (previous: Alert Design).
Every observability series should end with the conversation teams hate having: how much does all this cost, and is it worth it? The typical engineering answer — "we need it, it's not negotiable" — is correct but unhelpful. Finance hears that as "we don't know what we're spending or why."
The better answer is a cost model with line items, a set of controllable levers, and a benefits frame that translates technical value into business outcomes. That is what this post builds.
Where the Money Goes
Observability spend clusters into five categories. Knowing which is largest in your environment determines which levers matter most:
Log ingest dominates because logs are high-volume and teams rarely apply sampling. The good news is that log cost is the most controllable — sampling and retention tiering (covered in Post 2) regularly reduce it by 60–80%.
Building a Cost Model Per Signal
You cannot reduce what you cannot measure. Start by attributing cost to each signal type.
Metrics cost (Grafana Cloud / Prometheus):
Grafana Cloud charges per active series per month. Query your series count by job to attribute cost:
# Series count per service — multiply by your per-series rate
topk(20,
  count by (job) ({__name__=~".+"})
)

At $8/million series-month (approximate Grafana Cloud pricing as of mid-2025), a service with 500k series costs $4/month. Multiply by your service count.
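The per-service arithmetic is simple enough to script. A minimal sketch in Python, treating the $8/million-series-month figure above as an assumed example rate rather than a quoted price:

```python
# Estimate monthly metrics cost from active-series counts.
# RATE is the approximate figure quoted above, not a contract price.
RATE_PER_MILLION_SERIES = 8.00  # USD per million series per month (assumed)

def metrics_cost_by_job(series_counts):
    """Map each job to its estimated monthly cost in USD."""
    return {job: n / 1_000_000 * RATE_PER_MILLION_SERIES
            for job, n in series_counts.items()}

costs = metrics_cost_by_job({"checkout": 500_000, "search": 1_200_000})
print(costs)  # {'checkout': 4.0, 'search': 9.6}
```

Feed it the `topk` output from the query above and you have the metrics line item for the finance conversation.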
Log cost (Loki / Datadog / Splunk):
Log platforms charge on ingest GB. Query your ingest rate by service:
# Loki: ingest bytes by service label over last 7 days
sum by (service) (
  bytes_over_time({env="production"}[7d])
)

For Datadog, the same data is in the Usage & Cost dashboard. Export it as CSV for the finance conversation.
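A CSV export rolls up into a per-service cost line with a few lines of scripting. A sketch, where the column names (`service`, `ingested_bytes`) and the $0.50/GB rate are assumptions to adapt to whatever your platform's export actually contains:

```python
import csv
import io

# Roll a usage export up into estimated monthly cost per service.
# Column names and rate are assumptions -- match your platform's export.
COST_PER_GB = 0.50  # USD, example rate

def cost_by_service(csv_text):
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        gb = int(row["ingested_bytes"]) / 1e9
        totals[row["service"]] = totals.get(row["service"], 0.0) + gb * COST_PER_GB
    return totals

sample = "service,ingested_bytes\ncheckout,200000000000\nsearch,50000000000\n"
print(cost_by_service(sample))  # {'checkout': 100.0, 'search': 25.0}
```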
Trace cost (Tempo / Datadog APM / Honeycomb):
Trace platforms typically charge on spans stored per month or GB ingested. Compute your span volume:
# OTel Collector spans accepted per service
sum by (service_name) (
  rate(otelcol_receiver_accepted_spans_total[1h])
) * 3600 * 24 * 30  # monthly projection

The Cost-Per-Signal Reduction Playbook
| Signal | Lever | Typical reduction | Risk |
|---|---|---|---|
| Logs | Sampling (trace-aware) | 60–80% | Lose debug context if misconfigured |
| Logs | Retention tiering | 20–40% | Low — older logs just move to cold |
| Metrics | Drop high-cardinality labels | 40–70% | Lose label dimension in queries |
| Metrics | Recording rules + raw drop | 30–50% | Lose raw-series resolution for ad-hoc queries |
| Traces | Tail sampling (OTel) | 80–95% | Lose tail of distribution if policy wrong |
| Traces | Head sampling | 80–95% | Decision made before outcome is known; errored traces may be dropped |
The safest starting point is always tail sampling for traces and retention tiering for logs — tail sampling keeps full fidelity for the traces it retains, and tiering keeps older logs queryable in cheaper storage while reducing spend.
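Applying the table's typical reductions to your current per-signal spend gives a quick prioritization. A sketch using midpoints of the ranges above; the spend figures in the example call are placeholders:

```python
# Rank the playbook's levers by expected monthly savings for your spend mix.
# Reduction fractions are midpoints of the "typical reduction" ranges above.
LEVERS = [
    ("logs",    "trace-aware sampling",         0.70),
    ("logs",    "retention tiering",            0.30),
    ("metrics", "drop high-cardinality labels", 0.55),
    ("traces",  "tail sampling",                0.875),
]

def rank_levers(monthly_spend):
    """Return (lever, estimated USD savings/month) sorted by impact."""
    savings = [(f"{signal}: {name}", monthly_spend.get(signal, 0) * fraction)
               for signal, name, fraction in LEVERS]
    return sorted(savings, key=lambda s: -s[1])

for lever, usd in rank_levers({"logs": 9_000, "metrics": 4_000, "traces": 2_000}):
    print(f"{lever}: ${usd:,.0f}/month saved")
```

Note the estimates are independent, not cumulative — applying two levers to the same signal does not add linearly, since the second lever operates on the already-reduced volume.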
Implementing Cost Attribution by Team
If you run a multi-team platform, cost attribution drives accountability. Add a team label to every resource and aggregate in Prometheus:
# OTel Collector: enrich spans with team label from k8s namespace
processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
  transform/team_label:
    trace_statements:
      - context: resource
        statements:
          - set(attributes["team"], ExtractPatterns(attributes["k8s.namespace.name"], "^(?P<team>[a-z]+)-.*")["team"])

Then build a Grafana dashboard showing cost per team:
# Estimated monthly log cost by team ($0.50/GB, example rate)
# Assumes a team label is present on log streams (e.g. via the enrichment above)
sum by (team) (
  bytes_over_time({env="production"}[30d])
) / 1e9 * 0.50

Show this dashboard in monthly engineering leadership reviews. Teams that see their cost line item consistently find optimizations.
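The namespace-to-team regex in the transform processor is worth sanity-checking offline before shipping it. A quick Python equivalent of the same pattern:

```python
import re

# Offline check of the namespace -> team pattern from the collector config:
# the team is the lowercase prefix of the namespace before the first hyphen.
TEAM_PATTERN = re.compile(r"^(?P<team>[a-z]+)-.*")

def team_from_namespace(namespace):
    match = TEAM_PATTERN.match(namespace)
    return match.group("team") if match else None

print(team_from_namespace("payments-prod"))  # payments
print(team_from_namespace("kube-system"))    # kube
print(team_from_namespace("default"))        # None -- no hyphen, no match
```

Note the `kube-system` case: system namespaces will match too, so exclude them upstream or maintain an explicit override map for namespaces that don't follow the team-prefix convention.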
The ROI Frame for Finance
Finance responds to one of three frames: cost avoidance, revenue protection, or productivity. For observability, all three apply.
Frame 1: Cost avoidance via faster MTTR
Average incident cost = (revenue at risk per hour) × (MTTR hours)
                      + (engineer cost per incident)

Before observability investment:
  MTTR = 4h average
  Revenue at risk = $50,000/h
  Engineer cost = $2,000/incident
  Average incident cost = $202,000

After investment (MTTR drops to 45 minutes):
  Average incident cost = $37,500 + $500 = $38,000

Savings per incident = $164,000
With 8 major incidents/year:
  Annual savings = $1,312,000
  Observability spend = $180,000/year
  ROI ≈ 629%

This frame is most effective because it uses numbers that finance already tracks — P1 incident costs and MTTR appear in SLA reports and post-mortems.
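The arithmetic above is worth keeping as a small script so the inputs (MTTR, revenue at risk, incident count) are easy to revisit each quarter. All figures below are the example values from the frame:

```python
# Incident-cost arithmetic for the MTTR cost-avoidance frame.
def incident_cost(mttr_hours, revenue_at_risk_per_hour, engineer_cost):
    return mttr_hours * revenue_at_risk_per_hour + engineer_cost

before = incident_cost(4.0, 50_000, 2_000)   # $202,000
after = incident_cost(0.75, 50_000, 500)     # $38,000
annual_savings = (before - after) * 8        # 8 major incidents/year
spend = 180_000                              # annual observability spend
roi_pct = (annual_savings - spend) / spend * 100

print(f"annual savings: ${annual_savings:,.0f}")
print(f"ROI: {roi_pct:.0f}%")
```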
Frame 2: Revenue protection via SLO compliance
If your SLO is tied to customer contracts, SLO breaches have direct revenue implications. Quantify it:
SLA penalty per 0.1% availability miss = $X (from contracts)
Current SLO attainment without observability = estimated from historical data
Current SLO attainment with observability = measured
Delta × SLA penalty rate = observability value

Frame 3: Engineering productivity
Time spent debugging without observability data is a measurable productivity tax:
Engineers × hours/week debugging production issues (self-reported)
× hourly fully-loaded engineer cost
= annual debugging cost
With full observability, debugging time typically drops 40–60%.

Grafana Dashboard: The Observability Bill
Build a single "observability cost" dashboard visible to engineering leadership:
{
"title": "Observability Cost Overview",
"panels": [
{
"title": "Monthly Log Ingest by Service (GB)",
"type": "bargauge",
"targets": [{"expr": "sum by (service) (bytes_over_time({env='production'}[30d]) / 1e9)"}]
},
{
"title": "Active Metric Series by Job",
"type": "bargauge",
"targets": [{"expr": "topk(15, count by (job)({__name__=~'.+'}))"}]
},
{
"title": "Span Volume by Service (M/month)",
"type": "bargauge",
"targets": [{"expr": "sum by(service_name)(rate(otelcol_exporter_sent_spans_total[30d])) * 2592000 / 1e6"}]
},
{
"title": "Estimated Monthly Total Cost",
"type": "stat",
"targets": [{"expr": "scalar(observability_estimated_monthly_cost_usd)"}]
}
]
}

What Good Looks Like
Benchmarks for a typical 50-engineer organization:
| Metric | Concerning | Healthy |
|---|---|---|
| Observability as % of infra spend | > 15% | 5–10% |
| Log cost per million req | > $2.00 | $0.20–$0.60 |
| Mean trace sampling ratio | < 1% (too low) | 5–20% |
| MTTR for P1 incidents | > 2h | < 30m |
| Alert noise ratio | > 30% | < 10% |
Key Takeaways
- Logs dominate observability spend (typically 40%) and are the most controllable — sampling and retention tiering return the highest cost-reduction per engineering hour invested.
- Cost attribution by team makes spend visible and creates organic pressure for efficiency; a team that sees their $8k/month log bill finds optimizations that a central platform team would miss.
- The ROI frame finance responds to is MTTR-based cost avoidance — quantify the average incident cost before and after, multiply by annual incident count, compare to observability spend.
- Tail sampling for traces and retention tiering for logs are the lowest-risk, highest-impact reduction levers — tail sampling keeps full detail for the traces it retains, and tiering keeps older logs available in cheap storage.
- An "observability bill" dashboard visible to engineering leadership is more effective at driving cost discipline than quarterly budget reviews.
- Observability spend at 5–10% of total infrastructure cost is the healthy range; above 15% indicates instrumentation sprawl, unchecked cardinality, or sampling policy gaps that this series has equipped you to fix.