
Metrics: Cardinality and the Bill

Ravinder · 5 min read
Observability · Telemetry · Metrics · Prometheus · Cardinality

Prometheus metrics are cheap — until they aren't. The cost model is deceptively simple: you pay per active time series, and the number of series is the Cartesian product of all label value combinations. Add one user_id label to a metric with 100k users and you have just multiplied your series count by 100,000. Most TSDB bills that surprise engineering teams trace back to this one mistake.

This post unpacks cardinality, shows how to diagnose and fix label bloat, and introduces exemplars — the lightweight mechanism that lets you go from a metric anomaly directly to a trace without high-cardinality overhead.

The Cardinality Equation

flowchart LR
    M["http_requests_total"] --> L1["method: GET/POST/PUT/DELETE\n(4 values)"]
    M --> L2["status: 200/400/429/500/503\n(5 values)"]
    M --> L3["service: checkout/payment/catalog\n(3 values)"]
    M --> L4["user_id: 100,000 values ⚠️"]
    L1 & L2 & L3 --> Safe["60 series\n✓ Manageable"]
    L4 --> Explosion["60 × 100,000 = 6,000,000 series\n✗ TSDB explosion"]

The rule: label values must be bounded and low-cardinality. Good labels describe categories (method, status class, region, service name). Bad labels describe individuals (user_id, order_id, request_id, IP address, full URL path with query params).
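To make the distinction concrete, here is a minimal Go sketch (instrument and label names are illustrative, not from any particular codebase) of a counter whose every label draws from a small fixed set:

import "github.com/prometheus/client_golang/prometheus"

// Worst case is |methods| × |status classes| × |services|:
// a few dozen series, no matter how much traffic flows through.
var httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "HTTP requests by bounded dimensions.",
    },
    []string{"method", "status_class", "service"},
)

func recordRequest(method, statusClass, service string) {
    httpRequests.WithLabelValues(method, statusClass, service).Inc()
}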

Diagnosing Cardinality Problems

Before you can fix cardinality you need to see it. Prometheus exposes its own internals:

# Top 10 metrics by series count
topk(10, count by (__name__)({__name__=~".+"}))
 
# Series count per job
count by (job) ({__name__=~".+"})
 
# Rate of new series creation (cardinality growth)
rate(prometheus_tsdb_head_series_created_total[5m])
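If you suspect a specific label, count its distinct values directly. A sketch, with metric and label names standing in for your own:

# Distinct user_id values on a single metric
count(count by (user_id) (http_requests_total))

# Total series for one metric name
count(http_requests_total)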

In Grafana Mimir, use the cardinality analysis endpoint:

# Mimir cardinality API
curl -H "X-Scope-OrgID: tenant1" \
  "http://mimir:8080/api/v1/cardinality/label_names?limit=20"
 
# Returns label names sorted by series count
# {"label_names":[{"label_name":"user_id","series_count":4280312}, ...]}

For ongoing governance, alert on series growth rate:

# Prometheus alert: cardinality spike
groups:
  - name: cardinality
    rules:
      - alert: CardinalityExplosion
        expr: |
          deriv(prometheus_tsdb_head_series[10m]) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TSDB series growing at {{ $value }}/s — check for label bloat"

Fixing Label Bloat: Practical Patterns

Pattern 1: Drop the label entirely at scrape time

# prometheus.yml — metric_relabel_configs
scrape_configs:
  - job_name: checkout-api
    metric_relabel_configs:
      # Drop user_id label before storage
      - action: labeldrop
        regex: user_id
 
      # Truncate high-cardinality URL paths to first two segments
      - source_labels: [path]
        regex: '^(/[^/]+/[^/]+).*'
        target_label: path
        replacement: '$1'
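One caveat: if user_id was the only label distinguishing two series, dropping it makes them collide within a single scrape, and Prometheus rejects the duplicates, so prefer removing the label at the instrumentation source when possible. Either way, validate relabel rules before reloading; promtool ships with Prometheus:

# Validate prometheus.yml, including metric_relabel_configs
promtool check config prometheus.yml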

Pattern 2: Bucket instead of enumerate

If you need user-level insight, bucket users into cohorts:

import (
    "fmt"
    "hash/fnv"
)

// Instead of label{user_id: "usr_8823"},
// use label{user_tier: "premium"} or label{user_cohort: "cohort_42"}.
func userCohort(userID string) string {
    h := fnv.New32a()
    h.Write([]byte(userID)) // FNV-1a: fast, stable across restarts
    return fmt.Sprintf("cohort_%d", h.Sum32()%100) // hash into 100 fixed cohorts
}
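A hypothetical call site, assuming a CounterVec (the name checkout_requests_total is illustrative) registered with a user_cohort label:

// 100,000 users collapse into at most 100 label values.
var requestsByCohort = prometheus.NewCounterVec(
    prometheus.CounterOpts{Name: "checkout_requests_total", Help: "Requests by user cohort."},
    []string{"user_cohort"},
)

func recordUserRequest(userID string) {
    requestsByCohort.WithLabelValues(userCohort(userID)).Inc()
}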

Pattern 3: Drop attributes in the OTel Collector (or earlier, via SDK views)

# OTel Collector transform processor
processors:
  transform/drop_high_card:
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "user_id")
          - delete_key(attributes, "request_id")
          - delete_key(attributes, "trace_id")

Note: trace_id as a metric label is a classic mistake. Use exemplars instead (see below).
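The same pruning can happen one hop earlier, at the SDK level, via the metrics view API, so high-cardinality attributes never leave the process. A sketch with the OTel Go SDK; the instrument name and allowed keys are illustrative:

import (
    "go.opentelemetry.io/otel/attribute"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func newMeterProvider() *sdkmetric.MeterProvider {
    // Keep only bounded attributes on this instrument; unlisted keys
    // (user_id, request_id, ...) are dropped before aggregation.
    view := sdkmetric.NewView(
        sdkmetric.Instrument{Name: "http.server.request.duration"},
        sdkmetric.Stream{
            AttributeFilter: attribute.NewAllowKeysFilter(
                "http.request.method", "http.response.status_code",
            ),
        },
    )
    return sdkmetric.NewMeterProvider(sdkmetric.WithView(view))
}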

Histogram Buckets: The Silent Cardinality Multiplier

Every histogram bucket is its own time series, and each label combination also carries a _sum and a _count series. A histogram with 15 buckets across 4 label combinations generates (15 + 2) × 4 = 68 series for a single metric name. Multiply across services and histograms dominate your series count.

# Count histogram series
count({__name__=~".+_bucket"})

Prefer native histograms (Prometheus 2.40+, behind the --enable-feature=native-histograms flag), which store all bucket data compactly in a single series instead of one series per bucket:

// Go — register a native histogram
import "github.com/prometheus/client_golang/prometheus"
 
var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:                        "http_request_duration_seconds",
    Help:                        "HTTP request latency",
    NativeHistogramBucketFactor: 1.1, // enables native histogram
})

Native histograms reduce series count by ~10× for typical histogram configurations.
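Query syntax changes with them: histogram_quantile takes the rate of the native histogram directly, with no _bucket suffix and no le label. A sketch against the metric above:

# Classic histogram
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Native histogram
histogram_quantile(0.99, sum(rate(http_request_duration_seconds[5m])))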

Exemplars: The Bridge to Traces

Exemplars let you attach a trace_id to a specific histogram observation without creating a per-trace-id time series. The exemplar is stored as metadata on the bucket, not as a label, and its label set is capped (the OpenMetrics spec allows at most 128 UTF-8 characters), so exemplars cannot balloon cardinality.

// Go — emit exemplar with histogram observation
// (span is the active OTel span, e.g. from trace.SpanFromContext(ctx))
requestDuration.(prometheus.ExemplarObserver).ObserveWithExemplar(
    elapsed.Seconds(),
    prometheus.Labels{
        "trace_id": span.SpanContext().TraceID().String(),
    },
)

Enable exemplar storage in Prometheus:

# Exemplar storage is enabled with a feature flag at startup:
#   prometheus --enable-feature=exemplar-storage

# prometheus.yml
global:
  scrape_interval: 15s

storage:
  exemplars:
    max_exemplars: 100000

In Grafana, enable the exemplar overlay on any histogram panel:

{
  "type": "timeseries",
  "options": {
    "tooltip": {"mode": "single"}
  },
  "fieldConfig": {
    "defaults": {
      "custom": {
        "hideFrom": {"tooltip": false}
      }
    }
  },
  "datasource": {"type": "prometheus"},
  "targets": [
    {
      "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
      "exemplar": true,
      "legendFormat": "P99 latency"
    }
  ]
}

Now when P99 spikes, you hover the metric graph, click the exemplar dot, and jump straight to the trace — no grep, no correlation by hand.
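For that click-through to work, the Prometheus data source in Grafana must know where trace IDs resolve. In a provisioned setup that looks roughly like the sketch below; the Tempo UID is an assumption, substitute your tracing backend's:

# grafana datasource provisioning (sketch)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id        # must match the exemplar label name
          datasourceUid: tempo  # assumed UID of the Tempo data source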

Recording Rules: Pre-Aggregate Before You Query

High-cardinality raw metrics can still be useful if you pre-aggregate them into lower-cardinality recording rules before dashboards query them:

# prometheus-rules.yaml
groups:
  - name: aggregated_metrics
    interval: 60s
    rules:
      # Pre-aggregate request rate by service (drop per-pod detail)
      - record: job:http_requests:rate5m
        expr: |
          sum by (job, method, status_class) (
            rate(http_requests_total[5m])
          )
 
      # Compute error ratio at the job level
      - record: job:http_error_ratio:rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

Dashboards and alerts query the recording rules, not the raw series. The raw series can have shorter retention or even be dropped after the recording rule interval.
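Downstream consumers then touch only the cheap series. For example, an SLO-style alert expression on the recorded ratio (the 1% threshold is illustrative):

# Fires on the pre-aggregated ratio, never on raw per-pod series
job:http_error_ratio:rate5m{job="checkout-api"} > 0.01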

Key Takeaways

  • Time-series count is the product of all label value cardinalities — one unbounded label (user_id, trace_id, IP) can multiply your series count by millions.
  • Diagnose cardinality with topk(10, count by (__name__)({__name__=~".+"})) and Mimir's cardinality API before billing surprises you.
  • Drop or bucket high-cardinality labels at the Collector or scrape layer — never let them reach TSDB storage.
  • Native histograms (Prometheus 2.40+) eliminate per-bucket series, cutting histogram cardinality by roughly 10×.
  • Exemplars are the correct mechanism for trace correlation — they carry trace_id as metadata, not as a label, so they add zero time series overhead.
  • Recording rules are your pre-aggregation layer — dashboards should query rules, not raw high-cardinality series.