Cardinality: The Metric You'll Wish You Watched

Ravinder · 8 min read
Engineering · Observability · Metrics · Cost Optimization · Prometheus

The bill arrived on a Tuesday. Our Prometheus-compatible managed metrics service had been charging us based on active time series. We had 40 million. Our SRE thought we had around 4 million. Most of the discrepancy traced back to a single label added six weeks earlier by a backend engineer who had named it user_id. It was populated on every request. We had 800,000 active users.

800,000 users × 12 existing label combinations = 9.6 million new time series. Overnight.

The engineer who added it didn't know what cardinality meant in the context of metrics. The PR reviewer didn't catch it. The alert that should have fired on cardinality growth didn't exist. The bill was the first signal.

This is cardinality's defining property: it is multiplicative, it is quiet, and by the time you see it in your monitoring bill, the damage is weeks old.

How Cardinality Works

A time series is the unique combination of a metric name and its label set. Every additional label you introduce multiplies your series count by the number of distinct values that label can take.

http_request_duration_seconds
  {method="GET", status="200", endpoint="/api/orders"} → 1 series
  {method="GET", status="404", endpoint="/api/orders"} → 1 series
  {method="POST", status="200", endpoint="/api/orders"} → 1 series
  ...

With method (4 values), status (5 values), and endpoint (20 values), you have 4 × 5 × 20 = 400 series for that one metric. Add user_id with 800,000 values and you have 400 × 800,000 = 320,000,000 series. Prometheus will accept this, at least briefly, before falling over.
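
The same multiplication as a sketch you can run; the per-label counts are the ones from the example above:

# Series count is the product of distinct values across all labels.
label_values = {"method": 4, "status": 5, "endpoint": 20}

series = 1
for distinct in label_values.values():
    series *= distinct
print(series)            # 400

print(series * 800_000)  # 320000000 after adding user_id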

[Chart: "Series Count Explosion by Label Cardinality". Active time series (millions, log scale) for the baseline metric with 3 labels, then after adding region (8 values), env (3 values), and user_id (800K values).]

The multiplicative nature means the damage is rarely incremental. It looks fine, then it looks fine, then it's catastrophic.

The Cost Model

Different metrics backends charge differently, but they all penalize cardinality:

  • Prometheus (self-hosted): Memory scales with active series. Prometheus keeps the last 2 hours of data in RAM. At 800 bytes per series, 40M series = 32 GB RAM minimum. Query performance degrades sharply above ~10M series.
  • Managed Prometheus / Thanos: Active series-month pricing. $0.30–$0.50 per 1,000 series per month at common providers. 40M series = $12K–$20K/month on metrics alone (worked through in the sketch after this list).
  • Datadog custom metrics: Charges per unique metric-tag combination, per host, per hour. A single high-cardinality tag on a heavily used metric can double your custom metrics bill.
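
To make the arithmetic in those bullets concrete, here is a rough back-of-the-envelope estimator. The 800 bytes per series and the per-1,000-series price are the assumptions stated above, not provider guarantees; substitute your own numbers.

def estimate_metrics_cost(active_series: int,
                          bytes_per_series: int = 800,
                          usd_per_1k_series_month: float = 0.40) -> dict:
    """Rough sizing for a Prometheus-compatible backend.

    Assumes the figures from the bullets above: ~800 bytes of head-block
    memory per active series, ~$0.30-0.50 per 1,000 series-months.
    """
    head_ram_gb = active_series * bytes_per_series / 1e9  # decimal GB
    monthly_usd = active_series / 1000 * usd_per_1k_series_month
    return {"head_ram_gb": round(head_ram_gb, 1), "monthly_usd": round(monthly_usd)}

print(estimate_metrics_cost(40_000_000))
# {'head_ram_gb': 32.0, 'monthly_usd': 16000}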

The insidious part is that metrics storage is usually provisioned in advance. You don't see cost pressure until the next billing cycle or until you hit storage limits and your monitoring stack starts dropping data — which is when you need it most.

Label Hygiene: The Rules

Good label design is a discipline, not a preference. Here are the rules we enforce.

Rule 1: Labels describe resource attributes, not resource identities.

# BAD: identifies a specific entity
http_requests_total{user_id="usr_77281", order_id="ord_9f2c8a"}
 
# GOOD: describes a property of the request class
http_requests_total{user_tier="enterprise", endpoint_group="orders"}

A label should answer "what kind of thing is this?" not "which specific thing is this?". If the answer is "which specific thing," that data belongs in a trace or a domain event — not a metric label.

Rule 2: Bound your label value sets before you ship.

Before adding a label, explicitly enumerate the distinct values it can take. If you cannot list them all in under 30 seconds, the cardinality is probably too high.

from prometheus_client import Counter

payment_attempts_total = Counter(
    "payment_attempts_total",
    "Payment attempts by method and status",
    ["method", "status"],
)

# Safe label: bounded set, deterministic values
ALLOWED_PAYMENT_METHODS = {"card", "bank_transfer", "wallet", "crypto", "unknown"}

def record_payment_attempt(method: str, status: str):
    # Coerce anything outside the allow-list into the catch-all bucket
    safe_method = method if method in ALLOWED_PAYMENT_METHODS else "unknown"
    payment_attempts_total.labels(
        method=safe_method,
        status=status,
    ).inc()

The unknown bucket is not a workaround — it's load-bearing. It prevents a misbehaving upstream from injecting arbitrary strings into your label space.
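
As a quick illustration of that behaviour (the argument values here are hypothetical), an arbitrary upstream string collapses into the catch-all instead of minting a new series:

record_payment_attempt("paypal_v2_beta", "success")  # counted with method="unknown"
record_payment_attempt("card", "success")            # counted with method="card"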

Rule 3: No request-scoped IDs in labels.

Request IDs, trace IDs, session IDs, user IDs, order IDs — none of these belong in metric labels. They have unbounded cardinality by definition. If you need to correlate a metric with a specific request, use exemplars.

Rule 4: Audit before merge.

Add cardinality review to your observability PR checklist. For any PR that adds or modifies a metric label, require the author to state the expected distinct value count (the query sketch after the checklist shows one way to measure the current count).

## Observability Checklist
- [ ] Any new metric labels have a bounded value set (state max expected cardinality)
- [ ] No request-scoped IDs (user_id, order_id, request_id, trace_id) in labels
- [ ] New metrics have a corresponding Grafana panel or alert
- [ ] Label names follow the team naming convention
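
To back that "max expected cardinality" number with data rather than a guess, you can ask a running Prometheus how many active series a metric already exposes. A minimal sketch using the standard /api/v1/query endpoint; PROM_URL and the metric name are placeholders:

import requests

PROM_URL = "http://prometheus:9090"  # placeholder; point at your instance

def active_series_count(metric_name: str) -> int:
    """Count the active series a metric currently exposes."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": f'count({{__name__="{metric_name}"}})'},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return int(result[0]["value"][1]) if result else 0

print(active_series_count("http_request_duration_seconds_bucket"))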

Exemplars: The Bridge Between Metrics and Traces

Exemplars solve the problem that cardinality tries (badly) to solve. You want to go from "I see elevated p99 latency in the checkout endpoint group" to "show me a specific request that was slow." Without exemplars, you'd add a label like request_id to your histogram — instantly exploding cardinality.

Exemplars attach a sample trace ID to specific histogram observations. You get one trace reference per time window per bucket, stored separately from the main series, without contributing to cardinality.

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "go.opentelemetry.io/otel/trace"
)
 
var checkoutDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "checkout_request_duration_seconds",
        Help:    "Checkout request duration",
        Buckets: prometheus.DefBuckets,
    },
    []string{"endpoint_group", "status_class"},
)
 
func (h *CheckoutHandler) handleRequest(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    spanCtx := trace.SpanFromContext(r.Context()).SpanContext()
 
    // ... handle request ...
 
    duration := time.Since(start).Seconds()
    checkoutDuration.With(prometheus.Labels{
        "endpoint_group": "orders",
        "status_class":   "2xx",
    }).(prometheus.ExemplarObserver).ObserveWithExemplar(
        duration,
        prometheus.Labels{
            "traceID": spanCtx.TraceID().String(),
        },
    )
}

In Grafana, exemplars render as dots on your histogram panel. Click the dot, get a trace ID, jump directly to the Tempo/Jaeger trace for that specific slow request. High cardinality problem solved at the right layer.

Sampling High-Cardinality Dimensions

Sometimes you genuinely need high-cardinality data in your metrics system — for example, per-tenant SLO tracking where you have 5,000 tenants. Sampling is the right approach.

Volume-aware probabilistic sampling on label values gives you a representative subset without exploding series count:

import random
from typing import Optional

from prometheus_client import Histogram

request_duration = Histogram(
    "tenant_request_duration_seconds",
    "Request duration by tenant and status",
    ["tenant", "status"],
)

class CardinalitySampler:
    """
    Emits exact series for high-volume tenants and
    probabilistically samples low-volume tenants at a reduced rate.
    """
    HIGH_VOLUME_THRESHOLD = 1000  # requests/minute
    SAMPLE_RATE = 0.1  # keep 10% of low-volume observations

    def __init__(self, tenant_volume_fn):
        self.tenant_volume = tenant_volume_fn

    def get_tenant_label(self, tenant_id: str) -> Optional[str]:
        volume = self.tenant_volume(tenant_id)
        if volume >= self.HIGH_VOLUME_THRESHOLD:
            return tenant_id  # Emit as a named series
        if random.random() < self.SAMPLE_RATE:
            return f"sampled_{tenant_id}"  # Emit at a reduced rate
        return None  # Drop this observation

    def record_request(self, tenant_id: str, duration: float, status: str):
        label = self.get_tenant_label(tenant_id)
        if label is not None:
            request_duration.labels(tenant=label, status=status).observe(duration)

This gives you exact metrics for your largest tenants (where accuracy matters most) and sampled metrics for the long tail, with bounded total series count.
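
Wiring this up is a judgment call about where tenant volume comes from. Here is one hypothetical setup; the tenant IDs and the static volume lookup exist purely for illustration, and any callable mapping tenant_id to requests per minute would work:

RECENT_RPM = {"tenant_4821": 5200.0, "tenant_0042": 12.0}  # stand-in for a real usage store

sampler = CardinalitySampler(tenant_volume_fn=lambda t: RECENT_RPM.get(t, 0.0))

# In the request handler, after measuring the request:
sampler.record_request(tenant_id="tenant_4821", duration=0.137, status="2xx")  # exact series
sampler.record_request(tenant_id="tenant_0042", duration=0.420, status="2xx")  # kept ~10% of the time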

Alerting on Cardinality Itself

The most important operational practice: alert on cardinality growth before it becomes a cost event.

# Prometheus alerting rules for cardinality
groups:
  - name: cardinality_alerts
    rules:
      - alert: HighCardinalityMetricLabelSet
        expr: |
          topk(10,
            sum by (__name__) (
              count by (__name__, job) ({__name__=~".+"})
            )
          ) > 500000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Metric {{ $labels.__name__ }} has >500K series"
          description: "This metric has {{ $value }} active series. Investigate label cardinality."
 
      - alert: CardinalityGrowthSpike
        expr: |
          (
            sum(prometheus_tsdb_head_series)
            -
            sum(prometheus_tsdb_head_series offset 1h)
          ) / sum(prometheus_tsdb_head_series offset 1h) > 0.25
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus series count grew >25% in the last hour"
          description: "Current series: {{ $value }}. A label with high cardinality may have been added."

The second alert — growth rate, not absolute count — is the one that would have caught the user_id incident. A 25% growth in an hour is anomalous regardless of your baseline.

To find the offending metric after the alert fires:

# Top 20 metrics by series count
topk(20, count by (__name__) ({__name__=~".+"}))
# For a specific metric, count the distinct values of a suspect label
count(count by (user_id) (http_request_duration_seconds_bucket))
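
Running that second query label by label gets tedious, so a small script against the Prometheus HTTP API can sweep every label of a suspect metric. A sketch, assuming a reasonably recent Prometheus (the match[] parameter on /api/v1/labels) and a placeholder PROM_URL:

import requests

PROM_URL = "http://prometheus:9090"  # placeholder; point at your instance

def label_cardinality(metric_name: str) -> dict[str, int]:
    """Distinct value count per label for one metric, worst offenders first."""
    labels = requests.get(
        f"{PROM_URL}/api/v1/labels",
        params={"match[]": metric_name},
        timeout=10,
    ).json()["data"]

    counts = {}
    for label in labels:
        if label == "__name__":
            continue
        result = requests.get(
            f"{PROM_URL}/api/v1/query",
            params={"query": f"count(count by ({label}) ({metric_name}))"},
            timeout=10,
        ).json()["data"]["result"]
        counts[label] = int(result[0]["value"][1]) if result else 0
    return dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))

print(label_cardinality("http_request_duration_seconds_bucket"))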

Key Takeaways

  • Cardinality is multiplicative, not additive. Adding a label with 800K values to a metric with 400 existing series creates 320M series — not 800,400. Think in products, not sums.
  • Labels describe the class of a resource, not its identity. Request IDs, user IDs, and order IDs belong in traces and events — not in metric labels. If you need to drill from a metric to a specific request, use exemplars.
  • Bound every label's value set before merging. If you cannot enumerate all possible values in 30 seconds, the cardinality is too high. Add an unknown catch-all bucket to prevent unbounded injection from upstream.
  • Exemplars bridge the gap between metrics and traces without contributing to series count. They attach a sample trace ID to histogram observations, giving you a "show me a specific example" path without the cardinality cost.
  • Alert on cardinality growth rate — a 25% increase in series count within an hour is a signal that something broke, regardless of absolute baseline. Don't wait for the billing cycle to find out.
  • Volume-aware sampling is the right pattern when you genuinely need per-entity metrics at scale: emit exact series for high-volume entities, sample the long tail, and bound your total series count explicitly.