
Observability Defaults

Ravinder · 5 min read
Platform Engineering · DevOps · IDP · Observability · OpenTelemetry

The first time a service goes on-call and the team stares at a blank Grafana dashboard during an incident, they understand the value of observability defaults. The second time, they stop shipping services without them. The platform team's job is to make sure there's no "first time" — every service that uses the golden path gets working telemetry the moment it deploys.

"Observability" in this context means three things in practice: structured logs that are queryable, metrics that answer the four golden signals (latency, traffic, errors, saturation), and distributed traces that show you where time went across service boundaries. The platform provides the plumbing. Services provide the signal.

The Architecture

One opinionated stack is better than three half-integrated options. Pick a lane:

graph TD
  S[Service] -->|OTLP| C[OpenTelemetry Collector]
  C -->|metrics| P[Prometheus / Mimir]
  C -->|traces| T[Tempo / Jaeger]
  C -->|logs| L[Loki]
  P --> G[Grafana]
  T --> G
  L --> G
  G --> A[Alert Manager]
  A --> PD[PagerDuty]

OpenTelemetry as the instrumentation layer means you're not locked to a specific backend. The collector handles routing. Product teams instrument once; the platform team changes backends without touching application code.
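A collector configuration implementing that routing might look roughly like the following sketch. The backend endpoints are placeholders for wherever your Mimir, Tempo, and Loki instances actually live, and the loki exporter assumes the contrib distribution of the collector.

# otel-collector-config.yaml — a minimal routing sketch; endpoints are placeholders
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]

Because services only ever speak OTLP to the collector, swapping any of these exporters is a change to this file, not to application code.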

The Platform Telemetry SDK

The platform team ships a thin wrapper over the OTel SDK that handles boilerplate: exporter configuration, resource attributes, propagation setup. Product engineers import one package and get instrumentation.

# platform_telemetry/__init__.py
# Product teams: from platform_telemetry import setup_telemetry
import os
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION
 
 
def setup_telemetry(service_name: str, service_version: str = "unknown") -> None:
    """
    One-call telemetry bootstrap. Reads OTEL_EXPORTER_OTLP_ENDPOINT from env.
    Falls back to localhost:4317 for local development.
    """
    resource = Resource.create({
        SERVICE_NAME: service_name,
        SERVICE_VERSION: service_version,
        "deployment.environment": os.getenv("ENVIRONMENT", "dev"),
        "team.name": os.getenv("TEAM_NAME", "unknown"),
    })
 
    # Traces
    otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
    trace_provider = TracerProvider(resource=resource)
    trace_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint))
    )
    trace.set_tracer_provider(trace_provider)
 
    # Metrics: attach a periodic reader so measurements are actually exported over OTLP
    metric_reader = PeriodicExportingMetricReader(OTLPMetricExporter(endpoint=otlp_endpoint))
    meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
    metrics.set_meter_provider(meter_provider)

# In the service's main.py — two lines to get full telemetry
from platform_telemetry import setup_telemetry
setup_telemetry(service_name="payments-api", service_version="2.4.1")

The SDK reads ENVIRONMENT and TEAM_NAME from environment variables injected by the platform's deployment tooling. No per-service configuration of exporters, no per-service resource attribute setup.
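Everything beyond the baseline is plain OpenTelemetry: grab a tracer and add spans around the business logic. A hypothetical example from a payments service (the function and attribute names are illustrative):

# Somewhere in the service's own code — custom spans use the provider set up by setup_telemetry()
from opentelemetry import trace

tracer = trace.get_tracer("payments-api")

def charge_card(order_id: str, amount_cents: int) -> None:
    # Spans created here are exported through the platform-configured OTLP pipeline
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("charge.amount_cents", amount_cents)
        ...  # call the payment gateway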

Baseline Metrics: The Four Golden Signals

Every HTTP service should expose these without any custom instrumentation, via middleware:

# platform_telemetry/http_middleware.py
import time
from opentelemetry import metrics
 
meter = metrics.get_meter("platform.http")
 
request_duration = meter.create_histogram(
    "http.server.request.duration",
    unit="s",
    description="HTTP request duration in seconds",
)
 
request_count = meter.create_counter(
    "http.server.request.count",
    description="Total HTTP requests",
)
 
def telemetry_middleware(app):
    """WSGI/ASGI middleware — wrap once, instrument everything."""
    async def middleware(scope, receive, send):
        if scope["type"] != "http":
            return await app(scope, receive, send)
 
        start = time.perf_counter()
        status_code = 200
 
        async def send_wrapper(message):
            nonlocal status_code
            if message["type"] == "http.response.start":
                status_code = message["status"]
            await send(message)
 
        await app(scope, receive, send_wrapper)
 
        duration = time.perf_counter() - start
        labels = {
            "method": scope["method"],
            "route": scope.get("path", "unknown"),
            "status_code": str(status_code),
        }
        request_duration.record(duration, labels)
        request_count.add(1, labels)
 
    return middleware

Include this middleware in the service scaffold (post 3). Every service that uses the template gets latency, traffic, and error rate automatically; saturation typically comes from runtime and infrastructure metrics rather than request middleware. A wiring sketch follows.
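In the scaffold, wiring it up is one line at startup. A sketch assuming a FastAPI service (the scaffold's actual framework may differ):

# main.py in the scaffolded service — hypothetical wiring, framework not prescribed
from fastapi import FastAPI

from platform_telemetry import setup_telemetry
from platform_telemetry.http_middleware import telemetry_middleware

setup_telemetry(service_name="payments-api", service_version="2.4.1")

api = FastAPI()

@api.get("/healthz")
async def healthz():
    return {"ok": True}

# Wrap the ASGI app once; every route is measured with no per-handler changes
app = telemetry_middleware(api)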

Dashboard-as-Code

Ship a Grafana dashboard JSON template alongside the SDK. Teams can import it with their service name and get a working dashboard immediately.

{
  "title": "Service Overview — {{service_name}}",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [{
        "expr": "sum(rate(http_server_request_count_total{service_name=\"{{service_name}}\"}[5m])) by (status_code)",
        "legendFormat": "{{status_code}}"
      }]
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "targets": [{
        "expr": "histogram_quantile(0.99, rate(http_server_request_duration_seconds_bucket{service_name=\"{{service_name}}\"}[5m]))",
        "legendFormat": "p99"
      }]
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "targets": [{
        "expr": "sum(rate(http_server_request_count_total{service_name=\"{{service_name}}\",status_code=~\"5..\"}[5m])) / sum(rate(http_server_request_count_total{service_name=\"{{service_name}}\"}[5m]))",
        "legendFormat": "error %"
      }]
    }
  ]
}

Automate dashboard provisioning in CI: when a catalog-info.yaml is detected, create the Grafana dashboard via the Grafana API.
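One way to implement that step is a small script the CI pipeline runs after detecting the catalog file: render the template and POST it to Grafana's dashboard API. This is a sketch; the environment variable names, file paths, and trigger logic are assumptions of the example.

# provision_dashboard.py — hypothetical CI step; URL, token, and file names are illustrative
import json
import os
import urllib.request

GRAFANA_URL = os.environ["GRAFANA_URL"]          # e.g. the org's Grafana base URL
GRAFANA_TOKEN = os.environ["GRAFANA_API_TOKEN"]  # service-account token with dashboard write access

def provision_dashboard(service_name: str, template_path: str = "dashboard-template.json") -> None:
    # Render the template by substituting the service name placeholder
    with open(template_path) as f:
        rendered = f.read().replace("{{service_name}}", service_name)

    payload = {
        "dashboard": json.loads(rendered),
        "overwrite": True,  # re-running CI updates the existing dashboard instead of failing
        "message": f"Provisioned by platform CI for {service_name}",
    }
    req = urllib.request.Request(
        f"{GRAFANA_URL}/api/dashboards/db",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {GRAFANA_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print("Grafana responded:", resp.status)

if __name__ == "__main__":
    provision_dashboard(os.environ["SERVICE_NAME"])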

Baseline Alert Rules

Don't wait for teams to write their first alert. Ship defaults that catch the obvious problems:

# platform-alerts/service-baseline.yaml
groups:
  - name: service.baseline
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_request_count_total{status_code=~"5.."}[5m])) by (service_name)
          /
          sum(rate(http_server_request_count_total[5m])) by (service_name)
          > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service_name }} error rate above 5%"
          runbook: "https://wiki.my-org.com/runbooks/high-error-rate"
 
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_server_request_duration_seconds_bucket[5m])) by (service_name, le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service_name }} p99 latency above 2s"
 
      - alert: ServiceDown
        expr: |
          up{job=~".*-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"

Teams can tighten these thresholds per service. The defaults should be conservative enough that they fire on real problems and permissive enough that they don't page on Tuesday morning noise.
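A per-service override is just another rule group that selects on service_name. The service name, threshold, and severity below are illustrative, not recommendations:

# payments-api tightens the latency alert; values here are illustrative
groups:
  - name: payments-api.overrides
    rules:
      - alert: HighLatencyPaymentsAPI
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_server_request_duration_seconds_bucket{service_name="payments-api"}[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "payments-api p99 latency above 500ms"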

Key Takeaways

  • Observability defaults mean every service is observable from its first deploy — no "first incident with a blank dashboard" tax.
  • A platform telemetry SDK wrapping OpenTelemetry handles all boilerplate (exporters, resource attributes, propagation) so product teams call one function and get instrumentation.
  • The four golden signals (latency, traffic, errors, saturation) should be instrumented by middleware in the scaffold, not by each service individually.
  • Dashboard-as-code templates provisioned automatically from catalog-info.yaml ensure every service has a useful starting dashboard without any manual work.
  • Baseline alert rules with conservative thresholds ship with the platform; teams tune them per service rather than starting from zero.
  • OpenTelemetry as the instrumentation standard prevents backend lock-in — swap Prometheus for Mimir, Jaeger for Tempo, without touching application code.