Observability in Depth

From Three Pillars to Four

Ravinder · 5 min read
Observability · Telemetry · Events · OpenTelemetry

The "three pillars" framing — logs, metrics, traces — served us well when most systems were monolithic and a single grep could surface a bug. A decade of microservices, serverless, and event-driven architecture later, that model has a gap: discrete business events get shoehorned into log lines, stripped of context, and promptly lost in a flood of DEBUG noise.

This series starts by naming that gap and showing how promoting events to a first-class signal changes the way teams instrument, query, and act on telemetry.

Why Three Pillars Break Under Event-Driven Load

Consider an order-placement flow. A metric tells you orders-per-second. A trace tells you which service was slow. A log tells you INFO: order placed. None of them answer: which SKUs trigger payment failures at 3× the baseline rate on Tuesdays?

That's an event question. Events carry rich, structured context at the moment something meaningful happens — not sampled aggregates, not free-text descriptions. They are the raw material from which every other signal is derived.

Signal   Granularity   Cost per unit   Best for
Metric   Aggregate     Very low        Alerting, dashboards
Log      Per record    Medium          Debugging, free-text context
Trace    Per request   High            Latency root-cause
Event    Per action    Medium-high     Business behavior, funnels
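
To make "raw material" concrete: the Metric row above is just a fold over the event stream. A toy sketch in Go, with a stand-in Event type (the full schema comes later in this post):

import "time"

// Event is a minimal stand-in for the full schema defined later in this post.
type Event struct {
    EventType string
    Timestamp time.Time
}

// ordersPerSecond derives the orders-per-second metric from raw events:
// the same aggregation a TSDB performs at ingest, written out as a plain fold.
func ordersPerSecond(events []Event, window time.Duration) float64 {
    var n int
    for _, e := range events {
        if e.EventType == "order.placed" {
            n++
        }
    }
    return float64(n) / window.Seconds()
}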

The Four-Signal Model

flowchart LR
    App["Application"] --> E["Events\n(business actions)"]
    App --> L["Logs\n(free-text context)"]
    App --> M["Metrics\n(aggregates)"]
    App --> T["Traces\n(request journey)"]
    E --> EP["Event Pipeline\n(Kafka / Kinesis)"]
    L --> LP["Log Aggregator\n(Loki / OpenSearch)"]
    M --> MP["TSDB\n(Prometheus / Mimir)"]
    T --> TP["Trace Backend\n(Tempo / Jaeger)"]
    EP & LP & MP & TP --> Q["Unified Query Layer\n(Grafana / Honeycomb)"]

The key architectural decision is giving events their own pipeline. Routing them through your log aggregator works in a prototype but collapses under production cardinality. A dedicated stream (a Kafka topic, a Kinesis stream, or even a simple HTTP collector) gives you schema enforcement, independent retention, and replay.
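
As a sketch of the dedicated-stream option, here is what publishing to such a topic might look like, assuming the segmentio/kafka-go client and a pre-created telemetry.events topic (both illustrative choices, not prescriptions):

import (
    "context"
    "encoding/json"
    "time"

    "github.com/segmentio/kafka-go"
)

// Event mirrors the base schema defined in the next section.
type Event struct {
    SchemaVersion string         `json:"schema_version"`
    EventType     string         `json:"event_type"`
    Timestamp     time.Time      `json:"timestamp"`
    TraceID       string         `json:"trace_id"`
    Service       string         `json:"service"`
    Payload       map[string]any `json:"payload"`
}

// publish writes one event to the dedicated topic. Keying by event_type
// keeps each type ordered within its partition, which simplifies
// consumers that replay a single type.
func publish(ctx context.Context, w *kafka.Writer, e Event) error {
    b, err := json.Marshal(e)
    if err != nil {
        return err
    }
    return w.WriteMessages(ctx, kafka.Message{
        Key:   []byte(e.EventType),
        Value: b,
    })
}

// Writer setup, done once at service start:
//   w := &kafka.Writer{Addr: kafka.TCP("kafka:9092"), Topic: "telemetry.events"}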

Defining an Event Schema

Resist the temptation to emit raw JSON blobs. Agree on a base schema that every service uses:

{
  "schema_version": "1.0",
  "event_type": "order.placed",
  "timestamp": "2025-08-01T09:15:32.411Z",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "service": "checkout-api",
  "environment": "production",
  "actor": {
    "user_id": "usr_8823",
    "session_id": "sess_abc123"
  },
  "payload": {
    "order_id": "ord_99821",
    "total_cents": 4299,
    "sku_count": 3,
    "payment_method": "card"
  }
}

trace_id is non-negotiable. It is the join key that lets you pivot from an anomalous event cluster to the traces that explain what happened.
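
Stamping that join key onto events is a one-liner against the OTel trace API. A minimal sketch (the helper name is ours, not part of any SDK):

import (
    "context"

    "go.opentelemetry.io/otel/trace"
)

// currentTraceID returns the hex-encoded trace ID of the active span, or ""
// when no span is in flight (e.g. a cron job), letting the caller decide
// whether to emit the event without a join key.
func currentTraceID(ctx context.Context) string {
    sc := trace.SpanContextFromContext(ctx)
    if !sc.HasTraceID() {
        return ""
    }
    return sc.TraceID().String()
}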

Instrumenting with OpenTelemetry

OTel 1.x treats events as a sub-type of logs, reachable through the EventLogger API (experimental in most SDKs as of mid-2025, but stable in Java and Go):

// Go — OTel EventLogger
import (
    "context"
    "time"

    "go.opentelemetry.io/otel/log"
    "go.opentelemetry.io/otel/log/global"
)

func emitOrderPlaced(ctx context.Context, order Order) {
    // A named logger scopes every record to the emitting service.
    logger := global.GetLoggerProvider().Logger("checkout-api")

    record := log.Record{}
    record.SetTimestamp(time.Now())
    // The event name is what distinguishes this record from a plain log.
    record.SetEventName("order.placed")
    record.AddAttributes(
        log.String("order_id", order.ID),
        log.Int("total_cents", order.TotalCents),
        log.String("payment_method", order.PaymentMethod),
        log.String("user_id", order.UserID),
    )

    // Emit reads trace context from ctx, so trace_id comes along for free.
    logger.Emit(ctx, record)
}

For Python services, until the EventLogger API stabilizes, emit via the OTel Logs SDK with a structured body:

import json
from time import time_ns

import opentelemetry.sdk._logs as sdk_logs
from opentelemetry.sdk._logs.export import SimpleLogRecordProcessor, ConsoleLogExporter
from opentelemetry._logs import SeverityNumber
from opentelemetry.trace import get_current_span

# One-time provider setup; in a real service this lives in your telemetry
# bootstrap, and the exporter choice here is illustrative.
provider = sdk_logs.LoggerProvider()
provider.add_log_record_processor(SimpleLogRecordProcessor(ConsoleLogExporter()))

def emit_event(ctx, event_type: str, payload: dict):
    span_ctx = get_current_span(ctx).get_span_context()
    record = sdk_logs.LogRecord(
        timestamp=time_ns(),
        trace_id=span_ctx.trace_id,  # the cross-signal join key
        span_id=span_ctx.span_id,
        trace_flags=span_ctx.trace_flags,
        severity_number=SeverityNumber.INFO,
        body=json.dumps({"event_type": event_type, **payload}),
        attributes={"event.name": event_type},
    )
    provider.get_logger(__name__).emit(record)
Routing Events Through an OTel Collector

The Collector is the right place to fan out: keep the general log flow going to your aggregator while routing event records, full payload intact, to the dedicated event stream.

# otelcol-config.yaml — event routing
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
 
processors:
  filter/events_only:
    logs:
      include:
        match_type: regexp
        record_attributes:
          - key: event.name
            value: ".*"
 
  batch:
    timeout: 5s
    send_batch_size: 1000
 
exporters:
  kafka/events:
    brokers: ["kafka:9092"]
    topic: "telemetry.events"
    encoding: otlp_json
 
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
 
service:
  pipelines:
    logs/events:
      receivers: [otlp]
      processors: [filter/events_only, batch]
      exporters: [kafka/events]
    logs/general:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]

Correlating Events with Traces in Grafana

Once events land in a queryable store, you want one-click correlation. In Grafana, configure a derived field on your event datasource pointing to the trace backend:

{
  "name": "TraceID",
  "matcherType": "label",
  "matcherRegex": "trace_id",
  "url": "${__value.raw}",
  "urlDisplayLabel": "Open in Tempo",
  "datasourceUid": "tempo-prod"
}

Now every event row in Explore shows an "Open in Tempo" link that jumps directly to the corresponding request trace.

Schema Governance: Preventing Drift

The biggest operational risk with a fourth signal is schema sprawl. Two practices keep it manageable:

  1. Schema registry — Register every event_type in Confluent Schema Registry (or a lightweight alternative like buf). Producers that emit unknown fields get rejected at the Collector layer; a minimal in-process version of that check is sketched after this list.
  2. Event catalog — A simple Git-tracked YAML file that documents what each event type means, who owns it, and its retention tier. Treat it like an API contract.
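
In practice the registry check can start small. A hedged, in-process sketch (the field sets are illustrative, and a real deployment would load them from the registry rather than hard-code them):

import "fmt"

// registered maps each known event_type to its allowed payload fields.
// Hard-coded here for illustration; in production it would be loaded
// from the schema registry.
var registered = map[string]map[string]bool{
    "order.placed": {
        "order_id": true, "total_cents": true,
        "sku_count": true, "payment_method": true,
    },
}

// validate rejects unknown event types and unregistered fields, mirroring
// the enforcement described above.
func validate(eventType string, payload map[string]any) error {
    allowed, ok := registered[eventType]
    if !ok {
        return fmt.Errorf("unknown event_type %q", eventType)
    }
    for field := range payload {
        if !allowed[field] {
            return fmt.Errorf("unregistered field %q on %s", field, eventType)
        }
    }
    return nil
}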

Key Takeaways

  • Logs, metrics, and traces cannot answer business-behavior questions at the granularity modern systems require — events fill that gap.
  • Events need a dedicated pipeline; routing them through log aggregators creates cardinality and cost problems.
  • trace_id embedded in every event is the join key for cross-signal correlation.
  • The OTel EventLogger API (stable in Go/Java) is the forward-looking instrumentation path; structured log emission is a workable interim for other runtimes.
  • Schema governance — a registry plus a catalog — is not optional; it is the difference between a queryable event store and a JSON graveyard.
  • Grafana derived fields make event-to-trace pivot a one-click operation rather than a manual trace-ID copy-paste.