Observability in Depth

Logs: Structured, Sampled, Retained

Ravinder · 5 min read
Tags: Observability, Telemetry, Logging, Loki, OpenTelemetry

Most teams treat logging as a solved problem until the bill arrives. At a few thousand requests per second, unstructured logs with no sampling policy routinely generate terabytes per day — the majority of which is never queried. Paying for storage and indexing of DEBUG lines that nobody reads is a tax on poor instrumentation discipline.

This post covers the three levers that together make a log pipeline both useful and affordable: structure, sampling, and retention tiering.

The Cost Anatomy of a Log Pipeline

Before tuning anything, understand where money goes. In a typical Loki or OpenSearch deployment the breakdown looks like this:

pie title Log Pipeline Cost Distribution
    "Ingest bandwidth & processing" : 35
    "Index storage (hot)" : 30
    "Object storage (warm/cold)" : 20
    "Query compute" : 15

Ingest and hot index are the expensive layers. Both shrink dramatically when you emit less noise and when each log line carries enough structure to be useful without full-text search.

Structured Logging: JSON All the Way Down

A string like "2025-08-08 ERROR: payment failed for user 8823" requires regex to parse in your query layer. The equivalent JSON:

{
  "level": "error",
  "timestamp": "2025-08-08T14:22:01.003Z",
  "service": "payment-api",
  "trace_id": "a3ce929d0e0e4736",
  "user_id": "8823",
  "event": "payment_failed",
  "error_code": "CARD_DECLINED",
  "amount_cents": 4299,
  "duration_ms": 142
}

...is queryable in LogQL or OpenSearch DSL with a generic JSON parser instead of a hand-written regex per message format. Structuring a line once at write time is cheaper than re-parsing it on every query.
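
To make the difference concrete, here is a minimal Python sketch that pulls the user ID out of both forms of the line above. The regex and the helper names (`RAW_PATTERN`, `user_id_from_raw`, `user_id_from_json`) are illustrative assumptions, not part of any library.

```python
import json
import re

RAW_LINE = "2025-08-08 ERROR: payment failed for user 8823"
JSON_LINE = '{"level": "error", "event": "payment_failed", "user_id": "8823", "amount_cents": 4299}'

# Hypothetical pattern: one of these has to be written and maintained per message format.
RAW_PATTERN = re.compile(r"^(?P<date>\S+) (?P<level>[A-Z]+): .* for user (?P<user_id>\d+)$")


def user_id_from_raw(line):
    """Read-time parsing: returns None as soon as the wording of the message changes."""
    match = RAW_PATTERN.match(line)
    return match.group("user_id") if match else None


def user_id_from_json(line):
    """Write-time structure: the field is addressed by name, whatever the message says."""
    return json.loads(line).get("user_id")
```

Rewording the message to "user 8823 payment failed" breaks the regex version, while the JSON version keeps working as long as the `user_id` key is still emitted.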

Go with slog (stdlib since Go 1.21):

import (
    "context"
    "log/slog"
)
 
func handlePayment(ctx context.Context, req PaymentRequest) error {
    logger := slog.With(
        "service", "payment-api",
        "trace_id", traceIDFromCtx(ctx),
        "user_id", req.UserID,
    )
 
    if err := charge(req); err != nil {
        logger.ErrorContext(ctx, "payment_failed",
            "error_code", classifyError(err),
            "amount_cents", req.AmountCents,
            "duration_ms", req.Elapsed.Milliseconds(),
        )
        return err
    }
    return nil
}

Python with structlog:

import structlog
 
log = structlog.get_logger().bind(service="payment-api")
 
def handle_payment(request):
    bound = log.bind(trace_id=get_trace_id(), user_id=request.user_id)
    try:
        charge(request)
    except CardDeclined as e:
        bound.error("payment_failed",
                    error_code=e.code,
                    amount_cents=request.amount_cents)
        raise
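
Neither snippet emits JSON unless the logger is wired to a JSON handler or renderer. What that rendering step does can be sketched with the standard library alone. This is an illustration of the idea, not how slog or structlog implement it, and the `fields` convention and the `JsonFormatter` and `build_logger` names are assumptions made for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            "logger": record.name,
        }
        # Fields passed via extra={"fields": {...}} are merged in as top-level keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("payment-api")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.error("payment_failed", extra={"fields": {"error_code": "CARD_DECLINED", "amount_cents": 4299}})
line = json.loads(buffer.getvalue())
```

In a real service the stream would be stdout, and a maintained library is usually a better choice than a hand-rolled formatter.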

Sampling: What to Keep and What to Drop

Not every log line deserves to survive ingestion. A principled sampling strategy keeps signal while dramatically cutting volume:

flowchart TD
    Log["Log Record"] --> L{Level?}
    L -- ERROR/WARN --> Keep["Keep 100%"]
    L -- INFO --> Tr{Trace sampled?}
    Tr -- Yes --> Keep
    Tr -- No --> RS{Rate sample}
    RS -- 1-in-10 --> Keep
    RS -- 9-in-10 --> Drop["Drop"]
    L -- DEBUG --> Env{Env?}
    Env -- prod --> Drop
    Env -- staging --> Keep

The rule of thumb: errors and warnings are always kept, INFO follows trace sampling with a 1-in-10 sample of the remainder, and DEBUG from production never reaches storage.
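
The decision tree can be written down as a small function. The sketch below is one possible reading of the diagram, not a Collector component. The `should_keep` name, the `rate_pick` callable, and the handling of levels and environments the diagram does not mention are assumptions.

```python
def should_keep(level, env, trace_sampled, rate_pick):
    """Return True if a log record survives ingestion.

    level:         severity text such as "ERROR" or "info"
    env:           deployment environment, e.g. "prod" or "staging"
    trace_sampled: whether the record's trace was sampled
    rate_pick:     zero-argument callable that returns True for roughly 1 in 10 calls
    """
    level = level.upper()
    # FATAL is not in the diagram; it is assumed to behave like ERROR.
    if level in ("ERROR", "WARN", "FATAL"):
        return True
    if level == "INFO":
        # Logs of sampled traces are kept; the rest go through the rate sampler.
        return trace_sampled or rate_pick()
    if level == "DEBUG":
        # Only prod drops DEBUG; any other environment is treated like staging (assumption).
        return env != "prod"
    return False  # assumption: levels the diagram does not cover are dropped
```

Passing the rate sampler in as a callable keeps the function deterministic under test; in practice it could be something like `lambda: random.random() < 0.1`.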

Implement this at the OTel Collector layer so no application code needs to change:

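# OTel Collector processors for log sampling
processors:
  # Drop DEBUG/TRACE; deploy this pipeline in production only
  filter/drop_debug_prod:
    logs:
      exclude:
        match_type: strict
        severity_texts: ["DEBUG", "TRACE"]

  # Mark WARN and above so the sampler always keeps them.
  # An include-style filter here would drop every INFO record before sampling.
  transform/keep_errors:
    log_statements:
      - context: log
        statements:
          - set(attributes["sampling.priority"], 100) where severity_number >= SEVERITY_NUMBER_WARN

  probabilistic_sampler/info:
    hash_seed: 22
    sampling_percentage: 10
    attribute_source: traceID             # hash the record's trace ID
    sampling_priority: sampling.priority  # values >= 100 are always kept

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors:
        - filter/drop_debug_prod
        - transform/keep_errors
        - probabilistic_sampler/info
        - batch
      exporters: [loki]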

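Head-based sampling keyed on trace_id is deterministic: every log record from the same trace gets the same keep-or-drop decision. Provided the trace pipeline decides from the same trace ID with the same seed and percentage, a sampled trace retains its correlated logs, and you don't get traces without logs or vice versa.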
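
The property that matters here is determinism, which a hash of the trace ID provides. The sketch below illustrates the idea with SHA-256. It is not the Collector's actual hashing scheme, and `keep_by_trace_id` is a hypothetical helper.

```python
import hashlib


def keep_by_trace_id(trace_id, percentage, seed=22):
    """Deterministic sampling: the same trace_id and seed always give the same decision."""
    digest = hashlib.sha256(f"{seed}:{trace_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # percentage=10 keeps buckets 0..999, i.e. roughly 10% of all trace IDs.
    return bucket < percentage * 100
```

Any two components that call this with the same seed and percentage agree on every trace without having to coordinate, which is what keeps logs and traces correlated.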

Retention Tiering: Pay Only for What You Query

One retention policy for all logs is almost always wrong. Regulatory logs need 7 years; DEBUG output from a dev cluster needs 24 hours. Map log categories to tiers:

| Tier | Retention | Storage class | Example log types |
|---|---|---|---|
| Hot | 7 days | SSD / RAM index | Errors, warnings, audits |
| Warm | 30 days | HDD / compressed | INFO from sampled traces |
| Cold | 1 year+ | Object storage (S3) | Audit logs, compliance |
| Purge | 24–48 h | None (drop) | Debug, health-check noise |

In Loki, configure this with per-stream retention rules:

# Loki retention: the compactor enforces it, limits_config declares it
compactor:
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  # Loki 3.x also expects delete_request_store to be set when retention is enabled

limits_config:
  retention_period: 744h  # default: 31 days
  # Per-stream overrides; selectors match stream labels only,
  # and the highest priority wins when several selectors match
  retention_stream:
    - selector: '{level="debug"}'
      priority: 1
      period: 24h
    - selector: '{level=~"error|warn", env="production"}'
      priority: 2
      period: 2160h  # 90 days
    - selector: '{type="audit"}'
      priority: 3
      period: 61320h  # 7 years
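
The periods above are written in hours, which makes the long ones hard to check by eye. A small helper, sketched here under the assumption of 365-day years, reproduces the values used in the config. `days_to_hours` and `years_to_hours` are hypothetical names.

```python
HOURS_PER_DAY = 24


def days_to_hours(days):
    """Format a number of days as an hour-based duration string such as '744h'."""
    return f"{days * HOURS_PER_DAY}h"


def years_to_hours(years):
    """Same for years; assumes 365-day years, so leap days are ignored."""
    return days_to_hours(years * 365)
```

For example, `days_to_hours(31)` gives `744h` and `years_to_hours(7)` gives `61320h`, matching the default and audit periods.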

Enrichment Without Explosion

Enriching logs at the Collector — adding k8s labels, region, cluster name — is powerful but dangerous. Each unique label combination creates a new Loki stream. Cardinality rules for logs mirror those for metrics (covered in post 3): label values must be bounded.

Safe enrichment:

processors:
  resource/k8s:
    attributes:
      - action: upsert
        key: k8s.namespace.name
        from_attribute: namespace
      - action: upsert
        key: k8s.deployment.name
        from_attribute: app
      # DO NOT add pod name as a label — unbounded cardinality

Add pod_name to the log body, not as a Loki stream label. Query it with | json | pod_name="..." rather than {pod_name="..."}.
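
A quick back-of-the-envelope calculation shows why one unbounded label matters. The cluster sizes below are made up for illustration, and `stream_count` is a hypothetical helper that gives an upper bound, since not every label combination exists in practice.

```python
def stream_count(label_values):
    """Upper bound on streams: the product of the distinct values of each label."""
    total = 1
    for values in label_values.values():
        total *= len(values)
    return total


# Hypothetical cluster: 5 namespaces, 20 deployments, 500 pod names seen over time.
bounded = {
    "namespace": [f"ns-{i}" for i in range(5)],
    "deployment": [f"app-{i}" for i in range(20)],
}
with_pods = dict(bounded, pod_name=[f"pod-{i}" for i in range(500)])

bounded_streams = stream_count(bounded)      # 5 * 20
pod_streams = stream_count(with_pods)        # 5 * 20 * 500
```

With these assumed numbers the bound grows from 100 to 50,000 streams, and it keeps growing as pods are replaced on every rollout.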

Querying Effectively with LogQL

A structured pipeline pays off at query time. With JSON logs you can filter and aggregate without regex acrobatics:

# Error rate by service over last 5m
sum by (service) (
  rate({env="production"} | json | level="error" [5m])
)
 
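# P99 duration for payment failures
# Group by service so extracted fields such as trace_id do not split the result into many series
quantile_over_time(0.99,
  {service="payment-api"} | json | error_code != ""
  | unwrap duration_ms [10m]
) by (service)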

Key Takeaways

  • Unstructured logs are a cost center masquerading as an observability tool; JSON at the source eliminates read-time parsing tax.
  • Sampling must be trace-aware — dropping INFO logs that belong to a sampled trace breaks correlation and defeats the purpose of distributed tracing.
  • Retention tiers are not a storage optimization alone; they enforce hygiene by making you declare what each log category is worth.
  • Label cardinality in Loki must be governed the same way metric labels are governed — put high-cardinality values in the log body, not the stream selector.
  • OTel Collector is the right place to enforce sampling and enrichment policy; keep application code focused on emitting structured records.
  • A well-structured log pipeline can cut ingest volume by 60–80% with little loss of debuggability for production incidents, since errors, warnings, and the logs of sampled traces are all kept.