Logs: Structured, Sampled, Retained
Series: Observability in Depth

Most teams treat logging as a solved problem until the bill arrives. At a few thousand requests per second, unstructured logs with no sampling policy routinely generate terabytes per day, the majority of which is never queried. Paying for storage and indexing of DEBUG lines that nobody reads is a tax on poor instrumentation discipline.
This post covers the three levers that together make a log pipeline both useful and affordable: structure, sampling, and retention tiering.
The Cost Anatomy of a Log Pipeline
Before tuning anything, understand where the money goes. In a typical Loki or OpenSearch deployment, ingest and the hot index are the expensive layers: you pay to parse and index every line at write time, and again to keep recent data on fast storage. Both costs shrink dramatically when you emit less noise and when each log line carries enough structure to be useful without full-text search.
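To make "terabytes per day" concrete, here is a back-of-envelope sketch in Go. The request rate, lines per request, and line size are illustrative assumptions, not measurements from any real system:

```go
package main

import "fmt"

func main() {
	// Illustrative assumptions for a mid-sized service:
	const (
		reqPerSec    = 5000.0 // sustained request rate
		linesPerReq  = 10.0   // verbose INFO/DEBUG instrumentation
		bytesPerLine = 400.0  // typical structured log line
	)
	bytesPerDay := reqPerSec * linesPerReq * bytesPerLine * 86400
	fmt.Printf("raw ingest: %.2f TB/day\n", bytesPerDay/1e12) // ~1.73 TB/day
	// Drop DEBUG and sample INFO at 10% and this shrinks by roughly
	// an order of magnitude before it ever reaches the indexer.
}
```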
Structured Logging: JSON All the Way Down
A string like "2025-08-08 ERROR: payment failed for user 8823" requires regex to parse in your query layer. The equivalent JSON:
```json
{
  "level": "error",
  "timestamp": "2025-08-08T14:22:01.003Z",
  "service": "payment-api",
  "trace_id": "a3ce929d0e0e4736",
  "user_id": "8823",
  "event": "payment_failed",
  "error_code": "CARD_DECLINED",
  "amount_cents": 4299,
  "duration_ms": 142
}
```

...is queryable in LogQL or OpenSearch DSL without any parsing pipeline on read. Write-time structuring is always cheaper than read-time parsing.
Go with slog (stdlib since Go 1.21):
import "log/slog"
func handlePayment(ctx context.Context, req PaymentRequest) error {
logger := slog.With(
"service", "payment-api",
"trace_id", traceIDFromCtx(ctx),
"user_id", req.UserID,
)
if err := charge(req); err != nil {
logger.ErrorContext(ctx, "payment_failed",
"error_code", classifyError(err),
"amount_cents", req.AmountCents,
"duration_ms", req.Elapsed.Milliseconds(),
)
return err
}
return nil
}Python with structlog:
```python
import structlog

log = structlog.get_logger().bind(service="payment-api")

def handle_payment(request):
    bound = log.bind(trace_id=get_trace_id(), user_id=request.user_id)
    try:
        charge(request)
    except CardDeclined as e:
        bound.error("payment_failed",
                    error_code=e.code,
                    amount_cents=request.amount_cents)
        raise
```

Sampling: What to Keep and What to Drop
Not every log line deserves to survive ingestion. A principled sampling strategy keeps the signal while dramatically cutting volume. The rule of thumb: errors are always kept, INFO follows trace sampling, and DEBUG never reaches production.
Implement this at the OTel Collector layer so no application code needs to change:
```yaml
# otelcol log sampling: two pipelines so errors bypass the sampler
processors:
  # Drop DEBUG/TRACE at the door; they never reach production storage
  filter/drop_debug_prod:
    logs:
      exclude:
        match_type: strict
        severity_texts: ["DEBUG", "TRACE"]
  # Error pipeline: admit only WARN and above
  filter/errors_only:
    logs:
      include:
        match_type: strict
        severity_texts: ["WARN", "ERROR", "FATAL"]
  # Info pipeline: exclude WARN and above (they ship via logs/errors)
  filter/drop_errors:
    logs:
      exclude:
        match_type: strict
        severity_texts: ["WARN", "ERROR", "FATAL"]
  # Sample remaining records at 10%, keyed on the trace_id attribute
  probabilistic_sampler/info:
    hash_seed: 22
    sampling_percentage: 10
    attribute_source: record
    from_attribute: trace_id
  batch:

service:
  pipelines:
    logs/errors:   # errors are never sampled
      receivers: [otlp]
      processors: [filter/errors_only, batch]
      exporters: [loki]
    logs/info:     # INFO survives at 10%, consistently per trace
      receivers: [otlp]
      processors: [filter/drop_debug_prod, filter/drop_errors, probabilistic_sampler/info, batch]
      exporters: [loki]
```

The split matters: in a single pipeline, an include filter for errors would drop every INFO record before it reached the sampler. Head-based sampling tied to trace_id ensures that a sampled trace retains its correlated logs; you don't get traces without logs or vice versa.
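To see why keying the decision on trace_id keeps logs and traces correlated, here is a minimal Go sketch of the same idea; keepLog and the FNV hash are illustrative, not the Collector's exact algorithm:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// keepLog returns the same keep/drop decision for every record that
// shares a trace_id, so a kept trace keeps all of its logs.
func keepLog(traceID string, samplingPercent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID)) // Write on an FNV hash never returns an error
	return h.Sum32()%100 < samplingPercent
}

func main() {
	// Every call with the same trace_id yields the identical decision.
	fmt.Println(keepLog("a3ce929d0e0e4736", 10))
	fmt.Println(keepLog("a3ce929d0e0e4736", 10))
}
```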
Retention Tiering: Pay Only for What You Query
One retention policy for all logs is almost always wrong. Regulatory logs need 7 years; DEBUG output from a dev cluster needs 24 hours. Map log categories to tiers:
| Tier | Retention | Storage class | Example log types |
|---|---|---|---|
| Hot | 7 days | SSD / RAM index | Errors, warnings, audits |
| Warm | 30 days | HDD / compressed | INFO from sampled traces |
| Cold | 1 year+ | Object storage (S3) | Audit logs, compliance |
| Purge | 24–48 h | None (drop) | Debug, health-check noise |
In Loki, configure this with per-stream retention rules:
```yaml
# Loki config: per-stream retention, enforced by the compactor
limits_config:
  retention_period: 744h  # default for anything unmatched below (31 days)
  retention_stream:
    - selector: '{level="debug"}'
      priority: 1
      period: 24h
    - selector: '{level=~"error|warn", env="production"}'
      priority: 2
      period: 2160h   # 90 days
    - selector: '{type="audit"}'
      priority: 3
      period: 61320h  # 7 years

compactor:
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
```

Enrichment Without Explosion
Enriching logs at the Collector — adding k8s labels, region, cluster name — is powerful but dangerous. Each unique label combination creates a new Loki stream. Cardinality rules for logs mirror those for metrics (covered in post 3): label values must be bounded.
Safe enrichment:
```yaml
processors:
  resource/k8s:
    attributes:
      - key: k8s.namespace.name
        from_attribute: namespace
        action: upsert
      - key: k8s.deployment.name
        from_attribute: app
        action: upsert
      # DO NOT add pod name as a label: unbounded cardinality
```

Add pod_name to the log body, not as a Loki stream label, and query it with | json | pod_name="..." rather than {pod_name="..."}.
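At the application layer this means attaching the pod name as a record attribute so it lands in the JSON body. A minimal slog sketch, assuming POD_NAME is injected via the Kubernetes downward API:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// pod_name travels inside the JSON body of every record, not as a
	// Loki stream label, so stream cardinality stays bounded.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil)).With(
		"service", "payment-api",
		"pod_name", os.Getenv("POD_NAME"), // downward-API injection assumed
	)
	logger.Info("payment_succeeded", "amount_cents", 4299)
}
```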
Querying Effectively with LogQL
A structured pipeline pays off at query time. With JSON logs you can filter and aggregate without regex acrobatics:
```logql
# Error rate by service over the last 5m
sum by (service) (
  rate({env="production"} | json | level="error" [5m])
)

# P99 duration for payment failures
quantile_over_time(0.99,
  {service="payment-api"} | json | error_code != ""
    | unwrap duration_ms [10m]
)
```

Key Takeaways
- Unstructured logs are a cost center masquerading as an observability tool; JSON at the source eliminates read-time parsing tax.
- Sampling must be trace-aware — dropping INFO logs that belong to a sampled trace breaks correlation and defeats the purpose of distributed tracing.
- Retention tiers are not a storage optimization alone; they enforce hygiene by making you declare what each log category is worth.
- Label cardinality in Loki must be governed the same way metric labels are governed — put high-cardinality values in the log body, not the stream selector.
- OTel Collector is the right place to enforce sampling and enrichment policy; keep application code focused on emitting structured records.
- A well-structured log pipeline routinely cuts ingest volume by 60–80% while keeping the signal needed to debug production incidents.