Observability in Depth

Continuous Profiling

Ravinder · 6 min read
Observability · Telemetry · Profiling · Pyroscope · Performance

Profiling is the signal most teams reach for last, because the mental model is "run a profiler locally, find the slow function, fix it, done." That model misses the category of performance problems that only manifest under real production load — the lock contention that appears at 500 concurrent users, the GC pressure spike every 15 minutes, the allocator hot path that warms up over four hours of traffic before degrading.

Continuous profiling closes this gap. It means profiling is always on, samples are stored alongside metrics and traces, and you can compare CPU, memory, and goroutine profiles across any two time windows — yesterday's deploy versus today, peak versus off-peak.
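To make the comparison concrete, here is a minimal, hand-rolled sketch of capturing one such window from a Go service's pprof endpoint (the hostname and port are placeholders for your own debug listener); a continuous profiler automates exactly this collect-store-label loop:

// fetch_profile.go — sketch: capture one 30-second CPU profile window and save it to disk
package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
    "time"
)

func main() {
    // Placeholder host/port; any service exposing net/http/pprof will do.
    resp, err := http.Get("http://checkout-api:6060/debug/pprof/profile?seconds=30")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Name the file by capture time so windows can be compared later.
    out, err := os.Create(fmt.Sprintf("cpu-%s.pb.gz", time.Now().Format("20060102-150405")))
    if err != nil {
        panic(err)
    }
    defer out.Close()

    if _, err := io.Copy(out, resp.Body); err != nil {
        panic(err)
    }
}

Saving two such windows and opening them with go tool pprof -diff_base is exactly the before/after comparison described above, just without the storage, retention, and labeling a profiler server gives you.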

How Continuous Profiling Fits the Signal Stack

flowchart TD
    Alert["Metric alert fires:\nP99 latency > 2s"] --> Trace["Trace: find slow service\n→ checkout-api"]
    Trace --> Profile["Profile: find hot function\n→ checkout-api CPU @ 14:23"]
    Profile --> Fix["Code change: reduce allocations\nin pricing engine"]
    Fix --> Metric["Metrics confirm:\nlatency back to baseline"]
    style Alert fill:#e74c3c,color:#fff
    style Profile fill:#2980b9,color:#fff

Profiles answer the question traces cannot: why is a function slow? A trace tells you pricing.Calculate took 800ms. A CPU profile taken during that window tells you 70% of that time was spent in json.Unmarshal deserializing a price catalogue that could be cached.
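If the profile shows json.Unmarshal dominating pricing.Calculate, the usual fix is to stop re-parsing the same bytes on every request. A minimal sketch of that kind of cache, with hypothetical Catalog and CachedCatalog names that do not come from any real service:

// pricing/catalog_cache.go — illustrative sketch: decode the price catalogue once, reuse the result
package pricing

import (
    "encoding/json"
    "sync"
)

type Catalog struct {
    Prices map[string]int64 `json:"prices"`
}

var (
    catalogOnce sync.Once
    cached      *Catalog
    cachedErr   error
)

// CachedCatalog parses raw exactly once; every later call returns the
// already-decoded struct, removing json.Unmarshal from the hot path.
func CachedCatalog(raw []byte) (*Catalog, error) {
    catalogOnce.Do(func() {
        var c Catalog
        cachedErr = json.Unmarshal(raw, &c)
        if cachedErr == nil {
            cached = &c
        }
    })
    return cached, cachedErr
}

sync.Once keeps the fast path lock-free after the first call; a versioned or TTL-based cache works the same way if the catalogue can change while the process is running.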

pprof in Production: What Is Safe

Go's net/http/pprof has been production-safe for years, but with caveats:

  • CPU profiling runs at 100Hz by default (1% overhead). Raising to 1000Hz adds ~5% CPU overhead — fine for brief sampling, risky if left on continuously.
  • Heap profiling is always on and cheap (allocations are sampled at the default rate of roughly one sample per 512 KB allocated).
  • Goroutine profiling can pause the runtime when goroutine counts are large; avoid frequent sampling if you run 100k+ goroutines.
  • Block/mutex profiling must be explicitly enabled and incurs non-trivial overhead. Enable it only for targeted investigations (a sketch follows at the end of this section).
// main.go — expose pprof with auth guard
import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on DefaultServeMux
)

func startDebugServer() {
    // pprof registers on DefaultServeMux; serve it from a dedicated mux so the
    // debug handlers never reach the application's public listener.
    mux := http.NewServeMux()
    // requireInternalAuth is your own middleware (mTLS, IP allow-list, etc.).
    mux.HandleFunc("/debug/pprof/", requireInternalAuth(http.DefaultServeMux.ServeHTTP))

    go func() {
        log.Println(http.ListenAndServe(":6060", mux))
    }()
}

Expose the debug port on an internal network only. Never route it through your public load balancer.
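The block and mutex profiles mentioned in the list above can be toggled at runtime, which keeps their overhead confined to the investigation window. A minimal sketch; the package and function names, sampling rates, and window length are all arbitrary choices:

// debugutil/contention.go — sketch: enable contention profiling for a bounded window only
package debugutil

import (
    "runtime"
    "time"
)

// EnableContentionProfiling samples blocking and mutex-contention events for
// the given window, then disables both profiles again.
func EnableContentionProfiling(window time.Duration) {
    runtime.SetBlockProfileRate(1)     // sample every blocking event (rate is in ns blocked)
    runtime.SetMutexProfileFraction(5) // report roughly 1 in 5 contention events

    time.AfterFunc(window, func() {
        runtime.SetBlockProfileRate(0)     // rate <= 0 turns block profiling off
        runtime.SetMutexProfileFraction(0) // 0 turns mutex profiling off
    })
}

While enabled, the samples show up at /debug/pprof/block and /debug/pprof/mutex on the same debug listener.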

Pyroscope: Always-On Pull Model

Pyroscope runs as a sidecar or central server. It pulls profiles from your pprof endpoints on a configurable interval (default 15s) and stores them in its own compressed format alongside labels that correlate with your metrics.

Go SDK push mode (lower latency than HTTP pull):

import "github.com/grafana/pyroscope-go"
 
func initPyroscope() {
    pyroscope.Start(pyroscope.Config{
        ApplicationName: "checkout-api",
        ServerAddress:   "http://pyroscope:4040",
        Logger:          pyroscope.StandardLogger,
        Tags: map[string]string{
            "env":     os.Getenv("ENV"),
            "version": buildVersion,
            "region":  os.Getenv("AWS_REGION"),
        },
        ProfileTypes: []pyroscope.ProfileType{
            pyroscope.ProfileCPU,
            pyroscope.ProfileAllocObjects,
            pyroscope.ProfileAllocSpace,
            pyroscope.ProfileInuseObjects,
            pyroscope.ProfileInuseSpace,
            pyroscope.ProfileGoroutines,
        },
    })
}

Python with py-spy (zero-code-change pull mode):

# docker-compose — Pyroscope pulls py-spy profiles from Python container
pyroscope:
  image: grafana/pyroscope:latest
  command:
    - "server"
  volumes:
    - pyroscope-data:/data
 
# Scrape config in pyroscope server config
scrape_configs:
  - job_name: "python-api"
    enabled_profiles:
      - profile_type: process_cpu
    targets:
      - targets: ["python-api:6060"]
        labels:
          service: "python-api"
          env: "production"

Parca: eBPF-Based System-Wide Profiling

Parca Agent uses eBPF to profile every process on a node without any application instrumentation. It resolves symbols from debug info and ships the profiles to the Parca server:

# parca-agent DaemonSet (Kubernetes)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: parca-agent
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: parca-agent
  template:
    metadata:
      labels:
        app.kubernetes.io/name: parca-agent
    spec:
      hostPID: true
      containers:
        - name: parca-agent
          image: ghcr.io/parca-dev/parca-agent:latest
          args:
            - /bin/parca-agent
            - --node=$(NODE_NAME)
            - --remote-store-address=parca.monitoring:7070
            - --remote-store-insecure
          env:
            # $(NODE_NAME) in the args above is resolved from this downward-API env var
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /sys/fs/bpf
              name: bpf-fs
            - mountPath: /sys/kernel/debug
              name: kernel-debug
      volumes:
        - name: bpf-fs
          hostPath:
            path: /sys/fs/bpf
        - name: kernel-debug
          hostPath:
            path: /sys/kernel/debug

Parca eBPF profiles give you CPU utilization broken down by function across the entire host — useful for finding hotspots in libraries, kernel calls, and processes you don't own the source for.

Reading Flame Graphs: What to Look For

A flame graph is a visualization of aggregated call stacks in which a frame's width is proportional to the share of samples (time) spent in that function and everything it calls. The actionable patterns:

graph TD
    Root["root (100%)"] --> A["http.Handler (95%)"]
    A --> B["pricing.Calculate (70%)"]
    A --> C["auth.Validate (20%)"]
    B --> D["json.Unmarshal (60%) ← WIDE PLATEAU = hot path"]
    B --> E["db.Query (8%)"]
    C --> F["crypto/rsa.Verify (18%) ← consider caching"]
  • Wide plateau at the top of a stack: the function is expensive in itself, not its callees. Optimization target.
  • Wide plateau in the middle: many different callers converge on this function. High-impact change if you optimize it.
  • Tall narrow stack: deep call chains consuming little total time. Usually not worth optimizing.
  • Flat spread across many functions: diffuse cost, often GC or syscall overhead. Look at the allocations profile instead (one common fix is sketched right after this list).
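For that last pattern, a frequent remedy is to pool scratch buffers so the hot allocation site stops feeding the garbage collector. A minimal sketch with illustrative package and function names:

// render/buffer_pool.go — sketch: reuse buffers to cut allocation rate when GC dominates the profile
package render

import (
    "bytes"
    "io"
    "sync"
)

var bufPool = sync.Pool{
    // New is called only when the pool is empty.
    New: func() any { return new(bytes.Buffer) },
}

// WriteResponse builds the response body in a pooled buffer instead of
// allocating a fresh one per request.
func WriteResponse(w io.Writer, header, payload []byte) error {
    buf := bufPool.Get().(*bytes.Buffer)
    defer func() {
        buf.Reset()      // keep the backing array, drop the contents
        bufPool.Put(buf) // return it for the next request
    }()

    buf.Write(header)
    buf.Write(payload)
    _, err := w.Write(buf.Bytes())
    return err
}

Whether pooling pays off is exactly what the before/after profile comparison is for: the allocations flame graph should show the hot site shrink, and the CPU graph should show less time in GC.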

Correlating Profiles with Traces

Both Pyroscope and Parca support linking profiles to traces by embedding trace/span IDs in the profiling labels. In Go:

// Tag the current profiling window with the active trace ID
ctx, span := tracer.Start(ctx, "pricing.Calculate")
defer span.End()
 
pyroscope.TagWrapper(ctx, pyroscope.Labels(
    "trace_id", span.SpanContext().TraceID().String(),
), func(ctx context.Context) {
    result = pricingEngine.Calculate(ctx, items)
})

Grafana's profile-trace correlation panel shows the flame graph filtered to the exact time window of a trace — turning a slow span into a clickable profile.

Setting Up a Continuous Profiling Alert

Profiles are not just for debugging — you can alert on them:

# Pyroscope-backed alerting rules (PromQL-compatible via Grafana)
groups:
  - name: profiling
    rules:
      - alert: CPUHotFunction
        expr: |
          sum by (function_name, service) (
            rate(pyroscope_cpu_seconds_total{env="production"}[5m])
          ) > 0.4
        for: 10m
        annotations:
          summary: "{{ $labels.function_name }} consuming >40% CPU in {{ $labels.service }}"

Key Takeaways

  • Continuous profiling captures performance regressions that only appear under real production load — local benchmarks and load tests miss them.
  • CPU profiling at 100Hz (Go default) adds ~1% overhead and is safe to run continuously; heap profiling is always-on and essentially free.
  • Pyroscope (SDK push) is the pragmatic choice for mixed-language stacks; Parca eBPF agent profiles the entire node without code changes, useful for infrastructure-level analysis.
  • Flame graph patterns — wide plateaus, deep vs. shallow stacks, diffuse spread — each suggest a different class of optimization.
  • Embedding trace_id in profiling labels enables profile-trace correlation: from a slow span you can jump to the flame graph captured during that exact window.
  • Profile data stored alongside metrics enables alerting on function-level CPU consumption, not just aggregate service CPU.