Go for JVM Engineers

Profiling

Ravinder · 6 min read
Go · JVM · Java · Profiling · pprof · Performance

JVM engineers have excellent profiling tools: JFR (Java Flight Recorder), async-profiler, and VisualVM for live heap inspection. Go's equivalent toolchain ships in the standard library and is wired directly into production services via an HTTP endpoint. The philosophical difference is that Go's profiler is designed to be always-on at low overhead, not something you attach in an emergency.

This post maps JFR/async-profiler workflows to pprof and the execution tracer.

The Profiling Primitives

Go exposes five built-in profiles (also queryable in code — see the sketch after the table):

| Profile | Closest JVM tool | What it shows |
|---|---|---|
| CPU (profile) | async-profiler CPU mode | Which functions consume CPU time |
| Heap (heap) | JFR heap statistics | Live heap objects and allocation sites |
| Goroutine (goroutine) | JFR thread dump | Stack traces of all goroutines |
| Mutex (mutex) | JFR monitor events | Lock contention points |
| Block (block) | JFR park events | Where goroutines block on channels and sync primitives |
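
The same registry can be inspected programmatically via runtime/pprof — a minimal sketch (the CPU profile is separate, driven by pprof.StartCPUProfile, so it does not appear here):

package main

import (
    "fmt"
    "runtime/pprof"
)

func main() {
    // List every named profile registered with the runtime
    // (heap, goroutine, mutex, block, threadcreate, ...).
    for _, p := range pprof.Profiles() {
        fmt.Printf("%-14s count=%d\n", p.Name(), p.Count())
    }
}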

Enabling the pprof Endpoint

For a running service, import net/http/pprof as a side-effect. It registers /debug/pprof/ endpoints on the default mux.

import _ "net/http/pprof"
 
// If the service doesn't already serve DefaultServeMux,
// start a server for it on an internal port:
go func() {
    log.Println(http.ListenAndServe("localhost:6060", nil))
}()

Gate this behind a non-public address or require authentication before shipping to production. The /debug/pprof/ endpoint is as sensitive as a heap dump.
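
If you'd rather not touch DefaultServeMux at all, the handlers can be registered on a dedicated internal mux — a sketch, assuming loopback-only access on port 6060:

package main

import (
    "log"
    "net/http"
    "net/http/pprof"
)

func main() {
    // A dedicated mux keeps pprof off the public server entirely.
    mux := http.NewServeMux()
    mux.HandleFunc("/debug/pprof/", pprof.Index)
    mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
    mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
    mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    mux.HandleFunc("/debug/pprof/trace", pprof.Trace)

    // Loopback only — reach it via SSH tunnel or a sidecar.
    log.Fatal(http.ListenAndServe("localhost:6060", mux))
}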

CPU Profile — The 30-Second Sample

# Collect 30 seconds of CPU profile
curl -o cpu.prof "http://localhost:6060/debug/pprof/profile?seconds=30"
 
# Explore interactively
go tool pprof cpu.prof
 
# Open the web UI in a browser (flamegraph works out of the box; Graphviz is needed for the graph view)
go tool pprof -http=:8081 cpu.prof

The web UI shows a flamegraph where each box is a function and the width represents time spent. Hot paths glow wide — exactly like async-profiler's flamegraph output.

flowchart TD
    A[Running Service\n:6060/debug/pprof] -->|curl| B[cpu.prof\n30-second sample]
    B -->|go tool pprof -http| C[Browser Flamegraph]
    B -->|go tool pprof| D[Interactive CLI\ntop / list / web]
    C --> E{Hot Function\nidentified}
    D --> E
    E -->|fix| F[Optimise hot path]
    F -->|re-profile| A

Heap Profile — Finding Allocation Hotspots

# Heap profile (live objects + allocation sites)
curl -o heap.prof "http://localhost:6060/debug/pprof/heap"
 
go tool pprof -http=:8081 heap.prof

In the web UI, switch between inuse_space (currently live memory), inuse_objects (live object count), alloc_space (total bytes allocated since start), and alloc_objects (total allocations since start).

alloc_space is most useful for finding allocation hot paths — it shows which code is generating the most GC pressure, even if those objects are already collected.

// JFR equivalent: a recording with allocation events enabled (jdk.jfr API)
Recording recording = new Recording();
recording.enable("jdk.ObjectAllocationInNewTLAB");
recording.start();
// ... dump the recording and analyse it in JMC
// Go: heap profile over HTTP, no JVM agent needed
// go tool pprof http://localhost:6060/debug/pprof/heap
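
On the Go side the same snapshot can also be written straight to a file, with no HTTP server at all — useful for batch jobs. A minimal sketch using runtime/pprof:

package main

import (
    "log"
    "os"
    "runtime"
    "runtime/pprof"
)

func main() {
    f, err := os.Create("heap.prof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Run a GC first so the heap statistics are up to date.
    runtime.GC()
    if err := pprof.Lookup("heap").WriteTo(f, 0); err != nil {
        log.Fatal(err)
    }
}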

Goroutine Profile — The Thread Dump Equivalent

# All goroutine stacks (like a JVM thread dump)
curl "http://localhost:6060/debug/pprof/goroutine?debug=2"
 
# Or in the pprof tool
go tool pprof -http=:8081 "http://localhost:6060/debug/pprof/goroutine"

Goroutine leaks — goroutines that are blocked and never complete — show up as a growing goroutine count. Look for stacks stuck in chan receive, select, or sync.Mutex.Lock with no corresponding sender.
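
For illustration, a hypothetical leak that this profile exposes — goroutines stuck forever on a channel send, visible as an ever-growing pile of chan send stacks:

package main

import (
    "fmt"
    "runtime"
    "time"
)

// leak starts a goroutine that blocks forever: it sends on an
// unbuffered channel that nothing ever receives from.
func leak() {
    ch := make(chan int)
    go func() {
        ch <- 42 // appears as "chan send" in the goroutine profile
    }()
}

func main() {
    for i := 0; i < 100; i++ {
        leak()
    }
    time.Sleep(100 * time.Millisecond)
    fmt.Println("goroutines:", runtime.NumGoroutine()) // grows with every leak()
}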

Block and Mutex Profiles

These are off by default because sampling every block/lock event is expensive. Enable them at startup:

runtime.SetBlockProfileRate(1)   // sample every blocking event
runtime.SetMutexProfileFraction(1) // sample every mutex contention event

In production, use a higher fraction value (e.g., 5 or 10) so that only one in five or one in ten contention events is sampled — roughly 20% or 10% — keeping overhead negligible.

curl -o mutex.prof "http://localhost:6060/debug/pprof/mutex"
go tool pprof -http=:8081 mutex.prof
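
To see what ends up in that profile, here is a deliberately contended lock — a hedged sketch; the goroutine and loop counts are arbitrary:

package main

import (
    "runtime"
    "sync"
)

func main() {
    runtime.SetMutexProfileFraction(5) // sample ~20% of contention events

    var mu sync.Mutex
    var wg sync.WaitGroup
    counter := 0
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 1_000_000; j++ {
                mu.Lock() // eight goroutines fighting over one lock
                counter++
                mu.Unlock()
            }
        }()
    }
    wg.Wait()
    // With the pprof endpoint enabled, this lock shows up as the
    // contention point in /debug/pprof/mutex.
}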

The Execution Tracer — Go's JFR

The CPU profiler samples at ~100 Hz. The execution tracer records every scheduler event at nanosecond resolution: goroutine creation, blocking, unblocking, GC start/stop, network I/O. It is Go's equivalent of JFR with all event types enabled.
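
For a binary without the HTTP endpoint — a CLI tool, say — the tracer can be driven directly from runtime/trace; a minimal sketch:

package main

import (
    "log"
    "os"
    "runtime/trace"
    "time"
)

func main() {
    f, err := os.Create("trace.out")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Record every scheduler event between Start and Stop.
    if err := trace.Start(f); err != nil {
        log.Fatal(err)
    }
    defer trace.Stop()

    time.Sleep(5 * time.Second) // stand-in for the real workload
}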

# Collect 5 seconds of trace
curl -o trace.out "http://localhost:6060/debug/pprof/trace?seconds=5"
 
# Open in the trace viewer
go tool trace trace.out

The trace viewer shows:

  • Goroutine lifetimes on a timeline
  • Processor utilisation (P0–PN)
  • GC mark and sweep phases
  • Network poller wake-ups

This is invaluable for diagnosing latency spikes that do not show up in CPU profiles — a 50 ms tail latency caused by a GC pause is invisible in a CPU flamegraph but obvious in the execution trace.
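
To make such spikes easy to locate in the viewer, latency-sensitive spans can be annotated with tasks and regions from runtime/trace (Go 1.11+). A sketch — handleRequest, decode, and store are hypothetical stand-ins:

package main

import (
    "context"
    "runtime/trace"
)

func decode() {}
func store()  {}

func handleRequest(ctx context.Context) {
    // A task groups all regions belonging to one logical request.
    ctx, task := trace.NewTask(ctx, "handleRequest")
    defer task.End()

    // Regions show up as named spans on the goroutine's timeline.
    r := trace.StartRegion(ctx, "decode")
    decode()
    r.End()

    r = trace.StartRegion(ctx, "store")
    store()
    r.End()
}

func main() {
    handleRequest(context.Background())
}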

Profiling in Tests

Benchmarks can emit profiles directly:

# Profile flags require a single package, not ./...
go test -bench=BenchmarkEncode -cpuprofile=cpu.prof -memprofile=mem.prof .
go tool pprof -http=:8081 cpu.prof

This is the cleanest way to profile a specific code path without the noise of a full service.
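
For reference, a representative benchmark — BenchmarkEncode here is hypothetical, standing in for whatever hot path you are profiling:

package codec

import (
    "encoding/json"
    "testing"
)

type payload struct {
    ID   int
    Name string
}

// BenchmarkEncode is a stand-in for the code path under test.
func BenchmarkEncode(b *testing.B) {
    p := payload{ID: 1, Name: "example"}
    b.ReportAllocs() // report allocations alongside ns/op
    for i := 0; i < b.N; i++ {
        if _, err := json.Marshal(p); err != nil {
            b.Fatal(err)
        }
    }
}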

Continuous Profiling

For production, consider continuous profiling rather than on-demand sampling. The Pyroscope open-source project and Google Cloud Profiler both support Go with the pprof wire format. This is the Go equivalent of JFR streaming in JDK 14+.

import (
    "log"

    "github.com/grafana/pyroscope-go"
)
 
// Start ships profiles continuously in the pprof wire format.
_, err := pyroscope.Start(pyroscope.Config{
    ApplicationName: "myservice",
    ServerAddress:   "http://pyroscope:4040",
    ProfileTypes: []pyroscope.ProfileType{
        pyroscope.ProfileCPU,
        pyroscope.ProfileAllocObjects,
        pyroscope.ProfileAllocSpace,
        pyroscope.ProfileInuseObjects,
        pyroscope.ProfileInuseSpace,
    },
})
if err != nil {
    log.Fatal(err)
}

Benchmark Comparison with benchstat

After optimising, use benchstat to confirm the improvement is statistically significant:

# Run the benchmark 10 times before and after the change
go test -bench=BenchmarkEncode -count=10 ./... > before.txt
# make the change
go test -bench=BenchmarkEncode -count=10 ./... > after.txt
benchstat before.txt after.txt

benchstat reports the mean, the run-to-run variation, and a p-value indicating whether the delta is statistically significant — the equivalent of JMH's confidence intervals.

Key Takeaways

  • pprof endpoints are built into net/http/pprof and safe to leave in production binaries behind an internal address — they are always-on at negligible overhead.
  • The CPU flamegraph (go tool pprof -http) is the direct equivalent of async-profiler's flamegraph and identifies hot functions in seconds.
  • alloc_space in the heap profile finds GC pressure hot paths even for objects that have already been collected — prefer this over inuse_space when diagnosing throughput regressions.
  • The execution tracer (go tool trace) records every scheduler event at nanosecond resolution — use it to diagnose tail latency, GC pause impact, and goroutine scheduling delays that CPU profiles miss.
  • Enable block and mutex profiling with a sampling fraction in production to find lock contention without measurable overhead.
  • Use benchstat to compare benchmark runs with statistical confidence before declaring an optimisation successful.