JFR-Driven Performance Work: Flight Recorder for the Rest of Us
The Profiler You Already Have
Every OpenJDK-based JVM since JDK 11 ships with Java Flight Recorder. It is designed for always-on use, with less than 1% overhead in its default configuration, and it captures more useful signal than most teams know how to consume. Despite this, most Java shops reach for APM dashboards, jstack dumps, or heap profilers before ever opening a .jfr file.
That is a mistake. JFR captures a continuous, structured event stream covering GC behavior, JIT compilation decisions, thread scheduling, lock contention, I/O, and application-defined events — all with timestamps accurate to the microsecond. No other tool gives you this picture without instrumentation overhead that changes what you are measuring.
This article is about making JFR practical: enabling it safely, finding signal in the noise, writing custom events for your domain, and connecting observations to actual performance improvements.
Enabling JFR Without Fear
The concern teams have about enabling JFR in production is understandable but mostly unfounded. The default profile is designed for continuous operation.
Option 1: JVM flags at startup (preferred)
java \
-XX:StartFlightRecording=filename=/var/log/app/startup.jfr,\
duration=60s,\
settings=default \
-XX:FlightRecorderOptions=stackdepth=64,\
maxchunksize=12m \
-jar application.jar
This captures a 60-second recording on startup, which is useful for catching initialization issues. For continuous recording:
java \
-XX:StartFlightRecording=\
name=continuous,\
filename=/var/log/jfr/recording-%t.jfr,\
maxsize=500m,\
maxage=24h,\
settings=profile \
-jar application.jar
maxsize and maxage together create a rolling buffer. The JVM writes chunks and evicts old data automatically. On a busy service this produces roughly 100-300 MB per hour, depending on event density.
Option 2: Attach to a running process
# list running JVMs
jcmd
# start recording on a running process
jcmd <pid> JFR.start name=incident duration=120s filename=/tmp/incident.jfr settings=profile
# check status
jcmd <pid> JFR.check
# dump current data without stopping
jcmd <pid> JFR.dump filename=/tmp/snapshot.jfr
# stop recording
jcmd <pid> JFR.stop name=incident
This is the workflow for incident response. You noticed something wrong, you want to capture two minutes of detailed data, and you do not want to restart the JVM.
Settings: default vs profile
| Setting | Overhead | Stack depth | Ideal for |
|---|---|---|---|
| default | < 1% | 64 frames | Continuous production recording |
| profile | 2-3% | 128 frames | Incident investigation, performance analysis |
The Events That Matter
JFR captures hundreds of event types. These are the ones worth looking at first.
GC: jdk.GarbageCollection, jdk.GCHeapSummary
Look at pause duration and frequency together. Short pauses at high frequency often indicate allocation pressure: you are producing garbage faster than G1's young collections can reclaim it. Long, infrequent pauses point to survivor-space or old-generation issues.
jdk.GarbageCollection {
gcId = 142
name = "G1New"
cause = "G1 Evacuation Pause"
duration = 87.2 ms ← this is the number you watch
...
}
A threshold worth setting: alert if any GC pause exceeds 200 ms or if GC is consuming more than 5% of wall-clock time.
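That check can also be scripted against a recording with the jdk.jfr.consumer API instead of eyeballing charts in JMC. A minimal sketch, with an illustrative recording path and the thresholds from above:
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class GcOverheadCheck {
    public static void main(String[] args) throws Exception {
        // Path is illustrative; point it at one of your own recordings
        List<RecordedEvent> events = RecordingFile.readAllEvents(Path.of("/tmp/snapshot.jfr"));
        if (events.isEmpty()) return;
        Duration gcTime = Duration.ZERO;
        long pausesOver200ms = 0;
        for (RecordedEvent e : events) {
            if (e.getEventType().getName().equals("jdk.GarbageCollection")) {
                gcTime = gcTime.plus(e.getDuration());
                if (e.getDuration().toMillis() > 200) pausesOver200ms++;
            }
        }
        // Approximate wall-clock span from the first and last event timestamps
        Duration wall = Duration.between(events.get(0).getStartTime(),
                events.get(events.size() - 1).getStartTime());
        System.out.printf("GC time %d ms (%.1f%% of wall clock), pauses over 200 ms: %d%n",
                gcTime.toMillis(),
                100.0 * gcTime.toMillis() / Math.max(1, wall.toMillis()),
                pausesOver200ms);
    }
}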
Locking: jdk.JavaMonitorBlocked
This is the event that reveals lock contention before it becomes obvious in thread dumps. It fires when a thread blocks trying to acquire a monitor.
jdk.JavaMonitorBlocked {
monitorClass = java.util.LinkedList (classLoader = app)
blockedTime = 4.2 ms
stackTrace = [
com.acme.cache.LocalCache.get(LocalCache.java:87)
com.acme.service.UserService.findById(UserService.java:142)
...
]
}
If you see the same monitorClass appearing repeatedly with long blockedTime values, you have a contention problem. The stack trace tells you which code is responsible.
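To check whether one monitor dominates, you can total blocked time per monitor class across a recording. A rough sketch with the same consumer API; the path is illustrative and the field name follows the printout above:
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import jdk.jfr.consumer.RecordedClass;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class LockContentionSummary {
    public static void main(String[] args) throws Exception {
        Map<String, Long> blockedMillisByMonitor = new HashMap<>();
        for (RecordedEvent e : RecordingFile.readAllEvents(Path.of("/tmp/incident.jfr"))) {
            if (e.getEventType().getName().equals("jdk.JavaMonitorBlocked")) {
                RecordedClass monitor = e.getValue("monitorClass");
                blockedMillisByMonitor.merge(monitor.getName(), e.getDuration().toMillis(), Long::sum);
            }
        }
        // Print the ten monitor classes with the most accumulated blocked time
        blockedMillisByMonitor.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .forEach(entry -> System.out.println(entry.getKey() + " blocked " + entry.getValue() + " ms"));
    }
}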
I/O: jdk.SocketRead, jdk.SocketWrite
These events capture every network read and write with the remote address, bytes transferred, and duration. Sorting by duration reveals slow external calls that your APM may be averaging away.
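The same pattern pulls the slowest reads out of a recording programmatically. A sketch, assuming the event's usual field names (host, port, bytesRead) and an illustrative path:
import java.nio.file.Path;
import java.util.Comparator;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class SlowSocketReads {
    public static void main(String[] args) throws Exception {
        RecordingFile.readAllEvents(Path.of("/tmp/snapshot.jfr")).stream()
                .filter(e -> e.getEventType().getName().equals("jdk.SocketRead"))
                // Longest reads first; these are the external calls your APM averages away
                .sorted(Comparator.comparing(RecordedEvent::getDuration).reversed())
                .limit(20)
                .forEach(e -> System.out.printf("%s:%d took %d ms (%d bytes)%n",
                        e.getString("host"), e.getInt("port"),
                        e.getDuration().toMillis(), e.getLong("bytesRead")));
    }
}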
JIT: jdk.Deoptimization
Deoptimization events fire when code the JIT compiled under speculative assumptions has to be discarded and execution falls back to the interpreter. Frequent deoptimizations are a sign of polymorphic call sites or type assumptions being violated. In most applications you can ignore this; in hot loops it matters significantly.
Java Mission Control
JFR files are most effectively analyzed in Java Mission Control (JMC), freely available from adoptium.net (as Eclipse Mission Control) or as a separate download from Oracle.
The automated analysis rules are the best starting point. JMC runs a set of heuristics against the recording and surfaces findings like "GC overhead is 8% of wall time" or "Thread X was blocked 34% of the time on monitor Y." These findings have severity ratings and links to the underlying events.
Workflow for a new recording:
- Open recording in JMC
- Review the Automated Analysis findings — start with High severity
- For GC issues, go to the Memory view and plot heap usage, pause duration, and allocation rate together
- For CPU issues, open the Method Profiling view and examine the flame graph
- For contention issues, open Thread > Thread States and look for prolonged BLOCKED states
- Use Event Browser to filter specific event types when the views do not give enough detail
Custom Events: Instrumenting Your Domain
The most powerful JFR capability that almost nobody uses is custom events. You can define application-level events that flow into the JFR stream alongside JVM events, with zero overhead when the recording is not active.
@Name("com.acme.order.OrderProcessed")
@Label("Order Processed")
@Category({"Business", "Orders"})
@StackTrace(false) // skip stack trace for high-frequency events
public class OrderProcessedEvent extends Event {
@Label("Order ID")
public String orderId;
@Label("Customer Tier")
public String customerTier;
@Label("Item Count")
public int itemCount;
@Label("Total Amount Cents")
public long totalAmountCents;
@Label("Fulfillment Region")
public String region;
}
Usage in service code:
public OrderResult processOrder(Order order) {
OrderProcessedEvent event = new OrderProcessedEvent();
event.begin();
try {
OrderResult result = executeOrderProcessing(order);
event.orderId = order.getId();
event.customerTier = order.getCustomer().getTier();
event.itemCount = order.getItems().size();
event.totalAmountCents = result.getTotalCents();
event.region = result.getFulfillmentRegion();
return result;
} finally {
event.commit(); // no-op if JFR recording is not active
}
}
The event.commit() call checks whether JFR is active before writing. If no recording is running, the call is effectively free: a single, well-predicted branch. This means you can leave custom events in production code permanently and enable them only when you need them.
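If populating the event itself involves work you would rather skip when nothing is recording, the Event API also provides isEnabled() and shouldCommit() to guard it. A sketch of that variant; summarizeForAudit() is a hypothetical helper, not part of the example above:
public OrderResult processOrder(Order order) {
    OrderProcessedEvent event = new OrderProcessedEvent();
    event.begin();
    try {
        return executeOrderProcessing(order);
    } finally {
        event.end();
        if (event.shouldCommit()) {
            // Only pay for attribute population when the event will actually be recorded
            event.orderId = order.getId();
            event.customerTier = order.getCustomer().getTier();
            event.itemCount = order.getItems().size();
            // summarizeForAudit() is hypothetical; it stands in for any costly derived attribute
            event.region = summarizeForAudit(order);
            event.commit();
        }
    }
}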
Correlating business events with JVM events is where custom events become genuinely powerful. If you can see that GC pauses correlate with spikes in OrderProcessed events for customerTier=premium, you have narrowed a performance investigation from "the JVM is slow sometimes" to "premium order processing causes allocation pressure."
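One way to watch that correlation live, without shipping .jfr files around, is the event streaming API available since JDK 14. A minimal in-process sketch; the threshold and the printed fields are illustrative:
import java.time.Duration;
import jdk.jfr.consumer.RecordingStream;

public class OrderGcWatcher {
    public static void main(String[] args) {
        // Must run inside the same JVM that emits the custom OrderProcessed events
        try (RecordingStream rs = new RecordingStream()) {
            rs.enable("com.acme.order.OrderProcessed");
            rs.enable("jdk.GarbageCollection").withThreshold(Duration.ofMillis(200));
            rs.onEvent("com.acme.order.OrderProcessed", e ->
                    System.out.println(e.getStartTime() + " order tier=" + e.getString("customerTier")));
            rs.onEvent("jdk.GarbageCollection", e ->
                    System.out.println(e.getStartTime() + " GC pause " + e.getDuration().toMillis() + " ms"));
            rs.start(); // blocks; use startAsync() inside a service
        }
    }
}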
Real Performance Wins From JFR Data
Win 1: Eliminated a string concatenation hot path
JFR method profiling showed StringBuilder.append() consuming 12% of CPU in a report generation path. Investigation revealed a loop building SQL IN clauses by concatenation. Replaced with String.join() on a pre-built list. CPU dropped 9%.
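The shape of that change, with illustrative names rather than the original code:
import java.util.List;

class InClauseExample {
    // Before: repeated concatenation allocates and copies a new String on every pass
    static String inClauseConcat(List<String> ids) {
        String clause = "";
        for (String id : ids) {
            clause += "'" + id + "',";
        }
        return clause.isEmpty() ? clause : clause.substring(0, clause.length() - 1);
    }

    // After: one pass over the pre-built list with String.join
    static String inClauseJoin(List<String> ids) {
        return ids.isEmpty() ? "" : "'" + String.join("','", ids) + "'";
    }
}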
Win 2: Fixed connection pool exhaustion
jdk.SocketRead events showed database reads averaging 340 ms on a service that expected 20 ms. Cross-referencing with jdk.JavaMonitorBlocked events on HikariPool revealed threads waiting for connections. The pool was sized at 10 connections for a service handling 200 concurrent requests. Resizing it to 30 and adjusting the timeout configuration brought p99 latency from 2.1 s down to 180 ms.
Win 3: Caught a G1 region size misconfiguration
GC events showed frequent G1 Humongous Allocation causes. Humongous allocations happen when a single object is half a G1 region or larger. In this case a caching library was storing serialized objects that averaged 1.8 MB against a default region size of 2 MB. Adding -XX:G1HeapRegionSize=4m eliminated the humongous allocations and reduced GC pause frequency by 60%.
Key Takeaways
- JFR with settings=default runs at less than 1% overhead and is safe for permanent production enablement; use jcmd JFR.start during incidents rather than restarting the JVM.
- GC pause duration and frequency together reveal allocation pressure; alone they are ambiguous.
- jdk.JavaMonitorBlocked events find lock contention before it becomes obvious in thread dumps or APM latency metrics.
- Custom JFR events are zero-cost when recording is inactive: instrument business operations permanently and enable them when investigating anomalies.
- JMC automated analysis is the fastest path from a .jfr file to actionable findings; start there before exploring raw events.
- Correlating custom business events with JVM events is the most powerful use of JFR and the least commonly practiced.