Python for the JVM Engineer

Performance and Profiling

Ravinder · 5 min read
Tags: Python · JVM · Java · profiling · cProfile · py-spy · performance · flamegraph

A Java performance investigation has a well-worn playbook: attach async-profiler or enable Java Flight Recorder, load a flamegraph into JMC or Speedscope, identify the hot frame, fix the allocation pressure or lock contention, measure again. Python performance work follows the same scientific loop — but the tools differ, the baseline is slower, and the bottlenecks often look different than you expect.

The Performance Baseline Reality

Before profiling, recalibrate expectations. CPython executes roughly 10–50 million simple operations per second on modern hardware. A JVM with HotSpot JIT compiling hot paths will do 10–100x that. If your workload is pure computation, Python is the wrong tool for that layer — you need NumPy, C extensions, or a language boundary.

flowchart LR
    A["Pure Python loop"] --> B["~10M ops/sec"]
    C["NumPy vectorised"] --> D["~500M ops/sec (C)"]
    E["JVM (JIT warmed)"] --> F["~500M-1B ops/sec"]
    G["Native C/Rust"] --> H["~1-5B ops/sec"]

Profiling helps you find where Python time is actually spent — frequently the answer is "in a library's C extension, which is fast" or "in a pure Python loop you can vectorise."
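A quick way to feel this gap is to time the same computation both ways. This is a hedged sketch (it assumes NumPy is installed via pip install numpy; the function names are illustrative):

```python
# Pure Python loop vs NumPy vectorisation for the same sum-of-squares.
# Timings vary by hardware; expect NumPy to be roughly 10-100x faster
# because its loop runs in C instead of the bytecode interpreter.
import time
import numpy as np

def py_sum_squares(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total

def np_sum_squares(n: int) -> int:
    a = np.arange(n, dtype=np.int64)
    return int(np.dot(a, a))  # sum of squares as a dot product, entirely in C

n = 1_000_000
t0 = time.perf_counter()
py_result = py_sum_squares(n)
py_secs = time.perf_counter() - t0

t0 = time.perf_counter()
np_result = np_sum_squares(n)
np_secs = time.perf_counter() - t0

assert py_result == np_result  # same answer, very different cost
print(f"pure Python: {py_secs:.3f}s  NumPy: {np_secs:.3f}s")
```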

cProfile — The Built-in Deterministic Profiler

cProfile is the stdlib deterministic profiler — it instruments every function call and records the time spent in each. JFR's method profiling is sampling-based, so the closer Java analogue is the old -Xrunhprof:cpu=times deterministic profiler.

import cProfile
import pstats
import io
 
def slow_function():
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total
 
pr = cProfile.Profile()
pr.enable()
result = slow_function()
pr.disable()
 
s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats("cumulative")
ps.print_stats(10)   # top 10 by cumulative time
print(s.getvalue())

Or from the command line:

python -m cProfile -o profile.out my_script.py
python -m pstats profile.out

cProfile output:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.412    0.412    0.412    0.412 script.py:4(slow_function)
        1    0.000    0.000    0.412    0.412 {built-in method builtins.exec}

tottime is time in the function excluding callees (like JFR self-time). cumtime includes callees (like JFR total-time).
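The distinction is easiest to see with a nested call. In this minimal sketch (the function names are illustrative), outer spends almost all of its cumtime inside inner, so its own tottime is near zero:

```python
# tottime vs cumtime: outer() delegates all real work to inner(), so
# outer shows a large cumtime but a near-zero tottime in the report.
import cProfile
import pstats

def inner():
    return sum(i * i for i in range(200_000))

def outer():
    return inner()  # outer's cumtime includes inner; its tottime does not

pr = cProfile.Profile()
pr.enable()
outer()
pr.disable()

# print_stats accepts a regex restriction to filter the report
pstats.Stats(pr).sort_stats("cumulative").print_stats("inner|outer")
```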

Overhead: deterministic profiling adds 10–30% runtime overhead — do not run in production. Use it for local investigation.

line_profiler — Line-Level Granularity

When cProfile points to a function but not the specific line, line_profiler gives line-by-line breakdown:

# pip install line_profiler
from line_profiler import profile
 
@profile
def process_items(items: list[int]) -> int:
    total = 0
    for item in items:           # ← line_profiler shows this is 60% of time
        total += item * item
    return total
Run it with kernprof:

kernprof -l -v script.py

This is closer to async-profiler's line-level annotation in IntelliJ — you see exactly which statements are hot.

py-spy — The Production-Safe Sampling Profiler

py-spy is a sampling profiler (like async-profiler for the JVM) that attaches to a running process without modifying the code. It has near-zero overhead and works in production.

pip install py-spy
 
# Profile a running process by PID
py-spy top --pid 12345
 
# Generate a flamegraph
py-spy record -o profile.svg --pid 12345 --duration 30
 
# Profile a command directly
py-spy record -o profile.svg -- python my_app.py

The flamegraph output is identical in interpretation to an async-profiler flamegraph: x-axis is time proportion, y-axis is call stack depth. Wide frames at the top are your hotspots.

flowchart TD
    A["py-spy\n(attaches to running PID)"] -->|"samples call stacks at 100Hz"| B["stack samples"]
    B --> C["aggregate by call path"]
    C --> D["SVG flamegraph\n(identical to async-profiler output)"]
    D --> E["identify wide frames\n= bottlenecks"]

memray — Memory Profiling

memray (from Bloomberg) is the Python equivalent of JFR's memory allocation profiling:

pip install memray
 
# Profile memory allocations
python -m memray run --output profile.bin my_script.py
python -m memray flamegraph profile.bin   # generates HTML flamegraph
 
# Live view
python -m memray run --live my_script.py

Common memory bottlenecks for JVM engineers:

  • Large list comprehensions that could be generators
  • String concatenation in a loop ("a" + "b" creates a new object each iteration — use "".join(parts))
  • Holding references longer than needed (Python's reference counting relies on objects going out of scope)
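The first bullet is easy to verify with the stdlib tracemalloc module, without installing memray. A minimal sketch (the helper name is illustrative) comparing the peak allocation of a list comprehension against the equivalent generator expression:

```python
# Peak memory of a list comprehension vs a generator expression.
# The list materialises 100k int objects at once; the generator holds
# only one at a time, so its peak is orders of magnitude smaller.
import tracemalloc

def peak_kib(fn) -> float:
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1024

list_peak = peak_kib(lambda: sum([i * i for i in range(100_000)]))
gen_peak = peak_kib(lambda: sum(i * i for i in range(100_000)))
print(f"list: {list_peak:.0f} KiB  generator: {gen_peak:.0f} KiB")
```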

Benchmarking with timeit

For micro-benchmarks (equivalent to JMH), use timeit:

import timeit
 
# List comprehension vs generator for sum
list_time = timeit.timeit(
    "sum([i*i for i in range(10000)])",
    number=10_000
)
gen_time = timeit.timeit(
    "sum(i*i for i in range(10000))",
    number=10_000
)
print(f"list: {list_time:.3f}s  generator: {gen_time:.3f}s")

Or from the command line:

python -m timeit "sum(i*i for i in range(10000))"
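Like JMH's repeated forks, timeit.repeat runs the benchmark several independent times; the usual convention is to report the minimum, which is the least contaminated by scheduler noise. A small sketch:

```python
# timeit.repeat: several independent runs of the same benchmark.
# Take the minimum as the best estimate of the code's inherent cost.
import timeit

times = timeit.repeat(
    "sum(i*i for i in range(10_000))",
    repeat=5,      # 5 independent runs
    number=1_000,  # 1000 executions per run
)
print(f"best of 5: {min(times):.3f}s")
```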

Common Bottlenecks and Fixes

Symptom                         | Python cause                | Fix
--------------------------------|-----------------------------|--------------------------------
Slow loop over large data       | Pure Python iteration       | NumPy vectorisation
High memory, many small objects | __dict__ on many instances  | @dataclass(slots=True)
Slow JSON parsing               | json.loads is pure Python   | orjson or ujson (C extensions)
Slow regex                      | Complex backtracking        | re2 binding (linear-time)
Slow HTTP client                | requests (sync)             | httpx async or aiohttp
CPU-bound single-threaded       | GIL                         | multiprocessing, NumPy, Cython
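As a sketch of the last row's fix: CPU-bound work can be split across processes, each with its own interpreter and GIL. The chunking scheme and worker count here are illustrative, not a tuned recipe:

```python
# Sidestepping the GIL with multiprocessing: split a CPU-bound
# sum-of-squares across 4 worker processes and combine the results.
from multiprocessing import Pool

def sum_squares(bounds: tuple[int, int]) -> int:
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n = 4_000_000
    # Four non-overlapping chunks of 1M each
    chunks = [(i, i + 1_000_000) for i in range(0, n, 1_000_000)]
    with Pool(processes=4) as pool:
        total = sum(pool.map(sum_squares, chunks))
    print(total)
```

The guard on __name__ matters: on platforms that spawn rather than fork, each worker re-imports the module, and unguarded top-level code would recurse.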

The Investigation Loop

sequenceDiagram
    participant Dev
    participant cProfile
    participant pySpy
    participant Code
    Dev->>cProfile: profile locally
    cProfile-->>Dev: cumtime by function
    Dev->>Code: identify hotspot function
    Dev->>pySpy: attach in staging
    pySpy-->>Dev: flamegraph
    Dev->>Code: apply fix (vectorise / C ext)
    Dev->>cProfile: measure again

This mirrors the JFR + JMC workflow: record → analyse → hypothesise → fix → validate. The tools are different; the scientific method is identical.

Key Takeaways

  • cProfile is the stdlib deterministic profiler — use it locally; it adds 10–30% overhead so do not run in production.
  • py-spy is the production-safe sampling profiler (like async-profiler) — it attaches by PID, generates flamegraphs, and has near-zero overhead.
  • Flamegraph interpretation is identical between async-profiler and py-spy: wide frames = hot code paths.
  • memray is JFR's allocation profiling equivalent — use it to diagnose memory growth and excessive object creation.
  • The most common Python bottleneck for JVM engineers is pure-Python loops over large data — vectorise with NumPy before reaching for Cython or C extensions.
  • Always measure before and after — timeit for micro-benchmarks, end-to-end latency/throughput for macro-benchmarks.