Performance and Profiling
A Java performance investigation has a well-worn playbook: attach async-profiler or enable Java Flight Recorder, load a flamegraph into JMC or Speedscope, identify the hot frame, fix the allocation pressure or lock contention, measure again. Python performance work follows the same scientific loop — but the tools differ, the baseline is slower, and the bottlenecks often look different than you expect.
The Performance Baseline Reality
Before profiling, recalibrate expectations. CPython executes roughly 10–50 million simple operations per second on modern hardware. A JVM with HotSpot JIT compiling hot paths will do 10–100x that. If your workload is pure computation, Python is the wrong tool for that layer — you need NumPy, C extensions, or a language boundary.
Profiling helps you find where Python time is actually spent — frequently the answer is "in a library's C extension, which is fast" or "in a pure Python loop you can vectorise."
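To make "vectorise" concrete, here is a minimal sketch (assuming NumPy is installed; the function names are illustrative) of moving a sum-of-squares loop from interpreter-level iteration into a single array operation:

```python
import numpy as np

def sum_of_squares_python(n: int) -> int:
    # Pure Python: the interpreter executes bytecode for every iteration.
    total = 0
    for i in range(n):
        total += i * i
    return total

def sum_of_squares_numpy(n: int) -> int:
    # Vectorised: the loop runs inside NumPy's C code over the whole array at once.
    a = np.arange(n, dtype=np.int64)
    return int((a * a).sum())
```

On inputs in the millions the vectorised version is typically one to two orders of magnitude faster, but measure it on your own workload with timeit before committing to the rewrite.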
cProfile — The Built-in Deterministic Profiler
cProfile is the stdlib deterministic profiler: it instruments every function call and records call counts and timings. JFR's method profiling is sampling-based, so the closer Java analogue is the old -Xrunhprof:cpu=times deterministic profiler.
```python
import cProfile
import pstats
import io

def slow_function():
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total

pr = cProfile.Profile()
pr.enable()
result = slow_function()
pr.disable()

s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats("cumulative")
ps.print_stats(10)  # top 10 by cumulative time
print(s.getvalue())
```

Or from the command line:
```bash
python -m cProfile -o profile.out my_script.py
python -m pstats profile.out
```

cProfile output:
```text
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.412    0.412    0.412    0.412 script.py:4(slow_function)
        1    0.000    0.000    0.412    0.412 {built-in method builtins.exec}
```

tottime is time spent in the function itself, excluding callees (like JFR self-time). cumtime includes callees (like JFR total-time).
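The profile.out dump can also be loaded back into Python rather than browsed through the interactive pstats prompt; a small sketch, assuming the -o command above was run first:

```python
import pstats

# Load the dump written by `python -m cProfile -o profile.out my_script.py`
stats = pstats.Stats("profile.out")
stats.strip_dirs()              # shorten file paths in the report
stats.sort_stats("cumulative")  # same ordering used in the example above
stats.print_stats(10)           # top 10 entries by cumulative time
```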
Overhead: deterministic profiling adds 10–30% runtime overhead — do not run in production. Use it for local investigation.
line_profiler — Line-Level Granularity
When cProfile points to a function but not the specific line, line_profiler gives a line-by-line breakdown:
```python
# pip install line_profiler
from line_profiler import profile

@profile
def process_items(items: list[int]) -> int:
    total = 0
    for item in items:  # ← line_profiler shows this is 60% of time
        total += item * item
    return total
```

```bash
kernprof -l -v script.py
```

This is closer to async-profiler's line-level annotation in IntelliJ — you see exactly which statements are hot.
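The same report can be produced without kernprof by driving line_profiler programmatically; a sketch, assuming process_items is defined without the @profile decorator:

```python
from line_profiler import LineProfiler

def process_items(items: list[int]) -> int:
    total = 0
    for item in items:
        total += item * item
    return total

lp = LineProfiler()
profiled = lp(process_items)    # wrap the function, as @profile does under kernprof
profiled(list(range(100_000)))  # run the workload through the wrapped function
lp.print_stats()                # per-line hits, time, and % time
```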
py-spy — The Production-Safe Sampling Profiler
py-spy is a sampling profiler (like async-profiler for the JVM) that attaches to a running process without modifying the code. It has near-zero overhead and works in production.
```bash
pip install py-spy

# Profile a running process by PID
py-spy top --pid 12345

# Generate a flamegraph
py-spy record -o profile.svg --pid 12345 --duration 30

# Profile a command directly
py-spy record -o profile.svg -- python my_app.py
```

The flamegraph output is identical in interpretation to an async-profiler flamegraph: the x-axis is time proportion, the y-axis is call-stack depth. Wide frames at the top are your hotspots.
memray — Memory Profiling
memray (from Bloomberg) is the Python equivalent of JFR's memory allocation profiling:
```bash
pip install memray

# Profile memory allocations
python -m memray run --output profile.bin my_script.py
python -m memray flamegraph profile.bin  # generates HTML flamegraph

# Live view
python -m memray run --live my_script.py
```

Common memory bottlenecks for JVM engineers:
- Large list comprehensions that could be generators
- String concatenation in a loop (`"a" + "b"` creates a new str object each iteration — use `"".join(parts)` instead; see the sketch after this list)
- Holding references longer than needed (Python's reference counting relies on objects going out of scope)
Benchmarking with timeit
For micro-benchmarks (the closest stdlib analogue to JMH), use timeit:
```python
import timeit

# List comprehension vs generator for sum
list_time = timeit.timeit(
    "sum([i*i for i in range(10000)])",
    number=10_000
)
gen_time = timeit.timeit(
    "sum(i*i for i in range(10000))",
    number=10_000
)
print(f"list: {list_time:.3f}s  generator: {gen_time:.3f}s")
```

Or from the command line:

```bash
python -m timeit "sum(i*i for i in range(10000))"
```

Common Bottlenecks and Fixes
| Symptom | Python cause | Fix |
|---|---|---|
| Slow loop over large data | Pure Python iteration | NumPy vectorisation |
| High memory, many small objects | `__dict__` on every instance | `@dataclass(slots=True)` |
| Slow JSON parsing | stdlib `json.loads` overhead | `orjson` or `ujson` (C extensions) |
| Slow regex | Complex backtracking | `re2` binding (linear-time) |
| Slow HTTP client | `requests` (sync) | `httpx` async or `aiohttp` |
| CPU-bound single-threaded | GIL | `multiprocessing`, NumPy, Cython |
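For the `__dict__` row, the change is a single line on the class definition; a sketch using illustrative Point classes (slots=True requires Python 3.10+):

```python
from dataclasses import dataclass
import sys

@dataclass
class PointDict:          # each instance carries a per-instance __dict__
    x: float
    y: float

@dataclass(slots=True)    # attributes live in fixed slots; no per-instance __dict__
class PointSlots:
    x: float
    y: float

p_dict, p_slots = PointDict(1.0, 2.0), PointSlots(1.0, 2.0)
print(sys.getsizeof(p_dict) + sys.getsizeof(p_dict.__dict__))  # instance plus its dict
print(sys.getsizeof(p_slots))                                  # slots instance only
```

The per-instance saving looks modest, but across millions of small objects it adds up, and attribute access is typically slightly faster without the dict lookup.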
The Investigation Loop
The loop mirrors the JFR + JMC workflow: record → analyse → hypothesise → fix → validate. The tools are different; the scientific method is identical.
Key Takeaways
- `cProfile` is the stdlib deterministic profiler — use it locally; it adds 10–30% overhead, so do not run it in production.
- `py-spy` is the production-safe sampling profiler (like async-profiler) — it attaches by PID, generates flamegraphs, and has near-zero overhead.
- Flamegraph interpretation is identical between async-profiler and py-spy: wide frames = hot code paths.
- `memray` is the equivalent of JFR's allocation profiling — use it to diagnose memory growth and excessive object creation.
- The most common Python bottleneck for JVM engineers is pure-Python loops over large data — vectorise with NumPy before reaching for Cython or C extensions.
- Always measure before and after — `timeit` for micro-benchmarks, end-to-end latency/throughput for macro-benchmarks.