Property-Based Testing for Backend Engineers
Every example-based test is a hypothesis about your code. "Given this specific input, I expect this specific output." The problem is that production doesn't send you inputs you wrote in advance. Production finds the values you didn't think of — the Unicode string that breaks your slug generator, the order of operations that makes money arithmetic non-associative, the boundary exactly one off from the one you tested.
Property-based testing flips the model. Instead of specifying examples, you specify invariants: truths about your function that must hold for any valid input. The framework generates hundreds of inputs, tries to break your invariant, and when it succeeds, shrinks the failure to the smallest input that still breaks it. After running property tests in production codebases, I'm convinced they catch a category of bug that example tests structurally cannot.
The Mental Model: Properties vs. Examples
Consider a function that reverses a list. An example test:
def test_reverse():
assert reverse([1, 2, 3]) == [3, 2, 1]This passes. But it tests one input. A property test:
from hypothesis import given, strategies as st
@given(st.lists(st.integers()))
def test_reverse_involution(lst):
assert reverse(reverse(lst)) == lst
@given(st.lists(st.integers()))
def test_reverse_length_preserved(lst):
assert len(reverse(lst)) == len(lst)
@given(st.lists(st.integers()))
def test_reverse_elements_preserved(lst):
assert sorted(reverse(lst)) == sorted(lst)These three properties together fully specify what reverse must do for any list of integers. Hypothesis will generate empty lists, single-element lists, lists with duplicates, lists with negative integers, very large lists. The specification is the properties, not the examples.
Installing and Running Hypothesis
pip install hypothesisHypothesis integrates directly with pytest. Properties are ordinary test functions decorated with @given. No special runner required.
# tests/test_properties.py
from hypothesis import given, settings, assume, HealthCheck
from hypothesis import strategies as st
import pytestFor Java engineers, jqwik is the equivalent:
<!-- pom.xml -->
<dependency>
<groupId>net.jqwik</groupId>
<artifactId>jqwik</artifactId>
<version>1.8.4</version>
<scope>test</scope>
</dependency>import net.jqwik.api.*;
class ReverseProperties {
@Property
boolean reverseIsInvolution(@ForAll List<Integer> list) {
return reverse(reverse(list)).equals(list);
}
}Designing Good Properties
The hardest part of property-based testing is writing properties that are strong enough to be meaningful without accidentally reimplementing your function. Here are the patterns I reach for:
Round-trip properties: Encode and decode, serialize and deserialize, encrypt and decrypt. The composition should be identity.
@given(st.dictionaries(st.text(), st.integers()))
def test_json_roundtrip(data):
assert json.loads(json.dumps(data)) == dataThis looks trivial — and it is for simple data. But extend it to your domain types and it finds bugs:
from hypothesis import given, strategies as st
from myapp.models import Money
from myapp.serializers import MoneySerializer
@given(
amount=st.decimals(min_value=0, max_value=1_000_000, places=2),
currency=st.sampled_from(["USD", "EUR", "GBP", "JPY"])
)
def test_money_serialization_roundtrip(amount, currency):
original = Money(amount=amount, currency=currency)
serialized = MoneySerializer.serialize(original)
deserialized = MoneySerializer.deserialize(serialized)
assert deserialized == originalWe found a real bug with exactly this pattern: JPY amounts were being serialized with 2 decimal places (a float precision issue), then deserialized as a non-zero cent value. A bug that affected every JPY transaction but was never caught because our example tests used USD.
Invariant properties: Some truth that must hold regardless of input.
@given(st.decimals(min_value="0.01", max_value="1000000.00", places=2))
def test_tax_calculation_non_negative(amount):
tax = calculate_tax(amount, rate=Decimal("0.085"))
assert tax >= 0
@given(st.decimals(min_value="0.01", max_value="1000000.00", places=2))
def test_tax_never_exceeds_amount(amount):
tax = calculate_tax(amount, rate=Decimal("0.085"))
assert tax <= amountMetamorphic properties: If input A and input B are related in a known way, output A and output B must be related in a corresponding way.
@given(st.lists(st.integers(), min_size=1))
def test_sort_monotonic(lst):
sorted_lst = sort(lst)
for i in range(len(sorted_lst) - 1):
assert sorted_lst[i] <= sorted_lst[i + 1]
@given(
st.lists(st.integers(), min_size=1),
st.integers()
)
def test_sort_addition_order_independent(lst, extra):
# Adding an element doesn't change the relative order of existing elements
sorted_original = sort(lst)
sorted_with_extra = sort(lst + [extra])
original_in_new = [x for x in sorted_with_extra if x in lst or lst.count(x) > 0]
# This is a simplified metamorphic check — the real invariant is that
# elements not equal to `extra` maintain their relative order
assert len(sorted_with_extra) == len(lst) + 1A Real Bug: Money Arithmetic
Here's the actual Hypothesis test that found the money arithmetic bug we'd been shipping for three months. The issue: totaling a list of monetary amounts via sum() on Python Decimal objects was non-associative when one amount was zero.
from decimal import Decimal
from hypothesis import given, strategies as st, assume
def total_cart(items: list[Decimal]) -> Decimal:
# Buggy implementation
return sum(items) # sum() starts from int 0, not Decimal("0")
@given(st.lists(
st.decimals(min_value="0.00", max_value="9999.99", places=2, allow_nan=False, allow_infinity=False),
min_size=1,
max_size=20
))
def test_total_cart_type_stable(items):
result = total_cart(items)
# Result must be Decimal, not float or int
assert isinstance(result, Decimal), f"Expected Decimal, got {type(result)}: {result}"
@given(st.lists(
st.decimals(min_value="0.00", max_value="9999.99", places=2, allow_nan=False, allow_infinity=False),
min_size=1,
max_size=20
))
def test_total_cart_is_sum_of_parts(items):
total = total_cart(items)
# Splitting the list and summing parts should give the same result
mid = len(items) // 2
first_half = total_cart(items[:mid]) if items[:mid] else Decimal("0")
second_half = total_cart(items[mid:]) if items[mid:] else Decimal("0")
assert total == first_half + second_halfHypothesis found [Decimal("0.00")] as a counterexample to the first property. sum([Decimal("0.00")]) returns 0 (an integer), because Python's sum() initializes its accumulator to 0. The fix is sum(items, Decimal("0")). An obvious mistake — but we'd been testing with non-zero amounts in our example tests.
Shrinking: Why It Matters
When Hypothesis finds a failing input, it doesn't stop. It tries to shrink the input to the smallest, simplest form that still fails. This is not a minor convenience — it's the feature that makes property-based testing usable in production.
A raw counterexample might be a 847-element list with random values. The shrunk counterexample might be [0, -1]. The second is actionable. The first requires a debugger.
Shrinking is automatic, but you can guide it by writing composite strategies for domain types:
from hypothesis import strategies as st
@st.composite
def order(draw):
items = draw(st.lists(
st.fixed_dictionaries({
"price": st.decimals(min_value="0.01", max_value="999.99", places=2),
"quantity": st.integers(min_value=1, max_value=100),
"sku": st.text(min_size=3, max_size=20, alphabet=st.characters(whitelist_categories=("Lu", "Nd")))
}),
min_size=1,
max_size=10
))
return {"items": items, "currency": draw(st.sampled_from(["USD", "EUR"]))}
@given(order())
def test_order_total_matches_line_items(order):
total = calculate_order_total(order)
line_item_sum = sum(
item["price"] * item["quantity"]
for item in order["items"]
)
assert total == line_item_sumIntegrating with CI
Property tests are slower than example tests because they run many inputs. Configure the count deliberately:
from hypothesis import settings, HealthCheck
# Fast during development
@settings(max_examples=50)
@given(st.text())
def test_slug_generation_fast(text):
...
# Thorough in CI
@settings(max_examples=500, suppress_health_check=[HealthCheck.too_slow])
@given(st.text())
def test_slug_generation_thorough(text):
...Use the HYPOTHESIS_DATABASE_FILE environment variable to persist the Hypothesis database in CI, so found counterexamples are always replayed:
- name: Run property tests
env:
HYPOTHESIS_DATABASE_FILE: .hypothesis/ci-db
run: pytest tests/ -k property -v
- name: Cache Hypothesis database
uses: actions/cache@v4
with:
path: .hypothesis
key: hypothesis-${{ hashFiles('tests/**/*.py') }}jqwik for Java Engineers
The same patterns apply in Java with jqwik. Here's the money serialization property:
import net.jqwik.api.*;
import net.jqwik.api.constraints.*;
import java.math.BigDecimal;
class MoneyProperties {
@Property
boolean serializationRoundtrip(
@ForAll @BigRange(min = "0.00", max = "1000000.00") BigDecimal amount,
@ForAll @From("currencies") String currency
) {
Money original = new Money(amount, currency);
String serialized = MoneySerializer.serialize(original);
Money deserialized = MoneySerializer.deserialize(serialized);
return original.equals(deserialized);
}
@Provide
Arbitrary<String> currencies() {
return Arbitraries.of("USD", "EUR", "GBP", "JPY", "CHF");
}
@Property
boolean taxNeverExceedsAmount(
@ForAll @DoubleRange(min = 0.01, max = 1000000.0) double rawAmount
) {
BigDecimal amount = BigDecimal.valueOf(rawAmount).setScale(2, RoundingMode.HALF_UP);
BigDecimal tax = TaxCalculator.calculate(amount, new BigDecimal("0.25"));
return tax.compareTo(amount) <= 0;
}
}Key Takeaways
- Property-based testing covers a category of bug that example tests structurally cannot: the input you never thought to try. It complements, not replaces, example-based tests.
- Start with round-trip properties on your serialization and deserialization code — they have strong invariants and find bugs immediately.
- Hypothesis's shrinking is what makes counterexamples actionable. A framework without shrinking produces noise; Hypothesis produces minimal, reproducible failing inputs.
- Design composite strategies for your domain types so generated inputs are always semantically valid and shrunk counterexamples are readable.
- Persist the Hypothesis database in CI so previously discovered counterexamples are always replayed — this is cheap insurance against regressions.
- Money, pagination offsets, Unicode strings, and empty collections are where property tests earn their keep fastest. Start there before expanding to your full domain.