Trade-off Vocabulary That Lands
The most common sentence in system design interviews is also the most abused: "well, according to the CAP theorem, we can only have two of three." Interviewers have heard this hundreds of times. Most of the time it signals that the candidate memorized a triangle and stopped thinking. Knowing the vocabulary is necessary — using it precisely, in context, with concrete implications, is what actually moves the conversation.
CAP: What It Actually Says
CAP theorem states that in the presence of a network partition, a distributed system must choose between consistency and availability. Not "pick any two of three" as a general design principle — specifically, what happens when nodes cannot communicate.
Consistency (in CAP) means linearizability: every read reflects the most recent write, as if the system were a single machine. This is a strong guarantee.
Availability means every request received by a non-failing node gets a response (not necessarily one reflecting the most recent write).
Partition tolerance is not optional in any system that crosses a network boundary. Partitions happen — cables get cut, switches fail, cloud availability zones lose connectivity. Rejecting partition tolerance means rejecting distributed systems entirely.
So the real choice is: when a partition occurs, do you return an error (choose C) or return potentially stale data (choose A)? That is the decision. Everything else is context.
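That decision can be made concrete in a few lines. The sketch below is illustrative only (the `Replica` class and its fields are hypothetical): during a partition, a CP replica refuses to answer, while an AP replica serves whatever it has locally.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    local_value: str   # local copy, possibly stale
    partitioned: bool  # True when the primary is unreachable
    mode: str          # "CP" or "AP"

    def read(self) -> str:
        if self.partitioned:
            if self.mode == "CP":
                # Choose consistency: refuse to answer rather than risk staleness.
                raise TimeoutError("primary unreachable; retry later")
            # Choose availability: answer with the best data we have.
            return self.local_value
        # No partition: the normal path serves the up-to-date local state.
        return self.local_value
```

Note that when `partitioned` is False, the two modes behave identically; CAP only constrains the partition branch.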
PACELC: The More Useful Extension
The problem with CAP is that partitions are relatively rare in well-engineered networks. PACELC asks what you trade off when there is no partition — which is most of the time:
- If there is a Partition: choose Availability or Consistency
- Else (no partition): choose Latency or Consistency
This is much more useful for real design conversations. DynamoDB in its default configuration is PA/EL — it sacrifices consistency during partitions and prefers low latency over strict consistency in normal operation. Spanner is PC/EC — it maintains consistency during partitions and trades latency for consistency at all times.
When you are designing a user profile service, the interviewer probably does not care about partition behavior. They care about the daily normal-operation trade-off: do you want 2ms reads with potential staleness, or 8ms reads with strong consistency? That is an EL vs EC question.
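The EL/EC choice often surfaces as a single read parameter. Here is a toy model (all names hypothetical) of a profile store with an asynchronously lagging replica: `strong=True` is the EC choice (pay coordination latency for freshness), `strong=False` is the EL choice (fast replica read, possibly stale).

```python
class ProfileStore:
    """Toy model of the PACELC EL-vs-EC read choice."""

    def __init__(self):
        self.primary = {}   # always current
        self.replica = {}   # lags until replication runs

    def write(self, key, value):
        self.primary[key] = value  # replica catches up asynchronously

    def replicate(self):
        # Stand-in for the async replication process.
        self.replica.update(self.primary)

    def read(self, key, strong=False):
        # strong=True  -> EC: read the primary, fresh but slower in practice
        # strong=False -> EL: read the replica, fast but possibly stale
        return (self.primary if strong else self.replica).get(key)
```

Real stores expose the same knob, e.g. a consistent-read flag on the request, but the mechanics above are the essence of it.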
Naming Consistency Models Precisely
"Eventually consistent" is not one thing. There is a spectrum, and using precise names signals depth:
Linearizability (strong consistency): reads always return the most recent write. Requires coordination on every operation — expensive, slow. Use for: financial balances, inventory counts, seat booking.
Sequential consistency: operations appear to execute in some global order consistent with each program order, but not necessarily real-time order. Slightly weaker than linearizability, rarely used as an explicit design choice.
Causal consistency: if operation A caused operation B, all processes see A before B. Comments appear after the posts they reply to. Use for: distributed social apps, collaborative editing.
Read-your-writes (session consistency): after you write, you always read your own write; other users are not guaranteed to see it immediately. Use for: profile updates (you see your own changes immediately), social posts (you see your own post immediately).
Eventual consistency: given no new writes, all replicas will converge. Weak guarantee — no timing bound, no ordering guarantee. Use for: analytics aggregates, social feed counts, DNS propagation.
| Model | Strength | Common Use Case | Latency Cost |
|---|---|---|---|
| Linearizable | Highest | Payments, inventory | High |
| Causal | Medium | Social replies, chat | Medium |
| Read-your-writes | Medium | Profile edits | Low-medium |
| Eventual | Weakest | Counters, feeds, DNS | Low |
Availability: Nines Are Not Enough
"We need five nines" is another phrase that lands badly without context. Availability percentages only matter when you pair them with what constitutes downtime and over what time window.
99.9% availability = ~8.76 hours of downtime per year = ~44 minutes per month. For a social feed, tolerable. For a payment processor at peak, catastrophic.
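The arithmetic behind the nines is worth internalizing; a one-line helper makes it concrete:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_per_year_hours(availability: float) -> float:
    """Allowed downtime per year, in hours, for a given availability fraction."""
    return (1 - availability) * HOURS_PER_YEAR

# 99.9%  ("three nines") -> ~8.76 hours/year
# 99.99% ("four nines")  -> ~53 minutes/year
# 99.999% ("five nines") -> ~5.3 minutes/year
```

Each extra nine cuts the budget by 10x, which is why "we need five nines" is an engineering commitment, not a slogan.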
The more useful framing: what is the blast radius of an outage? A 5-minute outage during off-peak hours has different business impact than a 5-minute outage during Black Friday checkout. Design your SLA around business risk, not raw percentage.
Also: availability is not just about uptime. A service that is "up" but returning 30% errors is not available. Your SLA should include error rate budgets (e.g., "< 0.1% 5xx responses") not just uptime.
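A request-based error budget captures this better than wall-clock uptime. A minimal sketch (the function and its defaults are illustrative, not a standard API):

```python
def error_budget_remaining(slo: float, total_requests: int, failed: int) -> float:
    """Fraction of the window's error budget still unspent.

    slo: target success rate, e.g. 0.999 allows 0.1% of requests to fail.
    """
    budget = (1 - slo) * total_requests  # allowed failed requests this window
    if budget == 0:
        return 0.0  # an SLO of 100% leaves no budget at all
    return max(0.0, 1 - failed / budget)
```

A service that served a million requests with 500 5xx responses under a 99.9% SLO has burned exactly half its budget, even if it never "went down."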
Using Trade-off Language in the Interview
The goal is to make every architectural choice sound deliberate. Compare:
Weak framing: "I'll use eventual consistency because it's faster."
Strong framing: "For the user timeline cache, I'll accept read-your-writes consistency but not linearizability. After a user posts, they should see their own post immediately — I'll route their next few reads to the primary for 5 seconds after a write. For everyone else's reads, eventual consistency through the replica is fine; a 1–2 second lag in seeing someone else's post is imperceptible."
The strong framing names the model, explains the business justification, and describes the implementation mechanism. That is a complete trade-off argument.
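The routing mechanism in the strong framing is simple to sketch. This is a hypothetical `SessionRouter`, assuming a primary/replica split and a 5-second pin window as described above:

```python
import time

class SessionRouter:
    """Pin a user's reads to the primary briefly after their own write,
    giving read-your-writes without paying for linearizability globally."""

    PIN_SECONDS = 5.0  # illustrative window, per the example above

    def __init__(self):
        self.last_write = {}  # user_id -> monotonic timestamp of last write

    def record_write(self, user_id: str) -> None:
        self.last_write[user_id] = time.monotonic()

    def choose_backend(self, user_id: str) -> str:
        t = self.last_write.get(user_id)
        if t is not None and time.monotonic() - t < self.PIN_SECONDS:
            return "primary"  # the writer sees their own post immediately
        return "replica"      # everyone else tolerates 1-2s replication lag
```

In production this state typically lives in the session (a cookie or token carrying the last-write timestamp) rather than in router memory, but the decision logic is the same.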
Partition Handling: The Practical Question
When you do need to address partition behavior explicitly (geo-distributed systems, multi-region designs), stop and ask: what is worse — an error message or wrong data?
Wrong data is worse than an error: financial transactions, booking confirmations, authentication. Choose CP. Return an error and let the caller retry or escalate.
An error is worse than stale data: DNS lookups (stale cached record is fine, no response is catastrophic), social feeds, product catalogs. Choose AP. Return the best data you have.
In practice, this often becomes a mixed strategy: most of your system is AP by default, with a small CP core for the data that absolutely cannot be wrong (account balances, write-once identifiers). Design the blast radius: minimize how much of the system must be CP.
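The mixed strategy can be made explicit in the routing layer. A sketch with hypothetical data-class names, showing the CP core as an allowlist rather than the default:

```python
# Small, explicit CP core: only the data that absolutely cannot be wrong.
CP_DATA_CLASSES = {"account_balance", "seat_reservation", "auth_credentials"}

def handler_for(data_class: str) -> str:
    """Route each data class to the matching subsystem.

    Everything not explicitly listed falls through to the AP default,
    which keeps the CP blast radius as small as possible.
    """
    return "cp_core" if data_class in CP_DATA_CLASSES else "ap_default"
```

Making AP the fall-through case encodes the design principle directly: adding data to the CP core is a deliberate, reviewed decision, not an accident.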
Vocabulary Traps to Avoid
"CAP theorem means we can't have all three." Partition tolerance is not a design variable — it is a given for any networked system. The real choice is always C vs. A during partitions.
"We'll use eventual consistency for performance." Eventual consistency is not a performance optimization technique. It is a correctness tradeoff. If your bottleneck is something else (CPU, I/O, connection pooling), eventual consistency will not fix it.
"We need strong consistency." Linearizability is extremely expensive in distributed systems. You almost always need something weaker — read-your-writes or causal consistency — not full linearizability. Know which operations truly need the strongest guarantee and apply it narrowly.
Key Takeaways
- CAP's real question is narrow: when a partition occurs, do you return an error or stale data? Partition tolerance is not a design choice — it is a given.
- PACELC is more practically useful: it asks about latency vs. consistency during normal operation, which is the trade-off you make daily.
- Know the consistency model spectrum precisely — linearizable, causal, read-your-writes, eventual — and match each to a concrete use case.
- Availability SLAs need error budgets and blast-radius framing, not just uptime percentages.
- Strong tradeoff arguments name the model, justify it with business impact, and describe the implementation mechanism.
- Apply CP behavior narrowly to the data that cannot be wrong; let the rest of the system be AP for resilience and performance.