The Follow-ups They Don't Tell You About
← Part 9
Chat Walkthrough
The first 40 minutes of a system design interview are largely predictable: clarify scope, estimate capacity, sketch the happy path, discuss trade-offs. Most candidates prepare for this part. The follow-ups in the final 10–20 minutes are where the interview actually separates levels, and they almost always fall into three categories: failure scenarios, billing and metering, and regulatory or compliance probes. These are the questions that reveal whether you have run a system in production or only designed one on a whiteboard.
Category 1: Failure Scenarios
Interviewers at senior and staff levels pivot to failure late in the interview. They are not trying to trick you — they are testing whether your design has observable, operable, and recoverable failure modes.
The Cascade Failure Probe
"What happens if your cache layer goes down?"
The naive answer: "Requests fall through to the database." The interviewer follow-up: "Your cache was absorbing 90% of read traffic. The database now receives 10× its normal load. What happens?"
The right answer addresses cascade prevention:
Circuit breaker on the DB: if error rate on DB reads exceeds a threshold (e.g., 50% over 10 seconds), open the circuit — return cached stale data or a degraded response rather than hammering the DB. This requires that your system can serve from stale data for some SLAs.
Graduated cache failure: if you have a two-tier cache (L1 in-process, L2 Redis), losing Redis does not immediately expose the database — L1 absorbs some traffic. Size L1 for this scenario.
Pre-provisioned read replicas: keep headroom in your DB read replica fleet for the cache-absent scenario. If your cache normally absorbs 9,000 of 10,000 RPS, your DB replica fleet should be sized for 10,000 RPS, not 1,000.
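The circuit-breaker idea above can be sketched in a few lines. This is a minimal illustration, not a production library: the threshold, window, and cooldown values are the illustrative numbers from the text, and `db_read` / `fallback` are hypothetical callables standing in for your data path and your stale-data response.

```python
import time

class CircuitBreaker:
    """Minimal sketch: open the circuit when the DB read error rate
    crosses a threshold inside a sliding window, then serve a degraded
    fallback until a cooldown elapses."""

    def __init__(self, error_threshold=0.5, window_secs=10, cooldown_secs=30):
        self.error_threshold = error_threshold
        self.window_secs = window_secs
        self.cooldown_secs = cooldown_secs
        self.events = []          # (timestamp, was_error) pairs
        self.opened_at = None

    def _error_rate(self, now):
        # Keep only events inside the sliding window before computing the rate.
        self.events = [(t, e) for (t, e) in self.events
                       if now - t <= self.window_secs]
        if not self.events:
            return 0.0
        return sum(e for (_, e) in self.events) / len(self.events)

    def call(self, db_read, fallback):
        now = time.monotonic()
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown_secs:
                return fallback()          # circuit open: degraded response
            self.opened_at = None          # cooldown elapsed: try the DB again
        try:
            result = db_read()
            self.events.append((now, 0))
            return result
        except Exception:
            self.events.append((now, 1))
            if self._error_rate(now) >= self.error_threshold:
                self.opened_at = now       # open: stop hammering the DB
            return fallback()
```

The key property for the interview answer: once open, the breaker stops sending traffic to the database entirely, which is what prevents the cascade.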
The Message Queue Backlog
"What happens if your Kafka consumer group falls behind?"
Consumer lag is a real production incident pattern. If your fanout workers process 10,000 messages/second and message rate spikes to 30,000/second, you build a backlog of 20,000 messages/second. If this continues for 10 minutes, you have 12 million unprocessed messages.
Answers interviewers want to hear:
- Monitoring: consumer lag as a first-class metric with alerts at 30-second, 5-minute, and 30-minute thresholds. Different thresholds trigger different responses.
- Auto-scaling consumers: horizontal autoscaling of consumer instances triggered by lag depth. But — each partition can only be consumed by one consumer, so adding consumers beyond the partition count does nothing. Partition count must be planned for peak consumer parallelism.
- Backpressure to producers: in extreme cases, the consumer service returns a 503 to producers or signals the write path to shed non-critical writes.
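The partition-count ceiling on autoscaling can be made concrete with a small sizing function. This is a sketch with illustrative inputs (per-consumer drain rate and drain target are assumptions, not numbers from any real autoscaler):

```python
def target_consumer_count(lag_messages, drain_rate_per_consumer,
                          drain_target_secs, partition_count, current_count):
    """How many consumers would drain the backlog within the target time,
    capped at the partition count: in Kafka, each partition is consumed by
    at most one consumer in a group, so extra consumers sit idle."""
    needed = -(-lag_messages // (drain_rate_per_consumer * drain_target_secs))  # ceil
    return max(current_count, min(needed, partition_count))
```

With the article's 12 million message backlog, 1,000 msg/s per consumer, and a 10-minute drain target, you need 20 consumers; with only 16 partitions you are capped at 16, which is exactly why partition count must be planned for peak parallelism up front.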
The Split-Brain Database Scenario
"Your primary database fails. Automatic failover promotes a replica. The old primary comes back online. What happens?"
Without fencing, you have two nodes that both believe they are the primary — a split-brain scenario. Writes to both nodes diverge. This is a data corruption incident.
Answers:
- STONITH (Shoot The Other Node In The Head): the old primary is fenced at the network level before the new primary takes writes. Modern managed databases (RDS Multi-AZ, Cloud SQL) do this automatically.
- Epoch numbers / term IDs: every primary election increments a term counter. Writes are only accepted from the node with the highest term. The old primary, on returning, sees a higher term and demotes itself.
- Manual promotion in high-stakes systems: for financial or inventory systems, automatic failover to a replica with potential replication lag is not acceptable. You require human confirmation before promoting a replica that may be seconds behind.
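The term-ID mechanism can be shown with a toy model. This is not any specific database's implementation, just the core invariant: a node accepts writes only while it holds the highest term it has seen.

```python
class ReplicaNode:
    """Toy model of term-based fencing after failover."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.current_term = 0
        self.is_primary = False

    def promote(self, new_term):
        # Each election hands out a strictly higher term number.
        self.current_term = new_term
        self.is_primary = True

    def accept_write(self, request_term):
        # A stale primary sees a higher term on incoming traffic
        # and demotes itself instead of accepting divergent writes.
        if request_term > self.current_term:
            self.current_term = request_term
            self.is_primary = False
        return self.is_primary
```

When the old primary comes back and sees requests stamped with the new, higher term, it steps down on its own; that is the property that prevents split-brain writes from diverging.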
Category 2: Billing and Metering
Billing questions appear most often in platform/API company interviews (Stripe, Twilio, AWS) and in any context where your design must track usage for monetization. They reveal whether you understand correctness requirements that go beyond application logic.
The Double-Count Problem
"How do you ensure a user is never billed twice for the same event?"
At-least-once message delivery (the default in Kafka, SQS, and most queues) means your billing consumer will sometimes process the same event more than once. A naive counter increment is non-idempotent.
Solutions:
Idempotency keys: each billable event carries a unique event ID. Before incrementing the usage counter, check if this event ID has been seen:
```python
def process_billing_event(event):
    key = f"billed:{event.id}"
    # SET with NX and EX in a single command is atomic. A separate
    # SETNX + EXPIRE pair can leave a key with no TTL if the process
    # dies between the two calls.
    if redis.set(key, "1", nx=True, ex=86400 * 7):  # 7-day dedup window
        increment_usage_counter(event.user_id, event.units)
    # else: duplicate, skip silently
```

Transactional outbox: write the usage increment and a "processed" marker in the same database transaction. The consumer only commits the Kafka offset after successfully writing both. No double-count possible.
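The same-transaction pattern can be sketched with SQLite standing in for the real database; the table and function names here are illustrative, not from any framework. The point is that the "processed" marker and the counter update commit atomically, so a redelivered event becomes a no-op.

```python
import sqlite3

def process_with_marker(conn, event_id, user_id, units):
    """Sketch: the usage increment and the processed marker commit in one
    transaction, so duplicate deliveries cannot double-count."""
    try:
        with conn:  # one transaction: both writes commit, or neither does
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event_id,),
            )  # PRIMARY KEY violation here means this event was already billed
            conn.execute(
                "UPDATE usage_counters SET units = units + ? WHERE user_id = ?",
                (units, user_id),
            )
        return True   # safe to commit the Kafka offset now
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: skip, still commit the offset

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE usage_counters (user_id TEXT PRIMARY KEY, units INTEGER)")
conn.execute("INSERT INTO usage_counters VALUES ('u1', 0)")
```

Unlike the Redis dedup key, this marker lives in the same durable store as the counter, so there is no window where one write survives a crash and the other does not.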
Usage Aggregation at Scale
"You're metering API calls at 100K RPS. How do you aggregate usage per customer per billing period?"
100K writes/second to a per-customer counter is too hot for a relational database. Pattern:
- In-memory local aggregation: each API server accumulates counts in memory per customer per 10-second window.
- Periodic flush to a time-series store: every 10 seconds, each server flushes its local counts to a usage aggregation store (TimescaleDB, InfluxDB, or a custom Cassandra table).
- Billing rollup job: a nightly batch job sums the 10-second windows into daily and monthly totals, writes to a billing ledger (append-only, durable).
The 10-second local aggregation window means usage metering can be up to 10 seconds stale — acceptable for billing (not real-time), and it reduces write pressure on the aggregation store by orders of magnitude.
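The in-memory aggregation step can be sketched as a small accumulator; the flush timer and the downstream store are out of scope here, and the class name is illustrative.

```python
import threading
from collections import Counter

class LocalUsageAggregator:
    """Sketch of per-server local aggregation: cheap in-process increments
    on the hot path, one aggregate row per customer per flush window."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def record(self, customer_id, units=1):
        with self._lock:
            self._counts[customer_id] += units  # no network call per request

    def drain(self):
        # Called every ~10s by a flush timer: atomically swap out the
        # window's counts and write them to the aggregation store in batch.
        with self._lock:
            batch, self._counts = self._counts, Counter()
        return dict(batch)
```

A server handling thousands of requests per second for the same customer emits one row per window instead of one write per request, which is where the orders-of-magnitude reduction in write pressure comes from.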
Category 3: Regulatory and Compliance Probes
These questions appear at companies in regulated industries (fintech, healthcare, enterprise SaaS) and increasingly in consumer tech. They test whether you understand that systems operate in legal and regulatory contexts, not just technical ones.
The Right-to-Erasure Question
"A user exercises their GDPR right to erasure. How does your system delete all their data?"
Most systems are not designed for deletion. Data is replicated, cached, archived, and indexed. A right-to-erasure request requires:
- Data inventory: you must know every place user data lives. If you do not maintain a data map, you cannot confidently execute an erasure.
- Cascading deletes vs. tombstoning: hard deletes from a distributed system are expensive and slow. A common pattern is to tombstone the user record (mark as deleted, purge PII fields) and propagate the tombstone event to all downstream systems via an event stream. Downstream systems (caches, search indexes, analytics pipelines) consume the event and remove the data.
- Backup retention: GDPR allows you to retain data in backups for the backup's normal retention period, as long as it is not restored for processing. You must document this exception clearly.
- Completion timeline: GDPR requires erasure without undue delay, and in any event within one month of the request. Your deletion pipeline must have SLA monitoring.
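The tombstone-and-propagate pattern can be sketched with plain dictionaries standing in for the user store and the event stream; the field names and event type are hypothetical.

```python
import time

def tombstone_user(db, publish, user_id):
    """Sketch: purge PII in place, keep the record for referential
    integrity, then emit a tombstone event for downstream systems
    (caches, search indexes, analytics) to drop their copies."""
    db[user_id] = {
        "id": user_id,
        "deleted": True,               # record survives, PII does not
        "email": None,
        "name": None,
    }
    publish({
        "type": "user.erased",
        "user_id": user_id,
        "requested_at": time.time(),   # lets SLA monitoring track the deadline
    })

events = []
db = {"u1": {"id": "u1", "deleted": False, "email": "a@b.c", "name": "Ada"}}
tombstone_user(db, events.append, "u1")
```

Downstream consumers subscribe to the tombstone event exactly as they would to any other domain event, which is why this scales better than synchronously deleting from every system.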
The Audit Log Question
"How do you prove to a regulator what happened to a specific record over its lifetime?"
Audit logs must be:
- Append-only: no modification or deletion of audit records. Immutable storage (AWS S3 Object Lock, WORM storage) prevents tampering.
- Cryptographically verifiable: each log entry is signed or chained (hash of previous entry included in current entry) so you can detect gaps or alterations.
- Complete: every state transition of a regulated record must be captured — who changed it, when, from what value, to what value, from which IP, under which authorization.
A minimal audit entry shape:

```
audit_log record:
{record_id, event_type, actor_id, actor_ip, before_state, after_state,
 timestamp, prev_entry_hash, entry_hash}
```

This is not your application log. It is a purpose-built, compliance-grade ledger — a different infrastructure component.
The Regulatory Data Residency Question
"Your product is used in the EU. What does data residency compliance require?"
The answer spans architecture and operations:
- Data stored in EU regions only: no replication of EU user PII to non-EU regions. In AWS terms: restrict to eu-west-1/eu-central-1, disable cross-region replication for PII-containing tables.
- Processing in-region: EU user data must be processed by infrastructure in the EU. Routing EU requests to US servers for computation violates data residency requirements.
- Contractual controls with sub-processors: every third-party service that touches EU data (logging provider, analytics vendor, email provider) must have a DPA (Data Processing Agreement) in place.
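The storage-side rule can be expressed as a small policy guard. This is a sketch: the region names come from the AWS example above, and the function and classification labels are hypothetical.

```python
EU_REGIONS = {"eu-west-1", "eu-central-1"}  # allow-list from the example above

def replication_targets(candidate_regions, contains_pii, user_region_class):
    """Sketch of a residency guard: EU-user PII may only replicate to
    EU regions; everything else replicates freely."""
    if contains_pii and user_region_class == "EU":
        return [r for r in candidate_regions if r in EU_REGIONS]
    return list(candidate_regions)
```

Enforcing this as a check in the replication path (rather than as a convention) is what makes the residency claim auditable.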
Tying It Together: The Follow-up Mindset
The common thread across all three categories is that follow-up questions test operational maturity, not architectural novelty. The candidate who answers "What happens when your cache goes down?" with a cascade prevention strategy, a monitoring story, and a degraded-mode response plan is demonstrating that they have been on call. The candidate who says "we'd fix the cache quickly" is demonstrating that they have not.
Before your next interview, prepare two or three failure scenarios for every major subsystem in your design. Prepare a metering and billing story for any system that handles user actions you would eventually monetize. And if the company is in a regulated space, prepare your data residency and erasure answers before you walk in.
Key Takeaways
- Cascade failure questions test whether your design has circuit breakers, degraded modes, and pre-provisioned headroom — not just a happy path.
- Consumer lag in message queues is a first-class failure mode; plan for monitoring thresholds, horizontal scaling limits (bounded by partition count), and backpressure.
- Billing correctness requires idempotency at the consumer level — at-least-once delivery is the default; double-counting is the failure mode.
- Right-to-erasure is not a delete button — it requires a data inventory, tombstone propagation, and a pipeline with a 30-day SLA.
- Audit logs for compliance are a separate infrastructure from application logs — append-only, cryptographically verifiable, and purpose-built.
- Preparing failure, billing, and regulatory answers before the interview signals operational maturity, which is the distinguishing factor at senior and staff levels.