Designing for Graceful Degradation
The Binary Failure Model Is Wrong
Most systems are designed with an implicit assumption: either the system works, or it does not. Engineers build for the happy path. When something breaks, the system either errors out or — worse — silently returns bad data. Users see 500s. On-call engineers get paged. The incident begins.
This is the wrong mental model. Real systems degrade. A database goes slow before it goes down. An upstream API starts timing out on one endpoint before the whole service fails. A CDN saturates in one region before requests fail globally. The question is not whether your system will experience partial failure — it will — but whether the partial failure surfaces as a cliff edge or a staircase.
Graceful degradation means designing the staircase: a deliberate hierarchy of reduced functionality that serves users progressively less capability rather than zero capability. This requires building the degradation modes before the outage, not during it.
This post covers how to design degradation hierarchies, implement fallback caches, build read-only modes, and wire feature flags into your reliability architecture — with examples from real outages.
Degradation Hierarchies: Define Your Stairs
The first step is to enumerate your system's functionality in order of criticality. Not every feature is equally important. An e-commerce platform can survive with read-only product listings even if the cart is down. A dashboard application can show yesterday's data if today's pipeline is broken.
For every system I build, I define the degradation levels explicitly before launch: a normal operating level plus three progressively reduced levels of capability.
- Level 0 (operational): full functionality, the baseline experience.
- Level 1 (degraded): non-critical features disabled; core workflows still work, possibly on stale data.
- Level 2 (minimal): reads only, served from caches or pre-computed snapshots; writes disabled.
- Level 3 (maintenance): effectively offline; users see a clearly communicated maintenance state.
For a hypothetical SaaS analytics product, this might look like:
| Feature | Level 0 | Level 1 | Level 2 | Level 3 |
|---|---|---|---|---|
| View dashboard | Live data | Stale data (1h) | Pre-computed snapshot | No |
| Create reports | Yes | Yes | No | No |
| Export data | Yes | Queued | No | No |
| Edit settings | Yes | Yes | No | No |
| Invitations | Yes | No | No | No |
| Real-time updates | Yes | No | No | No |
Writing this table forces a conversation about what "broken" actually means. Level 1 is not failure — it is a degraded mode with a clearly communicated experience. Define it before the incident, not during.
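One way to keep the table honest is to encode it as data the application can query, so request handlers and the status endpoint share a single definition. A minimal sketch in Python, assuming illustrative feature names and the four levels described above:

```python
# Minimal sketch: the degradation table as data the application can query.
# Feature names and level assignments are illustrative, not a fixed schema.
from enum import IntEnum

class Level(IntEnum):
    OPERATIONAL = 0   # full functionality
    DEGRADED = 1      # non-critical features off, stale data allowed
    MINIMAL = 2       # reads from snapshots only, writes disabled
    MAINTENANCE = 3   # effectively offline

# Highest degradation level at which each feature is still available.
FEATURE_MAX_LEVEL = {
    "view_dashboard": Level.MINIMAL,
    "create_reports": Level.DEGRADED,
    "export_data": Level.DEGRADED,
    "edit_settings": Level.DEGRADED,
    "invitations": Level.OPERATIONAL,
    "realtime_updates": Level.OPERATIONAL,
}

def feature_available(feature: str, current_level: Level) -> bool:
    """A feature is available while the system has not degraded past its max level."""
    return current_level <= FEATURE_MAX_LEVEL[feature]
```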
Feature Flags as Degradation Switches
Feature flags are usually discussed as deployment tools. They are also your most important reliability tool — the mechanism that lets you switch degradation levels at runtime without a deployment.
```typescript
// Feature flag client with degradation semantics
interface DegradationConfig {
realtimeUpdates: boolean;
liveDataFetch: boolean;
writesEnabled: boolean;
invitationsEnabled: boolean;
exportEnabled: boolean;
maxStaleDataAgeSeconds: number;
}
const DEGRADATION_DEFAULTS: DegradationConfig = {
realtimeUpdates: true,
liveDataFetch: true,
writesEnabled: true,
invitationsEnabled: true,
exportEnabled: true,
maxStaleDataAgeSeconds: 300,
};
class DegradationManager {
private flagClient: FeatureFlagClient;
private cache: Map<string, DegradationConfig> = new Map();
constructor(flagClient: FeatureFlagClient) {
this.flagClient = flagClient;
}
async getConfig(tenantId?: string): Promise<DegradationConfig> {
const key = tenantId ?? "__global__";
// If the flag service is unavailable, fall back to the last known config or defaults
try {
const flags = await this.flagClient.getAllFlags(tenantId);
const config: DegradationConfig = {
realtimeUpdates: flags.get("realtime-updates") ?? true,
liveDataFetch: flags.get("live-data-fetch") ?? true,
writesEnabled: flags.get("writes-enabled") ?? true,
invitationsEnabled: flags.get("invitations-enabled") ?? true,
exportEnabled: flags.get("export-enabled") ?? true,
maxStaleDataAgeSeconds: flags.get("max-stale-age-seconds") ?? 300,
};
this.cache.set(key, config);
return config;
} catch {
// Flag service down — return last known config or defaults
return this.cache.get(key) ?? DEGRADATION_DEFAULTS;
}
}
}
```

The critical property of this implementation: if the feature flag service itself is unavailable, the system falls back to defaults (full functionality) rather than failing or disabling everything. The degradation manager must not be a single point of failure.
```typescript
// Usage in API handler
async function getDashboardData(req: Request) {
const config = await degradationManager.getConfig(req.tenantId);
if (config.liveDataFetch) {
try {
const data = await fetchLiveData(req.tenantId);
return { data, source: "live", stale: false };
} catch (err) {
// Live fetch failed — fall through to stale data
logger.warn({ err }, "Live data fetch failed, falling back to cache");
}
}
// Stale data path
const cached = await getCachedDashboard(req.tenantId);
if (cached && Date.now() - cached.timestamp < config.maxStaleDataAgeSeconds * 1000) {
return { data: cached.data, source: "cache", stale: true, cachedAt: cached.timestamp };
}
// No usable cache — return empty state rather than error
return { data: emptyDashboardState(), source: "empty", stale: true };
}
```

Fallback Caches: The Most Underused Reliability Pattern
A fallback cache is a stale read cache that activates when the live source fails. It is different from a performance cache: a performance cache is warmed for speed; a fallback cache is specifically designed for failure scenarios.
The key properties of a good fallback cache:
- Written on every successful live fetch — not just cache misses
- Has a TTL longer than your longest expected outage — 24h, not 5 minutes
- Is queryable without the live system being available — separate storage
- Includes a staleness timestamp — so UI can communicate freshness to users
```python
# Fallback cache implementation
import json
import logging
import time
from dataclasses import dataclass, asdict
from typing import Any, Callable, Optional, TypeVar

import redis

logger = logging.getLogger(__name__)

T = TypeVar('T')
@dataclass
class CachedValue:
data: Any
cached_at: float # unix timestamp
cache_version: str
class FallbackCache:
def __init__(self, redis_client: redis.Redis, ttl_seconds: int = 86400):
self.redis = redis_client
self.ttl = ttl_seconds
def get(self, key: str) -> Optional[CachedValue]:
raw = self.redis.get(f"fallback:{key}")
if raw:
d = json.loads(raw)
return CachedValue(**d)
return None
def set(self, key: str, data: Any, version: str = "v1"):
value = CachedValue(data=data, cached_at=time.time(), cache_version=version)
self.redis.setex(
f"fallback:{key}",
self.ttl,
json.dumps(asdict(value))
)
def with_fallback(self, key: str, fetch_fn: Callable[[], T], version: str = "v1") -> tuple[T, bool]:
"""
Fetch live data, write to fallback cache on success.
On failure, return stale cached value.
Returns (data, is_stale).
"""
try:
data = fetch_fn()
self.set(key, data, version)
return data, False
except Exception as primary_error:
cached = self.get(key)
if cached:
age_minutes = (time.time() - cached.cached_at) / 60
logger.warning(
f"Primary fetch failed, serving {age_minutes:.1f}m stale cache",
exc_info=primary_error
)
return cached.data, True
raise # No cache available — propagate original error
# Usage
cache = FallbackCache(redis_client, ttl_seconds=86400)
def get_user_dashboard(user_id: str) -> DashboardData:
data, is_stale = cache.with_fallback(
key=f"dashboard:{user_id}",
fetch_fn=lambda: live_dashboard_service.fetch(user_id)
)
if is_stale:
data.banner = "Showing data from earlier today — live updates temporarily unavailable"
    return data
```

Circuit Breakers in the Degradation Context
Circuit breakers (covered in depth in the fan-out post) play a specific role in graceful degradation: they switch the system from "failing slowly" to "failing fast," which enables fallback paths to activate earlier.
```python
# Circuit breaker wired into degradation manager
from circuit_breaker import CircuitBreaker, CircuitOpenError
class DataService:
def __init__(self, db, cache: FallbackCache):
self.db = db
self.cache = cache
self.circuit = CircuitBreaker(
failure_threshold=5,
timeout=30.0,
name="database"
)
def get_dashboard(self, user_id: str) -> dict:
try:
data = self.circuit.call(self.db.query_dashboard, user_id)
self.cache.set(f"dashboard:{user_id}", data)
return {"data": data, "source": "live"}
except CircuitOpenError:
# Circuit is open — go straight to cache, do not attempt DB
cached = self.cache.get(f"dashboard:{user_id}")
if cached:
return {"data": cached.data, "source": "cache", "stale": True}
raise ServiceUnavailableError("Data temporarily unavailable")
except Exception:
# Other errors — circuit counts them
cached = self.cache.get(f"dashboard:{user_id}")
if cached:
return {"data": cached.data, "source": "cache", "stale": True}
            raise
```

Read-Only Mode: A First-Class Operational State
Read-only mode is not just "all writes return 503." It is a deliberate system state with clear semantics, user communication, and — critically — the ability to enter and exit atomically without a deployment.
```typescript
// Read-only mode middleware
import { Request, Response, NextFunction } from "express";
const WRITE_METHODS = new Set(["POST", "PUT", "PATCH", "DELETE"]);
const EXEMPT_PATHS = new Set(["/health", "/metrics", "/status"]);
async function readOnlyMiddleware(
req: Request,
res: Response,
next: NextFunction
): Promise<void> {
if (EXEMPT_PATHS.has(req.path)) {
return next();
}
const isReadOnly = await featureFlags.get("system-read-only-mode");
if (isReadOnly && WRITE_METHODS.has(req.method)) {
res.status(503).json({
error: "SERVICE_READ_ONLY",
message: "The system is temporarily in read-only mode. Please try again shortly.",
statusPage: "https://status.example.com",
retryAfter: 60,
});
return;
}
if (isReadOnly) {
// Annotate request so handlers can adjust their response
res.locals.isReadOnly = true;
}
next();
}
```

```tsx
// Client-side read-only handling
function DashboardPage() {
const { isReadOnly } = useSystemStatus();
return (
<div>
{isReadOnly && (
<Banner type="warning">
The system is in maintenance mode. You can view your data,
but changes are temporarily disabled.
</Banner>
)}
<DashboardContent readOnly={isReadOnly} />
</div>
);
}
```

Real Outage Examples: What Graceful Degradation Actually Saves You
Example 1: Payment provider timeout (2023, ~45-minute incident)
An e-commerce platform's payment provider started timing out. Without degradation: checkout fails, users cannot complete purchases, revenue impact is immediate and total.
With graceful degradation: the circuit breaker on the payment service opened after 5 failures. The checkout flow switched to "process order now, charge card async" mode — using a queue to hold payment attempts for retry when the provider recovered. Users saw "Your order is confirmed — payment processing." Zero sales lost during the 45-minute window. The payment provider recovered; queued payments processed successfully.
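A minimal sketch of that fallback, assuming a circuit breaker like the one shown earlier in this post and a durable queue for deferred charges (all names here are illustrative):

```python
# Sketch of the "confirm now, charge async" fallback from Example 1.
# payment_provider, payment_queue, and the order fields are illustrative stand-ins.
def checkout(order, payment_provider, payment_queue, circuit) -> dict:
    try:
        # Normal path: synchronous charge, guarded by the circuit breaker.
        charge = circuit.call(payment_provider.charge, order.payment_token, order.total)
        return {"order_id": order.id, "payment": "captured", "charge_id": charge.id}
    except Exception:
        # Circuit open or provider failing: accept the order, defer the charge
        # and let a background worker retry once the provider recovers.
        payment_queue.enqueue({
            "order_id": order.id,
            "payment_token": order.payment_token,
            "amount": order.total,
            "attempts": 0,
        })
        return {"order_id": order.id, "payment": "processing"}
```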
Example 2: Search service outage (Elasticsearch cluster, ~2 hours)
A product catalog service's Elasticsearch cluster lost quorum. Without degradation: search returns 500, product pages fail to load, users cannot find anything.
With graceful degradation: search requests fell back to PostgreSQL ILIKE queries — slower, less relevant, but functional. Category browsing remained fully functional from a separate cache. Users experienced slower search during the incident window. No support tickets about "broken" site — only a few complaints about slow search.
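A sketch of that degraded search path, assuming an Elasticsearch 8.x-style client and a psycopg2-style PostgreSQL connection (both illustrative):

```python
# Sketch of Example 2: Elasticsearch first, plain SQL ILIKE as the degraded path.
# es_client, pg_conn, and the products schema are illustrative assumptions.
def search_products(query: str, limit: int = 20) -> list[dict]:
    try:
        result = es_client.search(
            index="products",
            query={"match": {"name": query}},
            size=limit,
        )
        return [hit["_source"] for hit in result["hits"]["hits"]]
    except Exception:
        # Degraded path: slower and less relevant, but the site keeps working.
        with pg_conn.cursor() as cur:
            cur.execute(
                "SELECT id, name, price FROM products WHERE name ILIKE %s LIMIT %s",
                (f"%{query}%", limit),
            )
            return [
                {"id": row[0], "name": row[1], "price": row[2]}
                for row in cur.fetchall()
            ]
```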
Example 3: Configuration service cold start storm
After a deployment, all instances simultaneously tried to fetch configuration from a remote service that had not finished warming up. Without degradation: all instances crashed on startup (config fetch error treated as fatal), deployment failed.
With graceful degradation: instances used their last-known-good config from a local file written at the previous successful start. Deployment completed. Config service caught up within 30 seconds. No incident.
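The pattern behind Example 3 is simple to build in advance: persist the last successfully fetched config to local disk, and read it back when the remote fetch fails. A sketch, where fetch_remote_config and the snapshot path are illustrative assumptions:

```python
# Sketch of last-known-good config loading for Example 3.
# fetch_remote_config and SNAPSHOT_PATH are illustrative assumptions.
import json
import os

SNAPSHOT_PATH = "/var/lib/myapp/last_known_config.json"

def load_config() -> dict:
    try:
        config = fetch_remote_config(timeout=5)
        # Persist for the next cold start so a config-service outage is survivable.
        tmp_path = SNAPSHOT_PATH + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(config, f)
        os.replace(tmp_path, SNAPSHOT_PATH)  # atomic swap
        return config
    except Exception:
        if os.path.exists(SNAPSHOT_PATH):
            with open(SNAPSHOT_PATH) as f:
                return json.load(f)
        raise  # no snapshot yet: fail the very first start loudly
```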
Implementing a System Status Endpoint
Every system with graceful degradation needs a status endpoint that accurately reflects the current degradation level. Not just "healthy / unhealthy" — the actual capability level.
```python
# System status endpoint
from flask import jsonify
import time
@app.route("/status")
def system_status():
checks = {
"database": check_database(),
"cache": check_cache(),
"payment": check_payment_circuit(),
"search": check_search(),
"feature_flags": check_feature_flags(),
}
degradation_level = 0
if not checks["database"]["healthy"]:
degradation_level = max(degradation_level, 2)
if not checks["payment"]["healthy"]:
degradation_level = max(degradation_level, 1)
if not checks["search"]["healthy"]:
degradation_level = max(degradation_level, 1)
return jsonify({
"status": ["operational", "degraded", "minimal", "maintenance"][degradation_level],
"degradationLevel": degradation_level,
"checks": checks,
"capabilities": {
"reads": degradation_level < 3,
"writes": degradation_level < 2,
"search": checks["search"]["healthy"],
"payments": checks["payment"]["healthy"],
},
"timestamp": time.time(),
    }), 200 if degradation_level == 0 else 207
```

Return HTTP 207 (Multi-Status) for partially degraded systems. This is not a strictly standard use (207 comes from WebDAV), but it signals to load balancers that the instance is not fully healthy and is more useful than either 200 (a lie) or 503 (an overstatement).
Key Takeaways
- Graceful degradation requires defining a degradation hierarchy explicitly before an incident — enumerate every feature, assign it a degradation level, and decide what the system looks like at each level before you are paging someone at 3am.
- Feature flags are not just deployment tools; they are your runtime switches for degradation levels — the mechanism that lets you change system behaviour without a deployment when a dependency fails.
- Fallback caches should be written on every successful live fetch and given TTLs long enough to survive your longest expected outage window (hours, not minutes).
- Circuit breakers enable graceful degradation by switching from "failing slowly" to "failing fast" — this is what allows fallback paths to activate predictably rather than after timeout accumulation.
- Read-only mode is a first-class system state, not a 503 response — it needs explicit middleware, client-side UI adaptation, and the ability to enter and exit without a deployment.
- The `/status` endpoint should reflect actual capability levels with structured data, not just binary healthy/unhealthy — consumers (load balancers, status pages, monitoring) need to know what the system can and cannot do right now.