Designing for Graceful Degradation
The Binary Failure Model Is Wrong
Most systems are designed with an implicit assumption: either the system works, or it does not. Engineers build for the happy path. When something breaks, the system either errors out or — worse — silently returns bad data. Users see 500s. On-call engineers get paged. The incident begins.
This is the wrong mental model. Real systems degrade. A database goes slow before it goes down. An upstream API starts timing out on one endpoint before the whole service fails. A CDN saturates in one region before requests fail globally. The question is not whether your system will experience partial failure — it will — but whether the partial failure surfaces as a cliff edge or a staircase.
Graceful degradation means designing the staircase: a deliberate hierarchy of reduced functionality that serves users progressively less capability rather than zero capability. This requires building the degradation modes before the outage, not during it.
This post covers how to design degradation hierarchies, implement fallback caches, build read-only modes, and wire feature flags into your reliability architecture — with examples from real outages.
Degradation Hierarchies: Define Your Stairs
The first step is to enumerate your system's functionality in order of criticality. Not every feature is equally important. An e-commerce platform can survive with read-only product listings even if the cart is down. A dashboard application can show yesterday's data if today's pipeline is broken.
For every system I build, I define the degradation levels explicitly before launch: a normal operating level plus three progressively reduced levels of capability.
- Level 0 (operational): full functionality, the baseline experience.
- Level 1 (degraded): non-critical features disabled; core workflows still work, possibly on stale data.
- Level 2 (minimal): reads only, served from caches or pre-computed snapshots; writes disabled.
- Level 3 (maintenance): effectively offline; users see a clearly communicated maintenance state.
For a hypothetical SaaS analytics product, this might look like:
| Feature | Level 0 | Level 1 | Level 2 | Level 3 |
|---|---|---|---|---|
| View dashboard | Live data | Stale data (1h) | Pre-computed snapshot | No |
| Create reports | Yes | Yes | No | No |
| Export data | Yes | Queued | No | No |
| Edit settings | Yes | Yes | No | No |
| Invitations | Yes | No | No | No |
| Real-time updates | Yes | No | No | No |
Writing this table forces a conversation about what "broken" actually means. Level 1 is not failure — it is a degraded mode with a clearly communicated experience. Define it before the incident, not during.
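One way to keep the table honest is to encode it as data the application can query, so request handlers and the status endpoint share a single definition. A minimal sketch in Python, assuming illustrative feature names and the four levels described above:

```python
# Minimal sketch: the degradation table as data the application can query.
# Feature names and level assignments are illustrative, not a fixed schema.
from enum import IntEnum

class Level(IntEnum):
    OPERATIONAL = 0   # full functionality
    DEGRADED = 1      # non-critical features off, stale data allowed
    MINIMAL = 2       # reads from snapshots only, writes disabled
    MAINTENANCE = 3   # effectively offline

# Highest degradation level at which each feature is still available.
FEATURE_MAX_LEVEL = {
    "view_dashboard": Level.MINIMAL,
    "create_reports": Level.DEGRADED,
    "export_data": Level.DEGRADED,
    "edit_settings": Level.DEGRADED,
    "invitations": Level.OPERATIONAL,
    "realtime_updates": Level.OPERATIONAL,
}

def feature_available(feature: str, current_level: Level) -> bool:
    """A feature is available while the system has not degraded past its max level."""
    return current_level <= FEATURE_MAX_LEVEL[feature]
```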
Feature Flags as Degradation Switches
Feature flags are usually discussed as deployment tools. They are also your most important reliability tool — the mechanism that lets you switch degradation levels at runtime without a deployment.
```typescript
// Feature flag client with degradation semantics
interface DegradationConfig {
realtimeUpdates: boolean;
liveDataFetch: boolean;
writesEnabled: boolean;
invitationsEnabled: boolean;
exportEnabled: boolean;
maxStaleDataAgeSeconds: number;
}
const DEGRADATION_DEFAULTS: DegradationConfig = {
realtimeUpdates: true,
liveDataFetch: true,
writesEnabled: true,
invitationsEnabled: true,
exportEnabled: true,
maxStaleDataAgeSeconds: 300,
};
class DegradationManager {
private flagClient: FeatureFlagClient;
private cache: Map<string, DegradationConfig> = new Map();
constructor(flagClient: FeatureFlagClient) {
this.flagClient = flagClient;
}
async getConfig(tenantId?: string): Promise<DegradationConfig> {
const key = tenantId ?? "__global__";
// If the flag service is unavailable, fall back to the last known config or defaults
try {
const flags = await this.flagClient.getAllFlags(tenantId);
const config: DegradationConfig = {
realtimeUpdates: flags.get("realtime-updates") ?? true,
liveDataFetch: flags.get("live-data-fetch") ?? true,
writesEnabled: flags.get("writes-enabled") ?? true,
invitationsEnabled: flags.get("invitations-enabled") ?? true,
exportEnabled: flags.get("export-enabled") ?? true,
maxStaleDataAgeSeconds: flags.get("max-stale-age-seconds") ?? 300,
};
this.cache.set(key, config);
return config;
} catch {
// Flag service down — return last known config or defaults
return this.cache.get(key) ?? DEGRADATION_DEFAULTS;
}
}
}
```

The critical property of this implementation: if the feature flag service itself is unavailable, the system falls back to defaults (full functionality) rather than failing or disabling everything. The degradation manager must not be a single point of failure.
```typescript
// Usage in API handler
async function getDashboardData(req: Request) {
const config = await degradationManager.getConfig(req.tenantId);
if (config.liveDataFetch) {
try {
const data = await fetchLiveData(req.tenantId);
return { data, source: "live", stale: false };
} catch (err) {
// Live fetch failed — fall through to stale data
logger.warn({ err }, "Live data fetch failed, falling back to cache");
}
}
// Stale data path
const cached = await getCachedDashboard(req.tenantId);
if (cached && Date.now() - cached.timestamp < config.maxStaleDataAgeSeconds * 1000) {
return { data: cached.data, source: "cache", stale: true, cachedAt: cached.timestamp };
}
// No usable cache — return empty state rather than error
return { data: emptyDashboardState(), source: "empty", stale: true };
}
```

Fallback Caches: The Most Underused Reliability Pattern
A fallback cache is a stale read cache that activates when the live source fails. It is different from a performance cache: a performance cache is warmed for speed; a fallback cache is specifically designed for failure scenarios.
The key properties of a good fallback cache:
- Written on every successful live fetch — not just cache misses
- Has a TTL longer than your longest expected outage — 24h, not 5 minutes
- Is queryable without the live system being available — separate storage
- Includes a staleness timestamp — so UI can communicate freshness to users
```python
# Fallback cache implementation
import json
import logging
import time
from dataclasses import dataclass, asdict
from typing import Any, Callable, Optional, TypeVar

import redis

logger = logging.getLogger(__name__)

T = TypeVar('T')
@dataclass
class CachedValue:
data: Any
cached_at: float # unix timestamp
cache_version: str
class FallbackCache:
def __init__(self, redis_client: redis.Redis, ttl_seconds: int = 86400):
self.redis = redis_client
self.ttl = ttl_seconds
def get(self, key: str) -> Optional[CachedValue]:
raw = self.redis.get(f"fallback:{key}")
if raw:
d = json.loads(raw)
return CachedValue(**d)
return None
def set(self, key: str, data: Any, version: str = "v1"):
value = CachedValue(data=data, cached_at=time.time(), cache_version=version)
self.redis.setex(
f"fallback:{key}",
self.ttl,
json.dumps(asdict(value))
)
def with_fallback(self, key: str, fetch_fn: Callable[[], T], version: str = "v1") -> tuple[T, bool]:
"""
Fetch live data, write to fallback cache on success.
On failure, return stale cached value.
Returns (data, is_stale).
"""
try:
data = fetch_fn()
self.set(key, data, version)
return data, False
except Exception as primary_error:
cached = self.get(key)
if cached:
age_minutes = (time.time() - cached.cached_at) / 60
logger.warning(
f"Primary fetch failed, serving {age_minutes:.1f}m stale cache",
exc_info=primary_error
)
return cached.data, True
raise # No cache available — propagate original error
# Usage
cache = FallbackCache(redis_client, ttl_seconds=86400)
def get_user_dashboard(user_id: str) -> DashboardData:
data, is_stale = cache.with_fallback(
key=f"dashboard:{user_id}",
fetch_fn=lambda: live_dashboard_service.fetch(user_id)
)
if is_stale:
data.banner = "Showing data from earlier today — live updates temporarily unavailable"
    return data
```

Circuit Breakers in the Degradation Context
Circuit breakers (covered in depth in the fan-out post) play a specific role in graceful degradation: they switch the system from "failing slowly" to "failing fast," which enables fallback paths to activate earlier.
```python
# Circuit breaker wired into degradation manager
from circuit_breaker import CircuitBreaker, CircuitOpenError
class DataService:
def __init__(self, db, cache: FallbackCache):
self.db = db
self.cache = cache
self.circuit = CircuitBreaker(
failure_threshold=5,
timeout=30.0,
name="database"
)
def get_dashboard(self, user_id: str) -> dict:
try:
data = self.circuit.call(self.db.query_dashboard, user_id)
self.cache.set(f"dashboard:{user_id}", data)
return {"data": data, "source": "live"}
except CircuitOpenError:
# Circuit is open — go straight to cache, do not attempt DB
cached = self.cache.get(f"dashboard:{user_id}")
if cached:
return {"data": cached.data, "source": "cache", "stale": True}
raise ServiceUnavailableError("Data temporarily unavailable")
except Exception:
# Other errors — circuit counts them
cached = self.cache.get(f"dashboard:{user_id}")
if cached:
return {"data": cached.data, "source": "cache", "stale": True}
            raise
```

Read-Only Mode: A First-Class Operational State
Read-only mode is not just "all writes return 503." It is a deliberate system state with clear semantics, user communication, and — critically — the ability to enter and exit atomically without a deployment.
```typescript
// Read-only mode middleware
import { Request, Response, NextFunction } from "express";
const WRITE_METHODS = new Set(["POST", "PUT", "PATCH", "DELETE"]);
const EXEMPT_PATHS = new Set(["/health", "/metrics", "/status"]);
async function readOnlyMiddleware(
req: Request,
res: Response,
next: NextFunction
): Promise<void> {
if (EXEMPT_PATHS.has(req.path)) {
return next();
}
const isReadOnly = await featureFlags.get("system-read-only-mode");
if (isReadOnly && WRITE_METHODS.has(req.method)) {
res.status(503).json({
error: "SERVICE_READ_ONLY",
message: "The system is temporarily in read-only mode. Please try again shortly.",
statusPage: "https://status.example.com",
retryAfter: 60,
});
return;
}
if (isReadOnly) {
// Annotate request so handlers can adjust their response
res.locals.isReadOnly = true;
}
next();
}
```

```tsx
// Client-side read-only handling
function DashboardPage() {
const { isReadOnly } = useSystemStatus();
return (
<div>
{isReadOnly && (
<Banner type="warning">
The system is in maintenance mode. You can view your data,
but changes are temporarily disabled.
</Banner>
)}
<DashboardContent readOnly={isReadOnly} />
</div>
);
}
```

Real Outage Examples: What Graceful Degradation Actually Saves You
Example 1: Payment provider timeout (2023, ~45-minute incident)
An e-commerce platform's payment provider started timing out. Without degradation: checkout fails, users cannot complete purchases, revenue impact is immediate and total.
With graceful degradation: the circuit breaker on the payment service opened after 5 failures. The checkout flow switched to "process order now, charge card async" mode — using a queue to hold payment attempts for retry when the provider recovered. Users saw "Your order is confirmed — payment processing." Zero sales lost during the 45-minute window. The payment provider recovered; queued payments processed successfully.
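A minimal sketch of that fallback, assuming a circuit breaker like the one shown earlier in this post and a durable queue for deferred charges (all names here are illustrative):

```python
# Sketch of the "confirm now, charge async" fallback from Example 1.
# payment_provider, payment_queue, and the order fields are illustrative stand-ins.
def checkout(order, payment_provider, payment_queue, circuit) -> dict:
    try:
        # Normal path: synchronous charge, guarded by the circuit breaker.
        charge = circuit.call(payment_provider.charge, order.payment_token, order.total)
        return {"order_id": order.id, "payment": "captured", "charge_id": charge.id}
    except Exception:
        # Circuit open or provider failing: accept the order, defer the charge
        # and let a background worker retry once the provider recovers.
        payment_queue.enqueue({
            "order_id": order.id,
            "payment_token": order.payment_token,
            "amount": order.total,
            "attempts": 0,
        })
        return {"order_id": order.id, "payment": "processing"}
```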
Example 2: Search service outage (Elasticsearch cluster, ~2 hours)
A product catalog service's Elasticsearch cluster lost quorum. Without degradation: search returns 500, product pages fail to load, users cannot find anything.
With graceful degradation: search requests fell back to PostgreSQL ILIKE queries — slower, less relevant, but functional. Category browsing remained fully functional from a separate cache. Users experienced slower search during the incident window. No support tickets about "broken" site — only a few complaints about slow search.
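A sketch of that degraded search path, assuming an Elasticsearch 8.x-style client and a psycopg2-style PostgreSQL connection (both illustrative):

```python
# Sketch of Example 2: Elasticsearch first, plain SQL ILIKE as the degraded path.
# es_client, pg_conn, and the products schema are illustrative assumptions.
def search_products(query: str, limit: int = 20) -> list[dict]:
    try:
        result = es_client.search(
            index="products",
            query={"match": {"name": query}},
            size=limit,
        )
        return [hit["_source"] for hit in result["hits"]["hits"]]
    except Exception:
        # Degraded path: slower and less relevant, but the site keeps working.
        with pg_conn.cursor() as cur:
            cur.execute(
                "SELECT id, name, price FROM products WHERE name ILIKE %s LIMIT %s",
                (f"%{query}%", limit),
            )
            return [
                {"id": row[0], "name": row[1], "price": row[2]}
                for row in cur.fetchall()
            ]
```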
Example 3: Configuration service cold start storm
After a deployment, all instances simultaneously tried to fetch configuration from a remote service that had not finished warming up. Without degradation: all instances crashed on startup (config fetch error treated as fatal), deployment failed.
With graceful degradation: instances used their last-known-good config from a local file written at the previous successful start. Deployment completed. Config service caught up within 30 seconds. No incident.
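The pattern behind Example 3 is simple to build in advance: persist the last successfully fetched config to local disk, and read it back when the remote fetch fails. A sketch, where fetch_remote_config and the snapshot path are illustrative assumptions:

```python
# Sketch of last-known-good config loading for Example 3.
# fetch_remote_config and SNAPSHOT_PATH are illustrative assumptions.
import json
import os

SNAPSHOT_PATH = "/var/lib/myapp/last_known_config.json"

def load_config() -> dict:
    try:
        config = fetch_remote_config(timeout=5)
        # Persist for the next cold start so a config-service outage is survivable.
        tmp_path = SNAPSHOT_PATH + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(config, f)
        os.replace(tmp_path, SNAPSHOT_PATH)  # atomic swap
        return config
    except Exception:
        if os.path.exists(SNAPSHOT_PATH):
            with open(SNAPSHOT_PATH) as f:
                return json.load(f)
        raise  # no snapshot yet: fail the very first start loudly
```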
Implementing a System Status Endpoint
Every system with graceful degradation needs a status endpoint that accurately reflects the current degradation level. Not just "healthy / unhealthy" — the actual capability level.
```python
# System status endpoint
from flask import jsonify
import time
@app.route("/status")
def system_status():
checks = {
"database": check_database(),
"cache": check_cache(),
"payment": check_payment_circuit(),
"search": check_search(),
"feature_flags": check_feature_flags(),
}
degradation_level = 0
if not checks["database"]["healthy"]:
degradation_level = max(degradation_level, 2)
if not checks["payment"]["healthy"]:
degradation_level = max(degradation_level, 1)
if not checks["search"]["healthy"]:
degradation_level = max(degradation_level, 1)
return jsonify({
"status": ["operational", "degraded", "minimal", "maintenance"][degradation_level],
"degradationLevel": degradation_level,
"checks": checks,
"capabilities": {
"reads": degradation_level < 3,
"writes": degradation_level < 2,
"search": checks["search"]["healthy"],
"payments": checks["payment"]["healthy"],
},
"timestamp": time.time(),
    }), 200 if degradation_level == 0 else 207
```

Return HTTP 207 (Multi-Status) for partially degraded systems. This is not a strictly standard use (207 comes from WebDAV), but it signals to load balancers that the instance is not fully healthy and is more useful than either 200 (a lie) or 503 (an overstatement).
Key Takeaways
- Graceful degradation requires defining a degradation hierarchy explicitly before an incident — enumerate every feature, assign it a degradation level, and decide what the system looks like at each level before you are paging someone at 3am.
- Feature flags are not just deployment tools; they are your runtime switches for degradation levels — the mechanism that lets you change system behaviour without a deployment when a dependency fails.
- Fallback caches should be written on every successful live fetch and given TTLs long enough to survive your longest expected outage window (hours, not minutes).
- Circuit breakers enable graceful degradation by switching from "failing slowly" to "failing fast" — this is what allows fallback paths to activate predictably rather than after timeout accumulation.
- Read-only mode is a first-class system state, not a 503 response — it needs explicit middleware, client-side UI adaptation, and the ability to enter and exit without a deployment.
- The `/status` endpoint should reflect actual capability levels with structured data, not just binary healthy/unhealthy — consumers (load balancers, status pages, monitoring) need to know what the system can and cannot do right now.