API Design Mastery

SDK Ergonomics

Ravinder · 5 min read
Tags: API Design, REST, GraphQL, gRPC, SDK Design

A well-designed API surface can still be painful to use if the SDK wrapping it is an afterthought. Most generated clients are just thin HTTP wrappers — they handle serialization but leave retry logic, observability, timeout handling, and authentication plumbing to each integrator. Every developer writes that boilerplate themselves, slightly differently, with slightly different bugs. An ergonomic SDK absorbs that complexity once and frees integrators to focus on their own problems.

Generated vs Handwritten SDKs

The generated-versus-handwritten debate is mostly a false dichotomy. The right answer is generated-from-spec with handwritten ergonomics layered on top.

flowchart TD
  A[OpenAPI / Proto Spec] --> B[Code Generator]
  B --> C[Generated Layer\nModels, HTTP calls, serialization]
  C --> D[Ergonomics Layer\nRetries, auth, observability, helpers]
  D --> E[Published SDK]
  E --> F[Developer Code]

The generated layer handles:

  • Request/response serialization
  • Model types with proper nullability
  • Endpoint methods matching the API surface

The ergonomics layer (handwritten) handles:

  • Authentication injection
  • Retry with backoff
  • Timeout configuration
  • Logging and tracing hooks
  • Pagination helpers
  • Convenience constructors

Keep the generated layer thin and rarely touched. Put the developer experience investment in the ergonomics layer.
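The split between the two layers can be sketched roughly as follows. This is a minimal illustration, not a prescribed structure: `GeneratedOrdersApi`, `OrdersClient`, `Transport`, the `/orders/{id}` path, and the Bearer auth scheme are all hypothetical names and assumptions chosen for the example.

```typescript
// Hypothetical model type; a real one would come from the spec.
interface Order {
  id: string;
  total: number;
}

type HeaderMap = Record<string, string>;
// Assumed transport signature: performs the HTTP call and returns the parsed body.
type Transport = (path: string, headers: HeaderMap) => Promise<unknown>;

// Generated layer: one method per endpoint, no policy decisions.
class GeneratedOrdersApi {
  private readonly send: Transport;

  constructor(send: Transport) {
    this.send = send;
  }

  async getOrder(id: string, headers: HeaderMap): Promise<Order> {
    const body = await this.send(`/orders/${encodeURIComponent(id)}`, headers);
    return body as Order;
  }
}

// Ergonomics layer: handwritten, owns auth injection (and, in a fuller
// version, retries, timeouts, and hooks) so callers never build headers.
class OrdersClient {
  private readonly api: GeneratedOrdersApi;
  private readonly apiKey: string;

  constructor(api: GeneratedOrdersApi, apiKey: string) {
    this.api = api;
    this.apiKey = apiKey;
  }

  retrieve(id: string): Promise<Order> {
    return this.api.getOrder(id, { Authorization: `Bearer ${this.apiKey}` });
  }
}
```

Because the handwritten class only calls the generated one, regenerating from an updated spec should leave the ergonomics code untouched unless an endpoint signature actually changes.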

Retry Logic

Bad retry logic is worse than no retry logic. An SDK that retries unconditionally on 500 can amplify load on a struggling server. Get the contract right:

interface RetryConfig {
  maxAttempts: number;     // default: 3
  initialDelay: number;    // default: 500ms
  maxDelay: number;        // default: 30000ms
  multiplier: number;      // default: 2
  jitter: boolean;         // default: true
  retryableStatusCodes: number[];  // [408, 429, 500, 502, 503, 504]
  retryableErrors: string[];       // ['ECONNRESET', 'ETIMEDOUT']
}
 
async function withRetry<T>(
  fn: () => Promise<T>,
  config: RetryConfig
): Promise<T> {
  let attempt = 0;
  while (true) {
    try {
      return await fn();
    } catch (err) {
      if (!isRetryable(err, config) || attempt >= config.maxAttempts - 1) {
        throw err;
      }
      const delay = computeBackoff(attempt, config);
      await sleep(delay);
      attempt++;
    }
  }
}
 
function computeBackoff(attempt: number, config: RetryConfig): number {
  const base = config.initialDelay * Math.pow(config.multiplier, attempt);
  const capped = Math.min(base, config.maxDelay);
  return config.jitter ? capped * (0.5 + Math.random() * 0.5) : capped;
}

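Never retry 4xx errors except 408 (Request Timeout) and 429 (Too Many Requests), and on 429 wait at least as long as the Retry-After header asks. A POST is only safe to retry when it carries an idempotency key. Generate that key once per logical request and send the same key on every attempt; a new key per attempt gives the server nothing to deduplicate on, so a retry can create a duplicate resource.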
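The retry loop above relies on helpers such as `isRetryable` and `sleep` that are not shown. A minimal sketch of them, plus Retry-After parsing, might look like the following. The `HttpError` shape and the `postWithRetry` helper are assumptions made for illustration. Note that the idempotency key is created once per logical request and sent unchanged on every attempt, since a key that changes between attempts gives the server nothing to deduplicate on.

```typescript
// Hypothetical error shape; a real SDK would define its own error classes.
interface HttpError {
  status?: number;     // HTTP status, if a response arrived
  code?: string;       // network error code, e.g. "ECONNRESET"
  retryAfter?: string; // raw Retry-After header value, if present
}

const RETRYABLE_STATUS = [408, 429, 500, 502, 503, 504];
const RETRYABLE_CODES = ["ECONNRESET", "ETIMEDOUT"];

function isRetryable(err: HttpError): boolean {
  if (err.status !== undefined) return RETRYABLE_STATUS.includes(err.status);
  return err.code !== undefined && RETRYABLE_CODES.includes(err.code);
}

// Retry-After is either a number of seconds or an HTTP date.
function retryAfterMs(header: string | undefined, now: number = Date.now()): number | undefined {
  if (header === undefined) return undefined;
  const value = header.trim();
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  const date = Date.parse(value);
  return Number.isNaN(date) ? undefined : Math.max(0, date - now);
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Retries a POST-like call. `newKey` is injected so the sketch does not
// depend on any particular UUID library.
async function postWithRetry<T>(
  send: (idempotencyKey: string) => Promise<T>,
  newKey: () => string,
  maxAttempts: number = 3,
): Promise<T> {
  const key = newKey(); // generated once, reused on every attempt
  for (let attempt = 0; ; attempt++) {
    try {
      return await send(key);
    } catch (err) {
      const e = err as HttpError;
      if (!isRetryable(e) || attempt >= maxAttempts - 1) throw err;
      // Prefer the server's hint; otherwise fall back to simple exponential backoff.
      await sleep(retryAfterMs(e.retryAfter) ?? 500 * 2 ** attempt);
    }
  }
}
```

In a full SDK the fallback delay would come from `computeBackoff` with jitter rather than the fixed formula used here.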

Observability Hooks

An SDK that developers can instrument is dramatically easier to debug in production:

interface SDKHooks {
  onRequest?: (ctx: RequestContext) => void;
  onResponse?: (ctx: ResponseContext) => void;
  onError?: (ctx: ErrorContext) => void;
  onRetry?: (ctx: RetryContext) => void;
}
 
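// Usage: wire into OpenTelemetry
import { trace, SpanStatusCode, type Span } from "@opentelemetry/api";

const tracer = trace.getTracer("my-app"); // tracer name is illustrative

// Simplified: one shared span is only correct with a single request in flight.
// Concurrent calls need one span per request, keyed by something on the context.
let span: Span;
const client = new ApiClient({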
  apiKey: process.env.API_KEY,
  hooks: {
    onRequest: (ctx) => {
      span = tracer.startSpan(`api.${ctx.method} ${ctx.path}`);
      span.setAttributes({ "http.method": ctx.method, "http.url": ctx.url });
    },
    onResponse: (ctx) => {
      span.setAttributes({ "http.status_code": ctx.status });
      span.end();
    },
    onError: (ctx) => {
      span.recordException(ctx.error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      span.end();
    }
  }
});

Expose hooks, not hard-coded logging. Different users want different observability backends — OpenTelemetry, Datadog, custom logging. Hooks let them integrate anything without forking the SDK.
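On the SDK side, invoking hooks could look roughly like the sketch below. The context field names (`method`, `path`, `status`, `durationMs`, `error`) and the `instrumented` and `safeCall` helpers are assumptions, and the hooks interface is a trimmed copy of the one above. One design choice worth making explicit: a hook that throws should not fail the API call it is observing.

```typescript
// Assumed context shapes; the article does not define their fields.
interface RequestContext { method: string; path: string }
interface ResponseContext extends RequestContext { status: number; durationMs: number }
interface ErrorContext extends RequestContext { error: unknown }

interface SDKHooks {
  onRequest?: (ctx: RequestContext) => void;
  onResponse?: (ctx: ResponseContext) => void;
  onError?: (ctx: ErrorContext) => void;
}

// Run a hook without letting a bug in user code break the request.
function safeCall<C>(hook: ((ctx: C) => void) | undefined, ctx: C): void {
  try {
    hook?.(ctx);
  } catch {
    // Deliberately ignored: observability must never change request outcomes.
  }
}

// Wraps a single call with request/response/error hook invocations.
async function instrumented(
  hooks: SDKHooks,
  req: RequestContext,
  send: () => Promise<{ status: number }>,
): Promise<{ status: number }> {
  const started = Date.now();
  safeCall(hooks.onRequest, req);
  try {
    const res = await send();
    safeCall(hooks.onResponse, { ...req, status: res.status, durationMs: Date.now() - started });
    return res;
  } catch (error) {
    safeCall(hooks.onError, { ...req, error });
    throw error; // the caller still sees the original failure
  }
}
```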

Pagination as First-Class Ergonomics

Raw pagination APIs return a cursor and let the developer write the loop. Good SDKs abstract this:

// Raw — developer writes the loop
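let cursor: string | undefined;
let hasMore = true;
while (hasMore) {
  // `page` is scoped to the loop body, so loop state lives in `hasMore`
  const page = await client.orders.list({ limit: 50, cursor });
  processOrders(page.data);
  cursor = page.pagination.nextCursor;
  hasMore = page.pagination.hasMore;
}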
 
// Ergonomic — async iterator
for await (const order of client.orders.listAll({ limit: 50 })) {
  processOrder(order);
}
 
// Or: collect all into array (use cautiously on large datasets)
const orders = await client.orders.listAll({ limit: 50 }).toArray();

The iterator implementation handles the cursor management and stops cleanly when hasMore is false. This is the difference between "the API has pagination" and "the SDK handles pagination for you."
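One way to build such an iterator is an async generator that wraps the raw list call. The sketch below assumes the page shape used in the raw example (`data`, `pagination.hasMore`, `pagination.nextCursor`). `ListFn`, the free-standing `listAll`, and `toArray` are hypothetical helpers written as plain functions rather than the method forms shown above.

```typescript
interface Page<T> {
  data: T[];
  pagination: { hasMore: boolean; nextCursor?: string };
}

// Assumed signature of the raw, single-page list call.
type ListFn<T> = (params: { limit: number; cursor?: string }) => Promise<Page<T>>;

// Yields items one by one, fetching the next page only when needed.
async function* listAll<T>(list: ListFn<T>, limit: number = 50): AsyncGenerator<T> {
  let cursor: string | undefined;
  while (true) {
    const page = await list({ limit, cursor });
    for (const item of page.data) {
      yield item;
    }
    // Stop on the last page, or if the server omits the next cursor.
    if (!page.pagination.hasMore || page.pagination.nextCursor === undefined) return;
    cursor = page.pagination.nextCursor;
  }
}

// Collects everything into memory; use cautiously on large datasets.
async function toArray<T>(items: AsyncIterable<T>): Promise<T[]> {
  const out: T[] = [];
  for await (const item of items) out.push(item);
  return out;
}
```

Because the generator is lazy, a caller that breaks out of the `for await` loop early never triggers requests for the remaining pages.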

Breaking Changes in SDKs

SDKs have their own breaking-change surface on top of the API's. In a typed language, breaking changes include:

  • Renaming a method or class
  • Adding a required parameter
  • Changing a return type
  • Removing an optional field from a model
  • Changing an enum value

Follow semantic versioning strictly. A major version bump (1.x.x → 2.x.x) signals breaking changes. In the release notes, be explicit:

## Breaking Changes in v2.0.0
 
- `client.orders.get(id)` renamed to `client.orders.retrieve(id)`
- `Order.customerName` (deprecated since v1.8) removed; use `Order.customer.name`
- `listOrders()` now returns `AsyncIterable<Order>` instead of `Page<Order>`
 
## Migration Guide
 
Replace all calls to `get()` with `retrieve()`:
```diff
- const order = await client.orders.get("ord_123");
+ const order = await client.orders.retrieve("ord_123");
```

Provide codemods (AST transforms) for large migrations when possible.
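Before a rename like `get()` to `retrieve()` ships in a major version, the old name can live on as a deprecated alias in the final 1.x releases, much as `Order.customerName` was deprecated before its removal. A minimal sketch, assuming a hypothetical `OrdersResource` class and an injectable `warn` function:

```typescript
class OrdersResource {
  private warned = false;
  private readonly warn: (message: string) => void;

  constructor(warn: (message: string) => void = console.warn) {
    this.warn = warn;
  }

  async retrieve(id: string): Promise<{ id: string }> {
    return { id }; // stand-in for the real HTTP call
  }

  /** @deprecated Use retrieve() instead; scheduled for removal in v2.0.0. */
  get(id: string): Promise<{ id: string }> {
    // Warn once per instance so logs are not flooded.
    if (!this.warned) {
      this.warned = true;
      this.warn("orders.get() is deprecated; use orders.retrieve()");
    }
    return this.retrieve(id);
  }
}
```

The `@deprecated` JSDoc tag makes most editors strike through the old method, so integrators see the migration path before the breaking release lands.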

Configuration Ergonomics

Sensible defaults with explicit override points:

const client = new ApiClient({
  apiKey: process.env.API_KEY,       // required
  baseUrl: "https://api.example.com", // override for staging
  timeout: 30_000,                   // 30s default
  retry: { maxAttempts: 3 },         // configurable
  userAgent: "my-app/1.2.3",         // appended to default UA
});

Read API_KEY and BASE_URL from environment variables by convention when they are not passed explicitly. Never hardcode a default timeout below 10 seconds for network calls; otherwise developers discover the limit in production, when a slow but healthy request gets cut off.
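The resolution order (explicit option, then environment variable, then built-in default) can be sketched as below. `resolveConfig`, `ClientOptions`, and `ResolvedConfig` are hypothetical names. The environment is passed in as a plain object so the sketch stays runtime-agnostic; in Node the caller would typically pass `process.env`.

```typescript
interface ClientOptions {
  apiKey?: string;
  baseUrl?: string;
  timeout?: number; // milliseconds
}

interface ResolvedConfig {
  apiKey: string;
  baseUrl: string;
  timeout: number;
}

type Env = Record<string, string | undefined>;

// Explicit options win, then environment variables, then defaults.
function resolveConfig(options: ClientOptions, env: Env): ResolvedConfig {
  const apiKey = options.apiKey ?? env.API_KEY;
  if (!apiKey) {
    // Fail at construction time rather than on the first request.
    throw new Error("Missing API key: pass apiKey or set API_KEY");
  }
  return {
    apiKey,
    baseUrl: options.baseUrl ?? env.BASE_URL ?? "https://api.example.com",
    timeout: options.timeout ?? 30_000,
  };
}
```

Failing fast on a missing key tends to produce a clearer error than a 401 surfacing from the first API call.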

Key Takeaways

  • Generate the model and HTTP layer from your OpenAPI or Proto spec, but invest handwritten effort in the ergonomics layer: retries, auth, observability, and pagination helpers.
  • Retry only on 408, 429, and transient 5xx errors (500, 502, 503, 504); honor Retry-After on 429; never retry a non-idempotent request unless it carries an idempotency key that stays the same across attempts.
  • Expose hooks for request/response/error/retry events rather than hard-coded logging so users can plug in any observability backend.
  • Abstract pagination behind async iterators — hiding cursor management is one of the highest-leverage ergonomic improvements an SDK can make.
  • Treat SDK method signatures with the same breaking-change discipline as your API surface; use semantic versioning and publish explicit migration guides with codemods for major versions.
  • Read credentials and base URLs from environment variables by convention; keep default timeouts at 10 seconds or more (the configuration example uses 30 seconds) to survive real network conditions.