Service Catalogs

Ravinder · 6 min read

"Who owns this?" is the question that should take three seconds and a search. In most organisations it takes three Slack threads, two incorrect answers, and a half-hour of archaeology. The service catalog exists to make that question trivially answerable — and then to answer a hundred adjacent questions the same way.

This is not glamorous work. The catalog has no flashy demo, no impressive latency chart. But the first time an incident war room spends 45 minutes trying to figure out who owns a failing downstream service, you'll understand exactly what the absence of a catalog costs.

What Belongs in a Catalog

A catalog is a registry of software entities and their relationships. At minimum:

| Entity kind | What it represents | Key fields |
|-------------|--------------------|------------|
| Component | A deployable unit (service, library, website) | owner, type, lifecycle, dependencies |
| API | An interface published by a component | spec (OpenAPI/AsyncAPI), consumers |
| System | A logical grouping of components | domain |
| Resource | External infrastructure (database, queue, S3 bucket) | component owner |
| Group | A team | members, parent |
| User | An individual | member of groups |

The relationships between these are what make the catalog valuable. A component depends on another component. A component provides an API. A system contains components. These edges are the dependency graph.

graph TD
  G1[payments-team] -->|owns| C1[payments-api]
  G1 -->|owns| C2[fraud-detector]
  C1 -->|provides| A1[PaymentsAPI v2]
  C1 -->|depends on| C2
  C1 -->|depends on| R1[(payments-db)]
  C3[checkout-service] -->|consumes| A1
  G2[checkout-team] -->|owns| C3

That graph is what incident responders need in the first two minutes of an outage. It's also what architects need when planning migrations. It's what security teams need for blast-radius analysis. One data model, many consumers.
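One way to picture the data model is as a list of typed edges. The sketch below is illustrative only: `Edge`, `owner_of`, and `direct_dependencies` are hypothetical names, not part of any catalog backend, and the edges simply mirror the example diagram above.

```python
from typing import NamedTuple, Optional


class Edge(NamedTuple):
    source: str
    relation: str
    target: str


# Edges copied from the example graph in this post.
EDGES = [
    Edge("payments-team", "owns", "payments-api"),
    Edge("payments-team", "owns", "fraud-detector"),
    Edge("payments-api", "provides", "PaymentsAPI v2"),
    Edge("payments-api", "depends on", "fraud-detector"),
    Edge("payments-api", "depends on", "payments-db"),
    Edge("checkout-service", "consumes", "PaymentsAPI v2"),
    Edge("checkout-team", "owns", "checkout-service"),
]


def owner_of(entity: str, edges: list) -> Optional[str]:
    """Return the group that owns the entity, or None if no 'owns' edge exists."""
    for edge in edges:
        if edge.relation == "owns" and edge.target == entity:
            return edge.source
    return None


def direct_dependencies(entity: str, edges: list) -> list:
    """Return the entities this one depends on, in edge order."""
    return [e.target for e in edges
            if e.relation == "depends on" and e.source == entity]


print(owner_of("fraud-detector", EDGES))
print(direct_dependencies("payments-api", EDGES))
```

Note that `payments-db` has no owner edge in the example graph, so `owner_of` returns `None` for it. That is exactly the kind of gap a real catalog should surface.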

catalog-info.yaml: The Source of Truth

The practical implementation: every service repo gets a catalog-info.yaml at the root. This file is the service's identity document. It gets parsed by Backstage (or whatever catalog backend you use) and joined with data from other sources.

# catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: "Payment processing service — authorises, captures, and refunds"
  annotations:
    github.com/project-slug: my-org/payments-api
    pagerduty.com/service-id: PXXXXX
    grafana.com/dashboard-url: https://grafana.my-org.com/d/payments
    runbooks.my-org.com/url: https://wiki.my-org.com/runbooks/payments-api
  tags:
    - payments
    - pci-dss
    - critical
spec:
  type: service
  lifecycle: production
  owner: payments-team
  system: payments-platform
  dependsOn:
    - component:fraud-detector
    - resource:payments-db
  providesApis:
    - payments-api-v2

The annotations are where the catalog becomes a dashboard hub. Instead of bookmarking 15 URLs, engineers start in the catalog and click through to the dashboard, runbook, or on-call schedule.
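To make the "dashboard hub" idea concrete, here is a rough sketch of how a catalog front end might turn annotations into click-through links. It assumes the YAML above has already been parsed into a dict (file loading is omitted); `LINK_LABELS` and `quick_links` are hypothetical names, and a real Backstage instance would typically do this through plugins instead.

```python
# Hypothetical mapping from URL-valued annotation keys to link labels.
LINK_LABELS = {
    "grafana.com/dashboard-url": "Dashboard",
    "runbooks.my-org.com/url": "Runbook",
}


def quick_links(entity: dict) -> dict:
    """Collect click-through links from an entity's annotations."""
    annotations = entity.get("metadata", {}).get("annotations", {})
    links = {}
    slug = annotations.get("github.com/project-slug")
    if slug:
        # A project slug is "<org>/<repo>", so the repository URL follows directly.
        links["Repository"] = f"https://github.com/{slug}"
    for key, label in LINK_LABELS.items():
        url = annotations.get(key)
        if url:
            links[label] = url
    return links


# Parsed form of the annotations from the example catalog-info.yaml.
entity = {
    "metadata": {
        "name": "payments-api",
        "annotations": {
            "github.com/project-slug": "my-org/payments-api",
            "pagerduty.com/service-id": "PXXXXX",
            "grafana.com/dashboard-url": "https://grafana.my-org.com/d/payments",
            "runbooks.my-org.com/url": "https://wiki.my-org.com/runbooks/payments-api",
        },
    }
}

for label, url in quick_links(entity).items():
    print(f"{label}: {url}")
```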

Ownership: The Hard Part

Getting the YAML written is easy. Keeping the ownership data accurate is not.

Ownership goes stale in predictable ways:

  • Team reorganisations that don't propagate to catalog-info.yaml
  • Services that outlive the teams that built them
  • "Orphaned" services where the owner group no longer exists

Mitigation strategies:

Validate ownership in CI. If the team listed in owner doesn't exist in the catalog, the pull request fails. This catches stale ownership at the point of change.

# .github/workflows/catalog-validate.yml
name: Validate catalog-info
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate catalog-info.yaml
        run: |
          python scripts/validate_catalog.py \
            --catalog-info catalog-info.yaml \
            --known-groups groups.yaml \
            --fail-on-unknown-owner
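The workflow calls scripts/validate_catalog.py, which this post does not show. A minimal sketch of the core check it might perform follows. It assumes both YAML files have already been parsed (file loading and command-line flags are left out) and that groups.yaml reduces to a set of group names. The function name and the sample entities `legacy-billing` and `billing-core` are made up for illustration.

```python
def find_owner_errors(entity: dict, known_groups: set) -> list:
    """Return human-readable problems with the entity's spec.owner field."""
    name = entity.get("metadata", {}).get("name", "<unnamed>")
    owner = entity.get("spec", {}).get("owner")
    if not owner:
        return [f"{name}: spec.owner is missing"]
    # Owner refs may be written as "group:default/payments-team";
    # compare on the bare name only (a simplification that ignores namespaces).
    bare_name = owner.split(":")[-1].split("/")[-1]
    if bare_name not in known_groups:
        return [f"{name}: owner '{owner}' is not a known group"]
    return []


known = {"payments-team", "checkout-team"}
ok = {"metadata": {"name": "payments-api"}, "spec": {"owner": "payments-team"}}
stale = {"metadata": {"name": "legacy-billing"}, "spec": {"owner": "billing-core"}}

errors = find_owner_errors(ok, known) + find_owner_errors(stale, known)
for error in errors:
    print(error)
# A CI wrapper would exit with a non-zero status when `errors` is non-empty.
```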

Monthly ownership sweep. A cron job that opens issues against services where the listed owner has had no activity in the last 90 days.
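The selection step of such a sweep could look roughly like this. How "activity" is measured is not specified here, so the sketch assumes some other job has already produced a mapping from service name to the date of the owner's last recorded activity. The function name and the sample data are hypothetical, and opening the issues is left out.

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=90)


def stale_services(last_activity: dict, today: date) -> list:
    """Return services whose last owner activity is more than 90 days old."""
    cutoff = today - STALE_AFTER
    return sorted(name for name, seen in last_activity.items() if seen < cutoff)


# Hypothetical input: service name -> date of last owner activity.
last_activity = {
    "payments-api": date(2024, 5, 20),
    "legacy-reports": date(2023, 11, 2),
}

for name in stale_services(last_activity, today=date(2024, 6, 1)):
    print(f"open issue: confirm ownership of {name}")
```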

Tie ownership to on-call. If your catalog's owner field drives your PagerDuty routing, stale ownership has immediate consequences. That consequence keeps it accurate faster than any process doc.

Discovery: The Actual Payoff

Once the catalog has good data, discovery changes fundamentally. Instead of asking a colleague "what does the checkout team depend on?" and waiting for an answer, you query the catalog.

Common discovery patterns:

Dependency path. "Show me every service that transitively depends on payments-api." Before a planned maintenance window, you now know who to notify.
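Under the hood, this query is a breadth-first walk over the reversed dependsOn edges. The sketch below assumes the edges have been exported as a plain mapping from component to its dependencies. The function name is hypothetical, and the sample edges (including reporting-api depending on subscription-service) are assumptions made for the example.

```python
from collections import deque


def transitive_dependents(target: str, depends_on: dict) -> set:
    """Return every component that directly or indirectly depends on target."""
    # Invert the edges: dependency -> components that depend on it.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    found = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for component in dependents.get(current, ()):
            # The `found` check also keeps the walk from looping on cycles.
            if component not in found and component != target:
                found.add(component)
                queue.append(component)
    return found


# Assumed edges for illustration: component -> what it depends on.
depends_on = {
    "checkout-service": ["payments-api"],
    "subscription-service": ["payments-api"],
    "reporting-api": ["subscription-service"],
    "payments-api": ["fraud-detector"],
}

print(sorted(transitive_dependents("payments-api", depends_on)))
```

Joining the result with each component's owner turns it into the notification list for the maintenance window.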

Ownership search. "Show me all components owned by the platform team." Useful for understanding team scope and for cross-team work estimates.

Lifecycle filter. "Show me all services still marked experimental that have been in production for more than six months." A forcing function for lifecycle hygiene.

API consumer graph. "Who is consuming payments-api-v2?" Before you deprecate it, you know exactly which teams need a migration.
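As a rough sketch, the consumer lookup is a filter over component entries grouped by owner. It assumes consumers declare the API in a `consumesApis` list under `spec`, as the Backstage component format allows. The function name and the sample entries are hypothetical.

```python
def api_consumers(api_name: str, components: list) -> dict:
    """Group the names of components that consume api_name by owning team."""
    by_owner = {}
    for component in components:
        spec = component.get("spec", {})
        if api_name in spec.get("consumesApis", []):
            owner = spec.get("owner", "unowned")
            by_owner.setdefault(owner, []).append(component["metadata"]["name"])
    return by_owner


# Hypothetical catalog entries, already parsed from YAML.
components = [
    {"metadata": {"name": "checkout-service"},
     "spec": {"owner": "checkout-team", "consumesApis": ["payments-api-v2"]}},
    {"metadata": {"name": "fraud-detector"},
     "spec": {"owner": "payments-team"}},
]

for owner, names in api_consumers("payments-api-v2", components).items():
    print(f"{owner}: {', '.join(names)}")
```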

flowchart LR
  Q[Who depends on payments-api?] --> C[(Catalog)]
  C --> R1[checkout-service / checkout-team]
  C --> R2[subscription-service / billing-team]
  C --> R3[reporting-api / data-team]
  R1 & R2 & R3 --> N[Notify these teams of maintenance window]

Keeping It Populated

The catalog is only as good as its coverage. Strategies that work:

Scaffold it in. As covered in post 3, every service template ships with a pre-filled catalog-info.yaml. New services are born registered.

Bulk import from existing sources. GitHub org, AWS resource tags, Kubernetes labels — all of these can seed initial catalog entries. You won't get full metadata, but you get a starting list that teams can fill in.
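A seeding pass might generate one skeleton entry per repository, along these lines. The sketch assumes you already have a list of repository names from your GitHub org (fetching it is omitted). `skeleton_entity` is a hypothetical helper, and the `unknown` values are placeholders chosen for this example, meant to be replaced by the owning team.

```python
def skeleton_entity(org: str, repo: str) -> dict:
    """Build a minimal Component entry for a repository with unknown metadata."""
    return {
        "apiVersion": "backstage.io/v1alpha1",
        "kind": "Component",
        "metadata": {
            "name": repo,
            "annotations": {"github.com/project-slug": f"{org}/{repo}"},
        },
        "spec": {
            "type": "service",       # a guess; the owning team corrects it later
            "lifecycle": "unknown",  # placeholder value, not a real lifecycle
            "owner": "unknown",      # placeholder value, flags the entry for follow-up
        },
    }


repos = ["payments-api", "checkout-service"]
seeds = [skeleton_entity("my-org", repo) for repo in repos]
for seed in seeds:
    print(seed["metadata"]["name"], seed["spec"]["owner"])
```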

Gamification (lightly). A "catalog score" that shows completeness percentage per component. Teams with 100% get a label. Below 60% triggers a Slack message to the owner. Feels silly; works surprisingly well.
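A catalog score can be as simple as the share of expected fields that are filled in. In the sketch below, the list of required fields is an assumption (pick whatever your organisation cares about), and `catalog_score` and `score_action` are hypothetical names. The 100% and 60% thresholds are the ones mentioned above.

```python
# Hypothetical set of fields that count toward completeness.
REQUIRED_FIELDS = [
    ("metadata", "description"),
    ("spec", "owner"),
    ("spec", "lifecycle"),
    ("spec", "system"),
]


def catalog_score(entity: dict) -> int:
    """Return the percentage of required fields that are present and non-empty."""
    filled = sum(
        1 for section, field in REQUIRED_FIELDS
        if entity.get(section, {}).get(field)
    )
    return round(100 * filled / len(REQUIRED_FIELDS))


def score_action(score: int) -> str:
    """Map a score to a follow-up: a label at 100%, an owner nudge below 60%."""
    if score == 100:
        return "label"
    if score < 60:
        return "notify-owner"
    return "none"


partial = {"metadata": {"name": "payments-api"},
           "spec": {"owner": "payments-team", "lifecycle": "production"}}
score = catalog_score(partial)
print(score, score_action(score))
```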

Never make it optional for incidents. If PagerDuty routes to the catalog owner, missing or stale catalog data directly affects who gets paged. Operational pressure is the strongest data quality incentive.

Key Takeaways

  • A service catalog answers "who owns this?" in seconds — the absence of one shows up most painfully during incidents when there's no time to do archaeology.
  • The dependency graph encoded in the catalog is the highest-leverage artifact for incident response, migration planning, and blast-radius analysis.
  • catalog-info.yaml in the repository root is the practical source of truth; annotations connect it to dashboards, runbooks, and on-call schedules.
  • Ownership data goes stale predictably — validate it in CI and tie it to on-call routing so staleness has immediate operational consequences.
  • Discovery is the payoff: querying the catalog for dependency paths and API consumers transforms a maintenance window from a guessing game into a notification list.
  • Catalog coverage compounds — seed from existing sources, scaffold it into every new service, and use gamification lightly to drive completeness.