Platform Engineering

Self-Service Infra

Ravinder · 6 min read
Platform Engineering · DevOps · IDP · Terraform · Infrastructure as Code

The most common way a platform team becomes a bottleneck is through infrastructure provisioning. A product team needs an RDS instance. They open a ticket. The platform team reviews it in the next sprint. Four days later, a database exists. Everyone is slightly frustrated, the product team wonders why they can't just click "create database" in the AWS console, and the platform team is buried in tickets they didn't sign up to process forever.

Self-service infra is the answer. But "self-service" done wrong means product teams using the AWS console directly, creating ungoverned resources that violate compliance policies, accumulate costs nobody tracks, and drift from the desired state the moment someone logs in to "just check something."

The goal is self-service with guardrails: product teams move fast, the platform team's standards travel with them.

The Module Pattern

The foundational building block is a versioned Terraform module that encodes your standards. Product teams consume the module; they don't write infrastructure from scratch.

# modules/postgres/main.tf — platform-owned module
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
}
 
variable "service_name" {
  description = "Owning service name (used in resource tags and naming)"
  type        = string
}
 
variable "environment" {
  description = "Deployment environment"
  type        = string
  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}
 
variable "instance_class" {
  description = "RDS instance class"
  type        = string
  default     = "db.t4g.medium"
  validation {
    condition     = startswith(var.instance_class, "db.t4g") || startswith(var.instance_class, "db.r7g")
    error_message = "Only Graviton instance classes are approved for cost efficiency."
  }
}
 
resource "aws_db_instance" "this" {
  identifier          = "${var.service_name}-${var.environment}"
  engine              = "postgres"
  engine_version      = "16"
  instance_class      = var.instance_class
  allocated_storage   = 20
  storage_encrypted   = true          # non-negotiable
  deletion_protection = var.environment == "production"
 
  username                    = "app_admin"
  manage_master_user_password = true  # credentials live in Secrets Manager, never in code
 
  backup_retention_period = var.environment == "production" ? 7 : 1
  backup_window           = "03:00-04:00"
 
  tags = {
    service     = var.service_name
    environment = var.environment
    managed-by  = "terraform"
    cost-center = var.service_name   # critical for post 8
  }
}

Notice what the module enforces without negotiation: encryption at rest, deletion protection in production, backup retention, and cost-attribution tags. These aren't options — they're baked in. Product teams can choose instance class (within approved types) and environment. Everything else is standardised.
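Those guardrails bite at plan time, not at review time. As a sketch, a hypothetical consumer passing a non-approved instance class is rejected by the module's validation block before any resource is created:

```hcl
# Hypothetical consumer config: fails `terraform plan` with the module's
# error_message ("Only Graviton instance classes are approved...").
module "database" {
  source = "git::https://github.com/my-org/platform-modules//modules/postgres?ref=v2.1.0"

  service_name   = "reports-api"   # hypothetical service
  environment    = "staging"
  instance_class = "db.m5.large"   # not db.t4g/db.r7g, so validation rejects it
}
```

The feedback loop is seconds in the consumer's own terminal, rather than a review comment days later.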

Module Registry and Versioning

Platform modules need to be versioned like any other public API. Teams consuming v1.0.0 should not suddenly find their infrastructure behaviour changing because the platform team pushed a fix.

# product team's service.tf
module "database" {
  source  = "git::https://github.com/my-org/platform-modules//modules/postgres?ref=v2.1.0"
 
  service_name   = "payments-api"
  environment    = var.environment
  instance_class = "db.r7g.large"   # override for payments load
}

graph LR
  T[Product Team] -->|consumes| M[Platform Module v2.1.0]
  M -->|enforces| G1[Encryption at rest]
  M -->|enforces| G2[Cost attribution tags]
  M -->|enforces| G3[Deletion protection]
  M -->|allows| O1[Instance class choice]
  M -->|allows| O2[Storage size]
  P[Platform Team] -->|publishes| M
  P -->|announces deprecation| V1[v1.x.x]

Use semantic versioning. Publish a changelog. Give teams at least two months' notice before deprecating a major version. Treat your internal module consumers with the same respect you'd give external API consumers.

Policy Guardrails with OPA / Checkov

Modules handle standards at the Terraform level. Policies handle standards at the plan level — catching violations before terraform apply runs.

# checkov custom policy — deny public S3 buckets
# platform-policies/CKV_CUSTOM_1.py
from checkov.common.models.enums import CheckCategories, CheckResult
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck
 
class S3BucketPublicAccessCheck(BaseResourceCheck):
    def __init__(self):
        name = "Ensure S3 bucket blocks public access"
        id = "CKV_CUSTOM_1"
        supported_resources = ["aws_s3_bucket_public_access_block"]
        categories = [CheckCategories.GENERAL_SECURITY]
        super().__init__(name=name, id=id, categories=categories,
                         supported_resources=supported_resources)
 
    def scan_resource_conf(self, conf):
        # Checkov parses each attribute value as a single-element list
        if conf.get("block_public_acls") == [True] and \
           conf.get("block_public_policy") == [True]:
            return CheckResult.PASSED
        return CheckResult.FAILED
 
check = S3BucketPublicAccessCheck()   # Checkov discovers custom checks via module-level instances

Wire both tools into CI so every infrastructure pull request is checked before merge:

# .github/workflows/infra-pr-check.yml
name: Infrastructure PR checks
on:
  pull_request:
    paths:
      - '**.tf'
 
jobs:
  policy-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
 
      - name: Run Checkov
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: .
          config_file: .checkov.yaml
          external_checks_dir: platform-policies/
 
      - name: Terraform plan
        run: |
          terraform init
          terraform plan -out=tfplan
 
      - name: OPA policy evaluation
        run: |
          terraform show -json tfplan > plan.json
          opa eval --data platform-policies/terraform.rego \
                   --input plan.json \
                   "data.terraform.deny" \
                   --fail-defined

Drift Detection

Drift is the gap between what Terraform thinks exists and what actually exists in the cloud account. It happens when someone logs into the console "just to check something" and adjusts a security group rule. It happens during incidents. It happens during migrations.

Left undetected, drift means your IaC is no longer the source of truth — and the next terraform apply might change something you're not expecting.

# Scheduled drift detection via GitHub Actions
# .github/workflows/drift-detect.yml
name: Drift detection
on:
  schedule:
    - cron: '0 6 * * *'   # daily at 6am
 
jobs:
  detect-drift:
    strategy:
      matrix:
        workspace: [dev, staging, production]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
 
      - name: terraform plan (drift check)
        id: plan
        run: |
          terraform init -backend-config=backends/${{ matrix.workspace }}.hcl
          terraform workspace select ${{ matrix.workspace }}
          set +e   # exit code 2 means drift, not failure; don't abort the step
          terraform plan -detailed-exitcode -out=tfplan 2>&1 | tee plan_output.txt
          echo "exitcode=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"   # plan's exit code, not tee's
 
      - name: Alert on drift
        if: steps.plan.outputs.exitcode == '2'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Drift detected in ${{ matrix.workspace }} — <${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View plan>"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.PLATFORM_ALERTS_SLACK }}

Exit code 2 from terraform plan -detailed-exitcode means the plan succeeded and changes are pending, i.e. drift was detected. Exit code 0 means clean. Exit code 1 means the plan itself errored.
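
Those semantics are easy to see in isolation. A small shell sketch (with the exit code passed in as an argument, so it runs without Terraform) mapping each code to the action the workflow above takes:

```shell
# classify_plan: turn `terraform plan -detailed-exitcode` semantics into an action
classify_plan() {
  case "$1" in
    0) echo "clean" ;;         # no changes: state matches reality
    2) echo "drift" ;;         # plan succeeded, changes pending: alert
    *) echo "plan-error" ;;    # 1 (or anything else): the plan itself failed
  esac
}

classify_plan 0   # -> clean
classify_plan 2   # -> drift
classify_plan 1   # -> plan-error
```

The important subtlety is that 2 is a success from the scheduler's point of view, which is why the workflow captures it explicitly instead of letting the step fail.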

Key Takeaways

  • Self-service infra succeeds when product teams can provision infrastructure without filing a ticket and without the platform team's standards being optional.
  • Versioned Terraform modules are the primary guardrail mechanism — bake in non-negotiables (encryption, cost tags, deletion protection) while leaving appropriate choices to consumers.
  • Treat internal module consumers like external API consumers: semantic versioning, changelogs, and deprecation notice periods.
  • Policy tools (Checkov, OPA) catch violations at plan time in CI, before infrastructure is created — far cheaper than remediation after the fact.
  • Drift is inevitable; the question is whether you detect it daily or discover it during an incident. Scheduled plan checks with alerting make drift visible.
  • The measure of success is not module count — it's whether product teams can ship infrastructure changes without involving the platform team in the critical path.