Self-Service Infra
The most common way a platform team becomes a bottleneck is through infrastructure provisioning. A product team needs an RDS instance. They open a ticket. The platform team reviews it in the next sprint. Four days later, a database exists. Everyone is slightly frustrated, the product team wonders why they can't just click "create database" in the AWS console, and the platform team is buried in tickets they didn't sign up to process forever.
Self-service infra is the answer. But "self-service" done wrong means product teams using the AWS console directly, creating ungoverned resources that violate compliance policies, accumulate costs nobody tracks, and drift from the desired state the moment someone logs in to "just check something."
The goal is self-service with guardrails: product teams move fast, the platform team's standards travel with them.
The Module Pattern
The foundational building block is a versioned Terraform module that encodes your standards. Product teams consume the module; they don't write infrastructure from scratch.
# modules/postgres/main.tf (platform-owned module)
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
}

variable "service_name" {
  description = "Owning service name (used in resource tags and naming)"
  type        = string
}

variable "environment" {
  description = "Deployment environment"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}

variable "instance_class" {
  description = "RDS instance class"
  type        = string
  default     = "db.t4g.medium"

  validation {
    # The trailing dot keeps the match to the exact instance family.
    condition     = startswith(var.instance_class, "db.t4g.") || startswith(var.instance_class, "db.r7g.")
    error_message = "Only approved Graviton instance classes (db.t4g, db.r7g) are allowed, for cost efficiency."
  }
}

resource "aws_db_instance" "this" {
  identifier              = "${var.service_name}-${var.environment}"
  engine                  = "postgres"
  engine_version          = "16"
  instance_class          = var.instance_class
  allocated_storage       = 20
  storage_encrypted       = true # non-negotiable
  deletion_protection     = var.environment == "production"
  backup_retention_period = var.environment == "production" ? 7 : 1
  backup_window           = "03:00-04:00"

  # Master credentials and networking arguments are omitted here for brevity;
  # a real module has to supply them.

  tags = {
    service     = var.service_name
    environment = var.environment
    managed-by  = "terraform"
    cost-center = var.service_name # critical for post 8
  }
}

Notice what the module enforces without negotiation: encryption at rest, deletion protection in production, backup retention, and cost-attribution tags. These aren't options; they're baked in. Product teams can choose instance class (within approved types) and environment. Everything else is standardised.
Module Registry and Versioning
Platform modules need to be versioned like any other public API. Teams consuming v1.0.0 should not suddenly find their infrastructure behaviour changing because the platform team pushed a fix.
# product team's service.tf
module "database" {
  source         = "git::https://github.com/my-org/platform-modules//modules/postgres?ref=v2.1.0"
  service_name   = "payments-api"
  environment    = var.environment
  instance_class = "db.r7g.large" # override for payments load
}

Use semantic versioning. Publish a changelog. Give teams at least two months' notice before deprecating a major version. Treat your internal module consumers with the same respect you'd give external API consumers.
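One way to make the deprecation policy checkable is to scan consumers' Terraform files for pinned module versions. The following is a minimal sketch, not part of any tooling described in this post: it assumes module sources use the `?ref=vX.Y.Z` Git form shown above, and the names `find_module_refs`, `unsupported_refs`, the `min_supported_major` threshold, and the `redis` module in the sample are all hypothetical.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class ModuleRef(NamedTuple):
    source: str
    version: Optional[Tuple[int, ...]]  # None when the ref is not a vX.Y.Z tag


# Matches lines such as: source = "git::https://...//modules/postgres?ref=v2.1.0"
_SOURCE_RE = re.compile(r'source\s*=\s*"(git::[^"]+)"')
_SEMVER_RE = re.compile(r"[?&]ref=v(\d+)\.(\d+)\.(\d+)$")


def find_module_refs(tf_text: str) -> List[ModuleRef]:
    """Extract Git module sources and their pinned semantic versions."""
    refs = []
    for source in _SOURCE_RE.findall(tf_text):
        match = _SEMVER_RE.search(source)
        version = tuple(int(part) for part in match.groups()) if match else None
        refs.append(ModuleRef(source, version))
    return refs


def unsupported_refs(refs: List[ModuleRef], min_supported_major: int) -> List[ModuleRef]:
    """Return refs that are unpinned or pinned below the supported major version."""
    return [
        ref for ref in refs
        if ref.version is None or ref.version[0] < min_supported_major
    ]


# Sample input; the "cache" module and its "main" ref are made up for illustration.
example = '''
module "database" {
  source = "git::https://github.com/my-org/platform-modules//modules/postgres?ref=v2.1.0"
}
module "cache" {
  source = "git::https://github.com/my-org/platform-modules//modules/redis?ref=main"
}
'''

refs = find_module_refs(example)
flagged = unsupported_refs(refs, min_supported_major=2)
```

Here `flagged` contains only the `cache` module, because it tracks a branch rather than a tagged release. A report like this could feed the deprecation notices sent to consuming teams.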
Policy Guardrails with OPA / Checkov
Modules handle standards at the Terraform level. Policies handle standards in CI: Checkov scans the Terraform source, OPA evaluates the JSON plan, and both catch violations before terraform apply runs.
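To make the plan-level idea concrete, here is a minimal Python sketch of the kind of rule a plan policy expresses. It is an illustration under stated assumptions, not the contents of any real policy file: it reads the `resource_changes` list from `terraform show -json` output, and the function name `deny` and the `REQUIRED_TAGS` set are hypothetical, chosen to mirror the tags the postgres module applies.

```python
from typing import Any, Dict, List

# Assumed tag policy, mirroring the tags set by the platform postgres module.
REQUIRED_TAGS = {"service", "environment", "managed-by", "cost-center"}


def deny(plan: Dict[str, Any]) -> List[str]:
    """Return one message per violation found in a Terraform JSON plan."""
    violations = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        after = change.get("after")
        actions = set(change.get("actions", []))
        # Only inspect resources being created or updated; `after` is null on destroy.
        if after is None or not actions & {"create", "update"}:
            continue
        if rc.get("type") != "aws_db_instance":
            continue
        if after.get("storage_encrypted") is not True:
            violations.append(f"{rc['address']}: storage_encrypted must be true")
        missing = REQUIRED_TAGS - set((after.get("tags") or {}).keys())
        if missing:
            violations.append(f"{rc['address']}: missing tags {sorted(missing)}")
    return violations


# Hand-written plan fragment for illustration; real plans carry many more fields.
sample_plan = {
    "resource_changes": [
        {
            "address": "aws_db_instance.legacy",
            "type": "aws_db_instance",
            "change": {
                "actions": ["create"],
                "after": {
                    "storage_encrypted": False,
                    "tags": {
                        "service": "legacy",
                        "environment": "dev",
                        "managed-by": "terraform",
                    },
                },
            },
        }
    ]
}

messages = deny(sample_plan)
```

For the sample plan, `messages` holds two entries: one for the unencrypted instance and one for the missing `cost-center` tag. An empty list means the plan passes.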
# checkov custom policy: deny public S3 buckets
# platform-policies/CKV_CUSTOM_1.py
from checkov.common.models.enums import CheckCategories, CheckResult
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck


class S3BucketPublicAccessCheck(BaseResourceCheck):
    def __init__(self):
        name = "Ensure S3 bucket blocks public access"
        id = "CKV_CUSTOM_1"
        supported_resources = ["aws_s3_bucket_public_access_block"]
        categories = [CheckCategories.GENERAL_SECURITY]
        super().__init__(name=name, id=id, categories=categories,
                         supported_resources=supported_resources)

    def scan_resource_conf(self, conf):
        # Checkov wraps each attribute value in a list, hence the [True] comparisons.
        if conf.get("block_public_acls") == [True] and \
                conf.get("block_public_policy") == [True]:
            return CheckResult.PASSED
        return CheckResult.FAILED


# Checkov registers a check only when it is instantiated at module level.
check = S3BucketPublicAccessCheck()

# .github/workflows/infra-pr-check.yml
name: Infrastructure PR checks
on:
  pull_request:
    paths:
      - '**.tf'
jobs:
  policy-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Checkov
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: .
          config_file: .checkov.yaml
          external_checks_dirs: platform-policies/
      # The steps below assume terraform, opa, and cloud credentials are
      # already available on the runner.
      - name: Terraform plan
        run: |
          terraform init
          terraform plan -out=tfplan
      - name: OPA policy evaluation
        run: |
          terraform show -json tfplan > plan.json
          # deny[_] is undefined when the deny set is empty, so --fail-defined
          # fails the job only when there is at least one violation.
          opa eval --data platform-policies/terraform.rego \
            --input plan.json \
            "data.terraform.deny[_]" \
            --fail-defined

Drift Detection
Drift is the gap between what Terraform thinks exists and what actually exists in the cloud account. It happens when someone logs into the console "just to check something" and adjusts a security group rule. It happens during incidents. It happens during migrations.
Left undetected, drift means your IaC is no longer the source of truth — and the next terraform apply might change something you're not expecting.
# Scheduled drift detection via GitHub Actions
# .github/workflows/drift-detect.yml
name: Drift detection
on:
  schedule:
    - cron: '0 6 * * *' # daily at 06:00 UTC
jobs:
  detect-drift:
    strategy:
      matrix:
        workspace: [dev, staging, production]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: terraform plan (drift check)
        id: plan
        run: |
          terraform init -backend-config=backends/${{ matrix.workspace }}.hcl
          terraform workspace select ${{ matrix.workspace }}
          terraform plan -detailed-exitcode -out=tfplan 2>&1 | tee plan_output.txt
          # $? would report tee's status; PIPESTATUS[0] is terraform's.
          code=${PIPESTATUS[0]}
          echo "exitcode=$code" >> "$GITHUB_OUTPUT"
          # Exit 1 is a genuine error, so fail the step; exit 2 (drift) is handled below.
          if [ "$code" -eq 1 ]; then exit 1; fi
      - name: Alert on drift
        if: steps.plan.outputs.exitcode == '2'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Drift detected in ${{ matrix.workspace }}: <${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View plan>"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.PLATFORM_ALERTS_SLACK }}

Exit code 2 from terraform plan -detailed-exitcode means the plan contains changes; when the code itself has not changed, that means drift. Exit code 0 means clean. Exit code 1 means an error.
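The same exit-code contract can be wrapped in a small script for use outside GitHub Actions. This is a rough sketch under stated assumptions: `PlanStatus`, `classify_exit_code`, and `check_drift` are hypothetical names, and `check_drift` assumes the `terraform` binary is on the PATH and the working directory is already initialised.

```python
import subprocess
from enum import Enum


class PlanStatus(Enum):
    CLEAN = "clean"
    DRIFT = "drift"
    ERROR = "error"


def classify_exit_code(code: int) -> PlanStatus:
    """Map a `terraform plan -detailed-exitcode` exit code to a status."""
    if code == 0:
        return PlanStatus.CLEAN
    if code == 2:
        return PlanStatus.DRIFT
    # 1 is the documented error code; treat anything else unexpected as an error too.
    return PlanStatus.ERROR


def check_drift(workdir: str, terraform_bin: str = "terraform") -> PlanStatus:
    """Run a plan in `workdir` and report whether it found changes."""
    result = subprocess.run(
        [terraform_bin, "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_exit_code(result.returncode)
```

Calling `check_drift("path/to/stack")` returns `PlanStatus.DRIFT` when the plan is non-empty. Because the exit code is read directly from the subprocess, there is no pipeline in which it can be lost.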
Key Takeaways
- Self-service infra succeeds when product teams can provision infrastructure without filing a ticket, while the platform team's standards remain mandatory.
- Versioned Terraform modules are the primary guardrail mechanism — bake in non-negotiables (encryption, cost tags, deletion protection) while leaving appropriate choices to consumers.
- Treat internal module consumers like external API consumers: semantic versioning, changelogs, and deprecation notice periods.
- Policy tools (Checkov, OPA) catch violations at plan time in CI, before infrastructure is created — far cheaper than remediation after the fact.
- Drift is inevitable; the question is whether you detect it daily or discover it during an incident. Scheduled plan checks with alerting make drift visible.
- The measure of success is not module count — it's whether product teams can ship infrastructure changes without involving the platform team in the critical path.