Cloud Cost Engineering

Tags, Tags, Tags

Ravinder · 5 min read
Cloud Cost · FinOps · AWS · Tagging · Cost Allocation

Tags are the load-bearing columns of cost allocation. When they are missing or inconsistent, every cost conversation devolves into finger-pointing. When they are right, showback reports run themselves and rightsizing recommendations land in the correct Slack channel. The difference is architecture, not discipline.

Design the Taxonomy First

A tag taxonomy is a contract between engineering and finance. Design it before you enforce it.

Minimum required keys for any non-trivial cloud footprint:

| Tag Key | Format | Example | Required? |
| --- | --- | --- | --- |
| env | lowercase enum | prod, staging, dev | Yes |
| team | slug | platform, payments, data | Yes |
| product | slug | checkout, analytics | Yes |
| cost-center | CC- prefix plus digits | CC-1042 | Yes |
| managed-by | enum | terraform, cdk, manual | Yes |
| repo | short repo name | infra-core | Recommended |

Keep keys lowercase with hyphens. Mixed case and underscores create duplicate dimensions in Cost Explorer, because env, Env, and ENV are three different keys. Enforce a short allowlist of values for grouping keys like team and env: free-text values fragment into near-duplicates and are unusable for grouping.
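The rules above are mechanical enough to check in code. The following is a minimal sketch of a validator, assuming the allowlists shown in the table; `ALLOWED_VALUES`, `REQUIRED_KEYS`, and `validate_tags` are hypothetical names, and the real values would come from the shared taxonomy document.

```python
import re

# Hypothetical allowlists taken from the example column of the table above.
ALLOWED_VALUES = {
    "env": {"prod", "staging", "dev"},
    "team": {"platform", "payments", "data"},
    "managed-by": {"terraform", "cdk", "manual"},
}
REQUIRED_KEYS = ("env", "team", "product", "cost-center", "managed-by")

SLUG = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")  # lowercase words joined by hyphens
COST_CENTER = re.compile(r"^CC-\d+$")           # assumed format, based on CC-1042


def validate_tags(tags):
    """Return a list of problems; an empty list means the tags conform."""
    problems = []
    for key in tags:
        if key.startswith("aws:"):
            continue  # AWS-generated tags are outside the taxonomy
        if not SLUG.match(key):
            problems.append(f"key '{key}' is not lowercase-with-hyphens")
    for key in REQUIRED_KEYS:
        if key not in tags:
            problems.append(f"missing required key '{key}'")
    for key, allowed in ALLOWED_VALUES.items():
        if key in tags and tags[key] not in allowed:
            problems.append(f"value '{tags[key]}' is not allowed for '{key}'")
    if "product" in tags and not SLUG.match(tags["product"]):
        problems.append(f"product '{tags['product']}' is not a slug")
    if "cost-center" in tags and not COST_CENTER.match(tags["cost-center"]):
        problems.append(f"cost-center '{tags['cost-center']}' is malformed")
    return problems


if __name__ == "__main__":
    print(validate_tags({"Env": "prod", "team": "Payments Team"}))
```

A check like this fits naturally in CI or a pre-commit hook, where a failed validation costs a few seconds rather than a mislabeled month of spend.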

Enforcement via Service Control Policies

SCPs block resource creation when mandatory tags are absent. This is the enforcement layer. Add it to every non-root OU.

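{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireTeamTag",
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances",
        "rds:CreateDBInstance",
        "elasticloadbalancing:CreateLoadBalancer",
        "eks:CreateCluster",
        "s3:CreateBucket"
      ],
      "Resource": "*",
      "Condition": { "Null": { "aws:RequestTag/team": "true" } }
    },
    {
      "Sid": "RequireEnvTag",
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances",
        "rds:CreateDBInstance",
        "elasticloadbalancing:CreateLoadBalancer",
        "eks:CreateCluster",
        "s3:CreateBucket"
      ],
      "Resource": "*",
      "Condition": { "Null": { "aws:RequestTag/env": "true" } }
    },
    {
      "Sid": "RequireCostCenterTag",
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances",
        "rds:CreateDBInstance",
        "elasticloadbalancing:CreateLoadBalancer",
        "eks:CreateCluster",
        "s3:CreateBucket"
      ],
      "Resource": "*",
      "Condition": { "Null": { "aws:RequestTag/cost-center": "true" } }
    }
  ]
}

Each required tag gets its own statement because keys listed under a single condition operator are ANDed: the combined form would deny a request only when all three tags are missing. Before rollout, confirm that every listed action supports aws:RequestTag, and scope ec2:RunInstances to the instance and volume ARNs as AWS's own examples do. Where the key is not populated, the Null check matches and the action is denied outright.

SCPs are preventive controls. They do not fix the past; that is a separate problem.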
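Hand-editing the policy every time the taxonomy changes invites drift. As a rough sketch, the policy document could be generated from the list of required keys. `build_tag_scp` is a hypothetical helper, not part of any AWS SDK, and it emits one Deny statement per key so that a request missing any single tag is rejected.

```python
import json


def build_tag_scp(required_keys, actions):
    """Build an SCP document with one Deny statement per required tag key."""
    statements = []
    for key in required_keys:
        # Sid values must be alphanumeric, so fold "cost-center" into "CostCenter".
        sid = "Require" + "".join(p.capitalize() for p in key.split("-")) + "Tag"
        statements.append({
            "Sid": sid,
            "Effect": "Deny",
            "Action": list(actions),
            "Resource": "*",
            # Null == "true" matches when the tag is absent from the create request.
            "Condition": {"Null": {f"aws:RequestTag/{key}": "true"}},
        })
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    policy = build_tag_scp(
        ["team", "env", "cost-center"],
        ["ec2:RunInstances", "rds:CreateDBInstance"],
    )
    print(json.dumps(policy, indent=2))
```

SCPs have a tight size quota, so checking the length of the compact JSON before attaching the policy is a cheap safeguard as the key list grows.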

Retroactive Cleanup with AWS Config

AWS Config can identify untagged resources. Pair it with a Lambda remediation.

import json

import boto3

REQUIRED_KEYS = ['team', 'env', 'cost-center']
PLACEHOLDER = 'UNTAGGED-NEEDS-REMEDIATION'

# Create the client once per execution environment, not on every invocation
ec2 = boto3.client('ec2')


def lambda_handler(event, context):
    """
    Config remediation: tag EC2 instances that are missing required keys.
    Expects the event AWS Config sends to a custom rule's Lambda function:
    an 'invokingEvent' JSON string that contains a 'configurationItem'.
    """
    # Parse the Config event
    invoking_event = json.loads(event['invokingEvent'])
    resource = invoking_event.get('configurationItem')
    if not resource:
        # Some notification types carry no configuration item
        return {'status': 'skipped', 'reason': 'no configuration item'}

    resource_id = resource['resourceId']

    if resource['resourceType'] != 'AWS::EC2::Instance':
        return {'status': 'skipped', 'reason': 'unsupported resource type'}

    if resource.get('configurationItemStatus') == 'ResourceDeleted':
        return {'status': 'skipped', 'reason': 'resource deleted'}

    # Config delivers tags as a {key: value} map, not a list of pairs
    existing_tags = resource.get('tags') or {}
    missing = [k for k in REQUIRED_KEYS if k not in existing_tags]

    if not missing:
        return {'status': 'compliant'}

    # Apply placeholder tags; a human must fill in real values
    placeholder_tags = [{'Key': k, 'Value': PLACEHOLDER} for k in missing]
    ec2.create_tags(Resources=[resource_id], Tags=placeholder_tags)

    print(f"Tagged {resource_id} with placeholders for: {missing}")
    return {'status': 'remediated', 'resource': resource_id, 'keys': missing}

Placeholder values are intentional. They surface in Cost Explorer as UNTAGGED-NEEDS-REMEDIATION, which is embarrassing enough to trigger human action.

Tag Propagation Architecture

flowchart LR
  subgraph IaC["Infrastructure as Code"]
    TF[Terraform Modules]
    CDK[CDK Constructs]
  end
  subgraph Enforcement["Enforcement Layer"]
    SCP[Service Control Policy]
    CFN[CloudFormation Stack Tags]
    TFP[Terraform Provider Tags]
  end
  subgraph Detection["Detection & Remediation"]
    CFG[AWS Config Rule]
    LMB[Lambda Remediation]
    SNS[SNS Alert to Team]
  end
  subgraph Reporting["Cost Reporting"]
    CE[Cost Explorer]
    CUR[CUR + Athena]
    DASH[Team Dashboards]
  end
  IaC --> Enforcement
  Enforcement -->|blocks non-compliant creates| Detection
  Detection -->|flags missing tags| LMB
  LMB -->|applies placeholders| SNS
  SNS -->|notifies team| Reporting
  Enforcement --> Reporting

Terraform: Default Tags at Provider Level

Stop tagging every resource manually. Set defaults at the provider.

# providers.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      env         = var.environment
      team        = var.team
      cost-center = var.cost_center
      managed-by  = "terraform"
      repo        = var.repo_name
    }
  }
}

# variables.tf
variable "aws_region"   { type = string }
variable "environment"  { type = string }
variable "team"         { type = string }
variable "cost_center"  { type = string }
variable "repo_name"    { type = string }

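Every taggable resource created through this provider inherits the default tags; the AWS provider documents aws_autoscaling_group as the exception, so tag it explicitly. Individual resources can add keys or override a default's value, but they cannot drop a required key, and anything created outside Terraform without the tags is still rejected by the SCP.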
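Default tags can also be verified before apply. The sketch below assumes a plan exported with `terraform show -json`, and reads the `tags_all` attribute that the AWS provider computes for taggable resources. `untagged_resources` and `REQUIRED_KEYS` are hypothetical names, and tag values that are only known after apply will not appear in the plan, so treat this as a first-pass check.

```python
import json
import sys

REQUIRED_KEYS = ("env", "team", "cost-center", "managed-by")


def untagged_resources(plan, required=REQUIRED_KEYS):
    """Map resource address -> missing tag keys for resources the plan creates."""
    findings = {}
    for rc in plan.get("resource_changes", []):
        change = rc.get("change") or {}
        if "create" not in change.get("actions", []):
            continue  # only newly created resources are checked
        after = change.get("after") or {}
        if "tags_all" not in after:
            continue  # resource type exposes no tags through the provider
        tags = after.get("tags_all") or {}
        missing = [k for k in required if k not in tags]
        if missing:
            findings[rc["address"]] = missing
    return findings


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed usage: python check_plan.py plan.json
    with open(sys.argv[1]) as fh:
        problems = untagged_resources(json.load(fh))
    for address, missing in problems.items():
        print(f"{address}: missing {', '.join(missing)}")
    sys.exit(1 if problems else 0)
```

Running it in CI after `terraform plan -out=tfplan` and `terraform show -json tfplan > plan.json` fails the pipeline when a new resource would ship without the required keys.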

Measuring Tag Coverage

Query CUR to measure coverage — run this weekly and track the trend.

WITH usage_rows AS (
  SELECT
    DATE_TRUNC('week', line_item_usage_start_date) AS week,
    line_item_resource_id,
    -- COALESCE covers both NULL and empty-string tag columns;
    -- remediation placeholders do not count as real tags
    (    COALESCE(resource_tags_user_team, '')        NOT IN ('', 'UNTAGGED-NEEDS-REMEDIATION')
     AND COALESCE(resource_tags_user_env, '')         NOT IN ('', 'UNTAGGED-NEEDS-REMEDIATION')
     AND COALESCE(resource_tags_user_cost_center, '') NOT IN ('', 'UNTAGGED-NEEDS-REMEDIATION')
    ) AS fully_tagged
  FROM cur_db.cur_table
  WHERE line_item_usage_start_date >= CURRENT_DATE - INTERVAL '90' DAY
    AND line_item_line_item_type = 'Usage'
    AND line_item_resource_id <> ''   -- skip line items with no resource ID
)
SELECT
  week,
  COUNT(DISTINCT line_item_resource_id) AS total_resources,
  COUNT(DISTINCT CASE WHEN fully_tagged THEN line_item_resource_id END) AS tagged_resources,
  ROUND(
    100.0 * COUNT(DISTINCT CASE WHEN fully_tagged THEN line_item_resource_id END)
    / NULLIF(COUNT(DISTINCT line_item_resource_id), 0)
  , 1) AS coverage_pct
FROM usage_rows
GROUP BY 1
ORDER BY 1;

Set a team OKR: tag coverage above 95 %. Below 80 % means allocation reports are unreliable.
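The thresholds are simple enough to encode next to whatever job runs the weekly query. This is an illustrative sketch only: `coverage_pct`, `classify`, and `weekly_deltas` are hypothetical helpers, and treating exactly 95% as meeting the target is an assumption.

```python
from typing import List, Optional

TARGET_PCT = 95.0      # OKR target from the text
UNRELIABLE_PCT = 80.0  # below this, allocation reports are unreliable


def coverage_pct(tagged: int, total: int) -> Optional[float]:
    """Percentage of resources carrying all required tags, or None with no data."""
    if total == 0:
        return None
    return round(100.0 * tagged / total, 1)


def classify(pct: Optional[float]) -> str:
    """Bucket a coverage percentage against the two thresholds."""
    if pct is None:
        return "no-data"
    if pct >= TARGET_PCT:
        return "on-target"
    if pct >= UNRELIABLE_PCT:
        return "needs-work"
    return "unreliable"


def weekly_deltas(series: List[float]) -> List[float]:
    """Week-over-week change in coverage, in percentage points."""
    return [round(b - a, 1) for a, b in zip(series, series[1:])]


if __name__ == "__main__":
    pct = coverage_pct(412, 500)
    print(pct, classify(pct))
```

A negative entry from `weekly_deltas` is worth an alert on its own, since it usually means a new, untagged creation path has appeared.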

Key Takeaways

  • A tag taxonomy is a contract — finalize keys and allowed values in a shared document before enforcement, not after.
  • SCPs are the only reliable preventive enforcement mechanism; rely on convention alone and coverage will drift within weeks.
  • Retroactive cleanup with placeholder values is better than no cleanup — visible badness motivates teams faster than invisible badness.
  • Terraform default_tags at the provider level eliminates the most common source of missing tags: engineers forgetting per-resource blocks.
  • Tag coverage below 80 % makes cost allocation guesswork; treat it as a P2 incident with an owner and an SLA.
  • Measure coverage weekly from CUR, not from AWS Config alone — CUR reflects what is actually being billed, which is the only number that matters.