# Scaling Terraform Across 50+ Teams: A Native Framework for Platform Engineering


TL;DR: A pure Terraform framework that lets 50+ teams self-service infrastructure by writing simple .tfvars files while the platform team manages opinionated “building blocks.” Smart lookups (s3:bucket_name) enable cross-resource references. When patterns improve, automated scripts generate PRs for all teams—they review terraform plan and inherit improvements without code changes. 85%+ boilerplate reduction, zero preprocessing, fully compatible with Terraform Cloud.

This blog post documents how a platform engineering team built a Terraform framework that scales to 50+ application teams with mixed skill levels—enabling fast, self-service infrastructure deployment while maintaining governance and security standards.

┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ 50+ Teams │ │ Platform │ │ Patterns │
│ Write Simple │─────>│ Manages │─────>│ Improve │
│ tfvars │ │ Building Blocks │ │ Over Time │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │
│ ▼
│ ┌─────────────────┐
│ │ Automated │
│ │ PRs Generated │
│ └─────────────────┘
│ │
│ ▼
│ ┌─────────────────┐
│ │ Teams Review │
│ │ terraform plan │
│ └─────────────────┘
│ │
│ ▼
│ ┌─────────────────┐
└──────────────────│ Approve & Apply│
(updates) │ Stay Current │
└─────────────────┘

The Challenge: Platform teams face an impossible trade-off: let teams write their own Terraform (resulting in inconsistent, outdated implementations) or manually review and update every workload (an approach that doesn’t scale beyond ~10 teams).

The Solution: A native Terraform framework that separates configuration (what teams deploy) from implementation (how it’s deployed securely). Application teams write simple .tfvars files, platform team manages opinionated “building blocks” that evolve over time. When patterns improve (adding VPC, encryption, monitoring), automated scripts generate PRs for all teams—they review terraform plan and approve, inheriting improvements without code changes.

Key Innovation: Native Terraform “smart lookups” (s3:bucket_name, lambda:function_name) allow cross-resource references while maintaining the separation. No preprocessing, no code generation—pure Terraform compatible with standard tooling and Terraform Cloud.

Target Audiences

  • Platform Engineers: Detailed implementation of the lookup mechanism and building block architecture
  • DevOps/SRE Teams: Comparison with Terragrunt/Terraspace and practical benefits
  • Cloud Architects: Strategic value and governance capabilities
  • Technical Leaders: Development velocity improvements and complexity reduction

1. Introduction: Helping Teams Build Faster at Scale

Opening Hook:

“How do you help 50 teams build and deploy infrastructure faster—when they have different levels of AWS and Terraform expertise, need similar-but-not-identical workloads, and your platform team can’t manually review and update every project?”

The Human Challenge: Speed vs. Standards

Picture this familiar scenario:

Your Organization:

  • 50+ application teams building data pipelines, microservices, analytics platforms
  • Mixed skill levels:
    • 20% have AWS experts who know IAM policies inside-out
    • 50% are competent with Terraform but learning AWS services
    • 30% are new to both, just want to deploy their application
  • Platform/DevOps team of 5-10 people responsible for:
    • Cloud governance and security
    • Cost optimization
    • Compliance and best practices
    • Supporting all those teams

What Application Teams Want:

  • Deploy fast: Days, not weeks of waiting
  • Self-service: Don’t wait for platform team approval on every change
  • Focus on their app: Not become AWS/Terraform experts
  • Consistency: “Just tell me what works and let me copy it”

What Platform Team Needs:

  • Enforce standards: Security, tagging, encryption, monitoring
  • Scale support: Can’t grow team 1:1 with application teams
  • Continuous improvement: Patterns evolve as we learn
  • Prevent drift: All workloads stay current with best practices

The Core Problem: Similar Workloads, Different Implementations

When teams write their own Terraform, you get variations of the same infrastructure:

Option 1: Raw Terraform Resources (Maximum Flexibility, Minimum Maintainability)

# Team A writes Lambda in January 2024
resource "aws_lambda_function" "processor_v1" {
  function_name = "processor"
  runtime       = "python3.11"
  # ... 50 lines of configuration
  # Missing: VPC config, proper IAM policies, CloudWatch retention
}

# Team B writes Lambda in March 2024 (learned from Team A's mistakes)
resource "aws_lambda_function" "processor_v2" {
  function_name = "processor"
  runtime       = "python3.12"
  # ... 80 lines of configuration
  # Now includes: VPC, better IAM, but still missing X-Ray tracing
}

# Team C writes Lambda in June 2024 (organization learned best practices)
resource "aws_lambda_function" "processor_v3" {
  function_name = "processor"
  runtime       = "python3.13"
  # ... 120 lines of configuration
  # All best practices: VPC, IAM, X-Ray, proper logging, tags
}

The Problems:

  • Inconsistent implementations: 50 workloads = 50 slightly different Lambda configurations
  • Knowledge doesn’t propagate: Teams A and B don’t benefit from improvements learned by Team C
  • Backporting is impossible: How do you update 50 workloads when security requires KMS encryption?
  • Copy-paste culture: Teams copy from each other, propagating old patterns and bugs
  • Expertise silos: Only AWS experts can write correct infrastructure

Option 2: Standard Terraform Modules (Better Reuse, Still Hard to Evolve)

# Using terraform-aws-modules/lambda/aws
module "lambda" {
  source  = "terraform-aws-modules/lambda/aws"
  version = "4.0.0"

  function_name = "processor"
  # ... still 40+ lines of configuration
  # Better: module handles some best practices
  # Problem: upgrading 50 workloads from v4.0.0 → v5.0.0 is manual work
}

The Problems:

  • Version sprawl: Workloads stuck on different module versions (v3.2, v4.0, v4.5, v5.0)
  • Breaking changes: Module updates require testing every workload
  • Configuration drift: Each team configures modules differently
  • Limited abstraction: Still requires deep AWS knowledge to use correctly
  • Manual upgrades: Someone has to update 50 PRs when a new version releases

The Real Challenge: N×N Complexity

As you improve your infrastructure patterns over time:

  • You learn Lambda should use VPC → Need to update 50 workloads
  • Security requires KMS encryption → Need to update 50 workloads
  • Compliance requires specific tags → Need to update 50 workloads
  • New AWS best practice emerges → Need to update 50 workloads

The math is brutal:

  • 50 workloads × 10 resource types × 5 improvements per year = 2,500 manual updates
  • Each update risks breaking something
  • Each workload drifts further from best practices
  • Teams become afraid to improve shared patterns

Our Solution: True Separation of Code and Configuration

The Insight: What if we could update how infrastructure is created without touching what infrastructure exists?

# Team writes configuration ONCE (2024)
lambda_functions = {
  processor = {
    name    = "processor"
    runtime = "python3.13"
    permissions = {
      s3_read = ["raw_data"]
    }
  }
}

Behind the scenes (managed by platform team):

  • January 2024: Lambda building block v1.0 (basic implementation)
  • March 2024: Lambda building block v1.5 (adds VPC, better IAM)
  • June 2024: Lambda building block v2.0 (adds X-Ray, proper logging)
  • September 2024: Lambda building block v2.5 (adds permission boundaries)

The team’s configuration never changes. The platform team updates the building block implementation, and all 50 workloads automatically get improvements on next terraform apply.
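A hedged sketch of what this looks like on disk (module source and version strings are illustrative): the platform-managed file pins the building block version and simply passes the team’s tfvars through, so a version bump never touches the configuration above.

# managed_by_dp_project_lambda.tf (platform-owned; illustrative sketch)
module "lambda" {
  source   = "app.terraform.io/org/buildingblock-lambda/aws"
  version  = "2.5.0" # bumped by the platform team over time; teams never edit this file
  for_each = var.lambda_functions

  name        = each.value.name
  runtime     = each.value.runtime
  permissions = try(each.value.permissions, {})
}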

This Framework Achieves:

  • Separation of Concerns: Configuration (what) lives in tfvars, implementation (how) lives in building blocks
  • Continuous Improvement: Platform team evolves patterns without breaking workloads
  • Zero Backporting: Workloads automatically inherit improvements
  • Maintained References: Terraform’s powerful dependency graph still works (via smart lookups)
  • Escape Hatch: Teams can still use raw Terraform resources when needed for edge cases

The Innovation: A pure Terraform framework that:

  • Uses colon-separated syntax (s3:bucket_name) for resource references
  • Resolves lookups dynamically using native Terraform expressions
  • Abstracts AWS complexity through opinionated building blocks
  • Works seamlessly with Terraform Cloud and standard workflows
  • Updates centrally but applies individually

Coverage:

  • Handles 90-95% of common workload patterns through building blocks
  • Allows raw Terraform resources alongside building blocks for edge cases
  • Manages N×N complexity (lookups between all resource types)

The Result:

  • Platform team maintains the framework (1 codebase)
  • 50 teams write simple configurations (50 tfvars files)
  • Everyone benefits from continuous improvement
  • No preprocessing, no code generation, pure Terraform

Lifecycle Management: Keeping Up With Scale

The Separation Strategy:

The framework separates two concerns that evolve at different speeds:

  1. Configuration (Team-Owned): What workload resources exist

    • Lives in team repositories as .tfvars files
    • Teams control: which Lambda, what S3 buckets, environment variables
    • Changes infrequently (when application requirements change)
  2. Implementation (Platform-Owned): How resources are created

    • Lives in blueprint repository as managed_by_dp_*.tf files
    • Platform controls: security policies, naming, encryption, monitoring
    • Changes frequently (as patterns improve)

The Update Process:

When the platform team improves patterns (add VPC support, update KMS policies, new monitoring):

# Platform team's workflow
cd blueprint-repository
# Update building block versions, add new features
git commit -m "feat: add X-Ray tracing to Lambda building block"
# Generate PRs for all 50 team repositories
./tools/repo_updater.py --update-all-teams
# Result: 50 automated PRs created
# Each PR updates only managed_by_dp_*.tf files
# Teams' tfvars files are NEVER touched

Team’s Approval Workflow:

# Team receives automated PR: "Update platform code to v2.5"
# PR shows ONLY changes to managed_by_dp_*.tf files
# Team's _project.auto.tfvars is unchanged
# Team reviews terraform plan in PR comments
terraform plan
# Shows: "Lambda function will be updated in-place"
# " + vpc_config { ... }" (new VPC configuration added)
# Team approves and merges
# Terraform Cloud runs terraform apply
# Workload gets new feature automatically

The Math Works:

  • Without this approach: 50 teams × 10 resource types × 5 improvements/year = 2,500 manual updates
  • With this approach: 1 platform team × 1 script × 50 automated PRs = 50 team approvals (30 minutes each)

Platform team effort scales down:

  • From: ~10 person-weeks of manual updates (touching every team’s code)
  • To: ~2 person-days (writing the script, reviewing the automation)

Teams benefit:

  • Receive improvements without doing any work
  • Review and approve changes (maintain control)
  • terraform plan shows exactly what changes
  • Rollback is just reverting the PR

Key Principles:

  1. Teams own configuration: Platform can’t break their workload definitions
  2. Platform owns implementation: Teams benefit from continuous improvement
  3. Automation bridges scale: Scripts generate PRs, teams approve
  4. Terraform validates: Standard plan shows changes before apply
  5. Gradual rollout: Platform can update 5 teams first, validate, then roll to 45 more

This lifecycle separation is what makes the framework sustainable at scale—platform team doesn’t become a bottleneck, teams maintain velocity, everyone stays current with best practices.

TL;DR - Section 1: Platform teams face N×N complexity when updating 50+ workloads with infrastructure improvements. This framework separates configuration (team-owned tfvars) from implementation (platform-owned building blocks). Automated PR generation scales updates: platform improves once, all teams inherit via terraform plan review and approval. Reduces 2,500 manual updates/year to 50 automated PRs.


2. Architecture Overview

┌────────────────────────────────────────────────────────────────────┐
│ Layer 1: tf-common (Shared Foundation) │
├────────────────────────────────────────────────────────────────────┤
│ • Provider Config • Naming Conventions │
│ • VPC/Subnet Data Sources • Platform Info Provider │
└──────────────────┬─────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────┐
│ Layer 2: tf-default (Account-Level) │
├────────────────────────────────────────────────────────────────────┤
│ • KMS Infrastructure Key • S3 Code/Logging Buckets │
│ • IAM Admin Roles • CloudTrail Data │
└──────────────────┬────────────────┬────────────────────────────────┘
│ │
│ (Shared KMS) │ (Code Storage)
▼ ▼
┌────────────────────────────────────────────────────────────────────┐
│ Layer 3: tf-project (Application-Level) │
├────────────────────────────────────────────────────────────────────┤
│ • KMS Data Key • S3 Data Buckets │
│ • Lambda/Glue/Fargate • RDS/Redshift/DynamoDB │
└────────────────────────────────────────────────────────────────────┘

The Three-Layer System

Layer 1: tf-common (Shared Foundation)

  • Provider configuration
  • Naming conventions and context management
  • Shared data sources (VPC, subnets, IAM roles)
  • Platform Information Provider (PIP) integration
  • Used by ALL workloads (updated centrally)

Layer 2: tf-default (Account-Level Resources)

  • S3 code/logging buckets
  • KMS infrastructure keys
  • Lake Formation settings
  • IAM admin roles
  • CloudTrail data logging
  • Deployed ONCE per AWS account

Layer 3: tf-project (Application Resources)

  • S3 data buckets
  • Lambda functions, Glue jobs
  • RDS, Redshift, DynamoDB databases
  • Fargate containers
  • Application-specific KMS keys
  • Deployed MULTIPLE times per account (one per workload)

Composition via Symlinks:

examples/my-workload/
├── _data.tf # User-owned: environment config
├── _project.auto.tfvars # User-owned: workload definition
├── managed_by_dp_common_*.tf -> ../../tf-common/terraform/
├── managed_by_dp_default_*.tf -> ../../tf-default/terraform/
└── managed_by_dp_project_*.tf -> ../../tf-project/terraform/

This creates a complete, runnable Terraform project where terraform plan/apply work directly.
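A minimal sketch of how such a composition could be assembled (file names are illustrative; in practice repo_updater.py manages the links):

# Illustrative only: link platform-managed files next to the team's tfvars
cd examples/my-workload
ln -s ../../tf-common/terraform/managed_by_dp_common_providers.tf .
ln -s ../../tf-project/terraform/managed_by_dp_project_lambda.tf .
terraform init && terraform plan   # runs directly, no wrapper or build step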


3. The Smart Lookup Innovation

The Core Concept

Traditional Terraform:

lambda_functions = {
  processor = {
    environment = {
      BUCKET = "arn:aws:s3:::company-prod-data-raw-bucket-a1b2c3"
    }
    policy_json = jsonencode({
      Statement = [{
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject"]
        Resource = "arn:aws:s3:::company-prod-data-raw-bucket-a1b2c3/*"
      }]
    })
  }
}

With Smart Lookups:

s3_buckets = {
  raw_data = { name = "raw" }
}

lambda_functions = {
  processor = {
    environment = {
      BUCKET = "s3:raw_data" # Resolves to bucket name
    }
    permissions = {
      s3_read = ["raw_data"] # Resolves to full ARN + generates IAM policy
    }
  }
}

How It Works: Pure Terraform Magic

Location: tf-project/terraform/managed_by_dp_project_lookup.tf

Step 1: Build Lookup Maps

The system creates hierarchical lookup maps after resources are created:

lookup_arn_base = merge(var.lookup_arns, {
  "s3_read"       = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].arn }
  "s3_write"      = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].arn }
  "gluejob"       = { for item in keys(var.glue_jobs) : item => module.glue_jobs[item].arn }
  "secret_read"   = { for item in keys(var.secrets) : item => module.secrets[item].arn }
  "dynamodb_read" = { for item in keys(var.dynamodb_databases) : item => module.dynamodb[item].arn }
})

lookup_id_base = merge(var.lookup_ids, {
  "s3"       = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].id }
  "secret"   = { for item in keys(var.secrets) : item => module.secrets[item].id }
  "dynamodb" = { for item in keys(var.dynamodb_databases) : item => module.dynamodb[item].name }
})

Step 2: Resolve References Dynamically

In building block modules (e.g., managed_by_dp_project_lambda.tf):

module "lambda" {
for_each = var.lambda_functions
# Environment variables with smart lookup
environments = {
for type, item in try(each.value.environment, {}) : type =>
try(
local.lookup_id_lambda[split(":", item)[0]][split(":", item)[1]],
item # Fallback to literal value if not a lookup
)
}
# Permissions with smart lookup
permissions = {
for type, items in try(each.value.permissions, {}) : type => [
for item in items :
(
length(split(":", item)) == 2 # Check if it's "type:name" format
? try(
local.lookup_perm_lambda[split(":", item)[0]][split(":", item)[1]],
item
)
: try(
local.lookup_perm_lambda[type][item], # Infer type from permission category
item
)
)
]
}
}

The Magic:

  • split(":", "s3:mybucket")["s3", "mybucket"]
  • local.lookup_id_lambda["s3"]["mybucket"] → actual bucket name
  • local.lookup_perm_lambda["s3_read"]["mybucket"] → actual bucket ARN
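To see the resolution mechanism in isolation, here is a self-contained sketch (names are illustrative) that can be dropped into a scratch .tf file:

locals {
  # Toy lookup table and input value, for illustration only
  lookup_id = { s3 = { mybucket = "companyp-analytics-etl-mybucket" } }
  value     = "s3:mybucket"

  # Same expression shape as the building block uses: resolve if the value
  # parses as "type:name", otherwise fall back to the literal string
  resolved = try(
    local.lookup_id[split(":", local.value)[0]][split(":", local.value)[1]],
    local.value
  )
  # resolved == "companyp-analytics-etl-mybucket"
}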

Step 3: Building Blocks Generate IAM Policies

Building block modules (from Terraform Cloud private registry) automatically generate IAM policies:

module "lambda" {
source = "app.terraform.io/org/buildingblock-lambda/aws"
version = "3.2.0"
permissions = {
s3_read = ["arn:aws:s3:::bucket1", "arn:aws:s3:::bucket2"]
}
create_policy = true # Automatically generates IAM role + policy
}

Inside the building block, it generates:

data "aws_iam_policy_document" "lambda" {
statement {
sid = "S3Read"
effect = "Allow"
actions = ["s3:GetObject*", "s3:GetBucket*", "s3:List*"]
resources = flatten([
var.permissions.s3_read,
[for arn in var.permissions.s3_read : "${arn}/*"]
])
}
}
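The exact wiring inside the module isn’t shown here, but a plausible continuation (resource names are hypothetical) renders that document into an inline policy on the function’s role:

# Hypothetical sketch: attach the generated document to the function's role
resource "aws_iam_role_policy" "lambda" {
  count = var.create_policy ? 1 : 0

  name   = "generated-least-privilege"
  role   = aws_iam_role.lambda.id
  policy = data.aws_iam_policy_document.lambda.json
}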

Supported Lookup Types

For Environment Variables (IDs/Names):

  • s3:bucket_name → S3 bucket name
  • secret:secret_name → Secrets Manager secret ID
  • dynamodb:table_name → DynamoDB table name
  • athena:workgroup_name → Athena workgroup name
  • prefix:suffix → Injects naming prefix + suffix

For Permissions (ARNs):

  • s3_read:bucket / s3_write:bucket → S3 bucket ARN
  • gluejob:job_name → Glue job ARN
  • gluedb:database_name → Glue database name
  • secret_read:secret_name → Secrets Manager ARN
  • dynamodb_read:table / dynamodb_write:table → DynamoDB ARN
  • sqs_read:queue / sqs_send:queue → SQS queue ARN
  • sns_pub:topic → SNS topic ARN

Cross-Account References:

  • acct_prod_glue_tables → All Glue tables in production account
  • acct_dev_kms_all_keys → All KMS keys in dev account
Team tfvars Lookup Tables Building Block AWS Resources
│ │ │ │
│ environment = │ │ │
│ {BUCKET="s3:raw"} │ │ │
├────────────────────>│ │ │
│ │ split(":", "s3:raw")│ │
│ │ → ["s3", "raw"] │ │
│ │ │ │
│ │ lookup_id_lambda │ │
│ │ ["s3"]["raw"] → │ │
│ │ "company...-raw" │ │
│ ├────────────────────>│ │
│ │ resolved name │ │
│ │ │ Create Lambda with │
│ │ │ env BUCKET= │
│ │ │ "company...-raw" │
│ │ ├────────────────────>│
│ │ │ │
│ permissions = │ │ │
│ {s3_read=["raw"]} │ │ │
├────────────────────>│ │ │
│ │ lookup_perm_lambda │ │
│ │ ["s3_read"]["raw"] │ │
│ │ → arn:aws:s3:::... │ │
│ ├────────────────────>│ │
│ │ resolved ARN │ │
│ │ │ Generate IAM policy │
│ │ │ with S3 read actions│
│ │ │ │
│ │ │ Attach policy to │
│ │ │ Lambda role │
│ │ ├────────────────────>│

TL;DR - Section 3: Smart lookups use colon syntax (s3:bucket_name) resolved via native Terraform split() and lookup maps. No preprocessing—pure Terraform expressions. Lookup tables are built after resources are created, then referenced by building blocks to resolve environment variables (IDs) and permissions (ARNs). Building blocks auto-generate IAM policies from the resolved ARNs.


4. Building Block Abstraction

The Philosophy

Building blocks are opinionated Terraform modules that:

  1. Enforce organizational standards (naming, tagging, encryption)
  2. Abstract AWS complexity (IAM policies, VPC configuration)
  3. Provide guardrails (prevent common misconfigurations)
  4. Enable least-privilege by default (automatic policy generation)

Example: S3 Building Block

User Configuration (tfvars):

s3_buckets = {
  raw_data = {
    name                       = "raw"
    backup                     = true
    enable_intelligent_tiering = true
  }
  processed = {
    name = "processed"
    lifecycle_rules = [{
      id              = "archive_old_data"
      transition_days = 90
      storage_class   = "GLACIER"
    }]
  }
}

What the Building Block Does:

module "s3_buckets" {
source = "app.terraform.io/org/buildingblock-s3/aws"
version = "2.1.3"
for_each = var.s3_buckets
# Standardized naming: <prefix>-<workload>-<application>-<name>
prefix = local.prefix # e.g., "companyp" (company + production)
context = local.context # {Env: "prd", Workload: "analytics", Application: "etl"}
name = try(each.value.name, each.key)
# Automatic encryption with workload KMS key
kms_key_arn = local.kms_data_key_arn
# Standardized tags (injected automatically)
# Tags include: Env, Workload, Application, Team, CostCenter, BIVC, CIA, Backup
# Security defaults
block_public_access = true
versioning_enabled = true
# User-specified configuration
backup = each.value.backup
lifecycle_rules = try(each.value.lifecycle_rules, [])
enable_intelligent_tiering = try(each.value.enable_intelligent_tiering, false)
}

Generated Resources:

  • S3 bucket with predictable name: companyp-analytics-etl-raw
  • KMS encryption enabled automatically
  • Bucket policy restricting to VPC endpoints
  • CloudWatch alarms for bucket size
  • Backup plan (if backup = true)
  • All organizational tags applied

Example: Lambda Building Block

User Configuration:

lambda_functions = {
  data_processor = {
    name    = "processor"
    handler = "index.handler"
    runtime = "python3.13"
    memory  = 1024
    timeout = 300

    s3_sourcefile = "s3_file:lambda_processor.zip"

    environment = {
      INPUT_BUCKET  = "s3:raw_data"
      OUTPUT_BUCKET = "s3:processed"
      SECRET_ID     = "secret:db_creds"
    }
    permissions = {
      s3_read     = ["raw_data"]
      s3_write    = ["processed"]
      secret_read = ["db_creds"]
    }
  }
}

What the Building Block Does:

  • Creates Lambda function with standardized name
  • Generates IAM role automatically
  • Generates IAM policy from permissions map
  • Applies permission boundary (security compliance)
  • Injects VPC configuration (subnet IDs, security groups)
  • Resolves environment variables via lookup tables
  • Adds CloudWatch log group with retention policy
  • Applies X-Ray tracing
  • Adds all organizational tags

Generated IAM Policy (automatically):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3Read",
      "Effect": "Allow",
      "Action": ["s3:GetObject*", "s3:GetBucket*", "s3:List*"],
      "Resource": [
        "arn:aws:s3:::companyp-analytics-etl-raw",
        "arn:aws:s3:::companyp-analytics-etl-raw/*"
      ]
    },
    {
      "Sid": "S3Write",
      "Effect": "Allow",
      "Action": ["s3:PutObject*", "s3:DeleteObject*"],
      "Resource": [
        "arn:aws:s3:::companyp-analytics-etl-processed",
        "arn:aws:s3:::companyp-analytics-etl-processed/*"
      ]
    },
    {
      "Sid": "SecretRead",
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:eu-central-1:123456789012:secret:companyp-analytics-etl-db_creds-a1b2c3"
    },
    {
      "Sid": "KMSDecrypt",
      "Effect": "Allow",
      "Action": ["kms:Decrypt"],
      "Resource": "arn:aws:kms:eu-central-1:123456789012:key/abcd1234-..."
    }
  ]
}

5. Dual KMS Key Architecture with Tag-Based Permissions

One of the most elegant security features of this framework is its dual KMS key architecture that balances security isolation with operational flexibility.

The Two-Key System

KMS Infrastructure Key (kms-infra)

  • Scope: One per AWS account (shared across all workloads in that account)
  • Location: Created in tf-default (account-level)
  • Purpose: Encrypts infrastructure resources (CloudWatch Logs, Secrets Manager, SNS, CloudTrail)
  • Naming: ${prefix}-${workload}-kms-infra
  • Example: companyp-analytics-kms-infra

KMS Data Key (kms-data)

  • Scope: One per workload (isolated per application)
  • Location: Created in tf-project (application-level)
  • Purpose: Encrypts data resources (S3 buckets, RDS, DynamoDB, Redshift)
  • Naming: ${prefix}-${workload}-${application}-kms-data
  • Example: companyp-analytics-etl-kms-data

Why Two Keys?

Security Isolation:

  • Data keys are isolated per workload
  • Compromising one workload’s data key doesn’t expose other workloads’ data
  • Infrastructure key is shared for operational resources that need account-wide access

Operational Flexibility:

  • Infrastructure key allows CloudWatch, monitoring, and logging to work across workloads
  • AWS services (Secrets Manager, CloudTrail) can use a single key for account-level operations
  • Data keys remain tightly scoped to application resources

Cost Optimization:

  • Infrastructure resources share one key (CloudWatch logs from many workloads)
  • Only data resources (S3, databases) need separate keys per workload

Tag-Based Permissions: The Magic Sauce

Instead of explicitly listing every IAM role in the KMS key policy (which creates circular dependencies), the infrastructure key uses tag-based permissions:

Implementation in managed_by_dp_common_kms_infra.tf:

module "kms_infrastructure" {
source = "terraform-aws-modules/kms/aws"
create = local.default_deploy # Only in default/account deployment
aliases = ["${local.prefix}-${local.context.Workload}-kms-infra"]
key_statements = [
{
sid = "tag-workload"
principals = [{
type = "AWS"
identifiers = ["arn:aws:iam::${account_id}:root"]
}]
actions = [
"kms:Encrypt*",
"kms:Decrypt*",
"kms:ReEncrypt*",
"kms:GenerateDataKey*",
]
resources = ["*"]
# The key condition: any role with matching Workload tag can use this key
conditions = [{
test = "StringEquals"
variable = "aws:PrincipalTag/Workload"
values = [local.context.Workload]
}]
}
]
}

How It Works:

  1. Every IAM role created by building blocks gets tagged automatically:

    # Lambda IAM role
    tags = {
      Workload    = "analytics"
      Application = "etl"
      Env         = "prd"
    }
  2. KMS key policy allows any role with matching Workload tag:

    • If role has tag Workload = "analytics"
    • And KMS key is for workload analytics
    • Then role can use the key automatically
  3. No circular dependencies:

    • KMS key doesn’t need to know about Lambda roles
    • Lambda roles don’t need to be in KMS key policy
    • Tag matching happens at runtime by AWS IAM

Data Key: Explicit Role Lists

The data key uses a different approach with explicit role lists (avoiding circular dependencies through selective inclusion):

Implementation in managed_by_dp_project_kms_data.tf:

module "kms_data" {
source = "app.terraform.io/org/buildingblock-kms-data/aws"
key_administrators = local.kms_admins
key_users = compact(concat(local.kms_data_key_users, var.kms_data["extra_roles"]))
# Tag-based access for roles with matching tags
key_user_tag_map = {
"Workload" = local.context.Workload
"Application" = local.context.Application
"Env" = local.context.Env
}
}

In managed_by_dp_project_locals.tf:

kms_data_key_users = compact(concat(
  # Admin roles (explicitly listed)
  ["arn:aws:iam::${account_id}:role/${var.role_prefix}-${local.prefix}-DpAdminRole"],
  [local.operatorrole_arn],
  local.transfer_roles,
  local.workflow_roles,
  # Lambda, Glue, Fargate roles are NOT listed here (would cause cycles)
  # Instead, they're granted access via tag-based permissions
  # See comments in code explaining the circular dependency:
  # [for job in var.glue_jobs : "arn:aws:iam::..."],           # CYCLO ERROR!
  # [for function in var.lambda_functions : "arn:aws:iam::..."], # CYCLO ERROR!
))

The data key also supports tag-based access through key_user_tag_map, allowing Lambda/Glue/Fargate roles to access it via their tags without being explicitly listed in the policy.
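How key_user_tag_map turns into policy statements is internal to the building block, but a hedged sketch (variable and resource names are hypothetical) of the likely shape is a single statement whose conditions require every supplied principal tag to match:

# Hypothetical sketch inside the kms-data building block
data "aws_iam_policy_document" "tag_based_key_users" {
  statement {
    sid       = "TagBasedKeyUsers"
    effect    = "Allow"
    actions   = ["kms:Encrypt*", "kms:Decrypt*", "kms:ReEncrypt*", "kms:GenerateDataKey*"]
    resources = ["*"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${var.account_id}:root"]
    }

    # One condition per entry in key_user_tag_map, e.g.
    # { Workload = "analytics", Application = "etl", Env = "prd" }
    dynamic "condition" {
      for_each = var.key_user_tag_map
      content {
        test     = "StringEquals"
        variable = "aws:PrincipalTag/${condition.key}"
        values   = [condition.value]
      }
    }
  }
}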

Practical Example

Scenario: Lambda function needs to:

  • Read encrypted S3 data (data key)
  • Write to CloudWatch Logs (infra key)
  • Access Secrets Manager secret (infra key)

What Happens:

  1. Lambda IAM role is created with tags:

    resource "aws_iam_role" "lambda" {
    name = "app-companyp-analytics-etl-lambda-processor"
    tags = {
    Workload = "analytics"
    Application = "etl"
    Env = "prd"
    }
    }
  2. Lambda can use infrastructure key because:

    • Role has tag Workload = "analytics"
    • KMS infra key checks: aws:PrincipalTag/Workload == "analytics"
    • Access granted for CloudWatch Logs, Secrets Manager
  3. Lambda can use data key because:

    • Role has tags Workload = "analytics" AND Application = "etl" AND Env = "prd"
    • KMS data key checks all three tags match ✓
    • Access granted for S3 data encryption/decryption
  4. Lambda CANNOT use another workload’s data key:

    • Role has Application = "etl"
    • Other workload’s data key requires Application = "reporting"
    • Tag mismatch ✗
    • Access denied

Benefits of This Architecture

1. Automatic Compliance:

  • Every resource is encrypted (mandatory KMS keys injected by building blocks)
  • No way to accidentally create unencrypted resources

2. Zero-Touch Security:

  • Developers never manage KMS permissions manually
  • Building blocks inject the correct KMS key ARN automatically
  • Tag propagation handles access control

3. Workload Isolation:

  • Data from different applications is cryptographically separated
  • Even with compromised IAM credentials, cross-workload data access is prevented

4. Solves Circular Dependencies:

  • KMS keys don’t reference IAM roles directly
  • IAM roles don’t need to be created before KMS keys
  • Tag-based conditions evaluated at runtime

5. Audit Trail:

  • CloudTrail logs show which role (with which tags) accessed which KMS key
  • Security teams can verify tag-based access patterns
  • Compliance reports show encryption coverage

Service-Specific Access

The infrastructure key also includes service-specific statements for AWS services:

CloudWatch Logs:

{
  sid        = "logs"
  principals = [{ type = "Service", identifiers = ["logs.amazonaws.com"] }]
  actions    = ["kms:Encrypt*", "kms:Decrypt*", "kms:GenerateDataKey*"]
  conditions = [{
    test     = "ArnEquals"
    variable = "kms:EncryptionContext:aws:logs:arn"
    values   = ["arn:aws:logs:${region}:${account}:log-group:*"]
  }]
}

Secrets Manager:

{
  sid        = "auto-secretsmanager"
  principals = [{ type = "Service", identifiers = ["secretsmanager.amazonaws.com"] }]
  actions    = ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey"]
  conditions = [
    { test = "StringEquals", variable = "kms:ViaService",
      values = ["secretsmanager.${region}.amazonaws.com"] },
    { test = "StringEquals", variable = "kms:CallerAccount", values = ["${account}"] }
  ]
}

CloudTrail, SNS, EventBridge: Similar service-specific statements allow these AWS services to use the infrastructure key for their operations.

Lookup References

Both keys are available via smart lookups:

# In Lambda/Glue/Fargate tfvars - use data key for data encryption
permissions = {
kms = ["kms_data"] # Resolves to workload's data key ARN
}
# Infrastructure key is injected automatically by building blocks
# (for CloudWatch Logs, environment variable encryption, etc.)

Summary

The dual KMS key architecture demonstrates how thoughtful design can achieve:

  • Security: Strong encryption and workload isolation
  • Developer Experience: Zero manual KMS management
  • Operational Simplicity: Tag-based permissions eliminate complexity
  • Compliance: Automatic encryption enforcement across all resources

This pattern is a cornerstone of the framework’s security model and showcases how infrastructure abstractions can enhance rather than compromise security posture.

┌──────────────────────────────────────────────────────────────────┐
│ KMS Infrastructure Key (Account-Level) │
├──────────────────────────────────────────────────────────────────┤
│ • One Key Per Account │
│ • Encrypts: CloudWatch Logs, Secrets Manager, SNS, CloudTrail │
│ • Tag-Based Access: Workload Tag │
└────────────────────────────────┬─────────────────────────────────┘
│ (Tag Match: Workload)
┌──────────┴──────────┐
│ │
│ Lambda Role │
│ Tagged with: │
│ • Workload=analytics│
│ • Application=etl │
│ • Env=prd │
│ │
└──────────┬──────────┘
│ (Tag Match: All 3 Tags)
┌────────────────────────────────▼─────────────────────────────────┐
│ KMS Data Key (Workload-Level) │
├──────────────────────────────────────────────────────────────────┤
│ • One Key Per Workload │
│ • Encrypts: S3, RDS, DynamoDB, Redshift │
│ • Tag-Based Access: Workload + Application + Env │
└──────────────────────────────────────────────────────────────────┘

TL;DR - Section 5: Dual KMS architecture uses one shared infrastructure key per account (CloudWatch, Secrets Manager) and one data key per workload (S3, databases). Tag-based permissions solve circular dependencies: IAM roles tagged with Workload/Application/Env automatically gain KMS access without being explicitly listed in policies. Infrastructure key checks one tag, data key checks three tags for stronger isolation.


6. Naming Conventions and Context Propagation

The Context System

Input: Tags Module

Every workload defines a tags module:

module "tags" {
source = "app.terraform.io/org/tags/aws"
version = "~> 1.0.0"
environment = "prd"
workload = "analytics"
application = "etl"
bivc = "1234"
cia = "123"
costcenter = "12345"
backup = "Daily"
}

Output: Context Map

context = merge(module.tags.tags, var.context)
# Result: {
# Env: "prd",
# Workload: "analytics",
# Application: "etl",
# BIVC: "1234",
# CIA: "123",
# CostCenter: "12345",
# Backup: "Daily"
# }

Prefix Generation

prefix = "company${substr(local.context.Env, 0, 1)}"
# prd → companyp
# sbx → companys
# dev → companyd

Resource Naming Pattern

${prefix}-${workload}-${application}-${resource_name}

Examples:

  • S3 bucket: companyp-analytics-etl-raw
  • Lambda: companyp-analytics-etl-processor
  • Glue job: companyp-analytics-etl-transform
  • IAM role: companyp-analytics-etl-lambda-processor-role
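Inside a building block, that name can be assembled from the shared context in one expression; a minimal sketch (local names are illustrative):

locals {
  resource_name = join("-", [
    local.prefix,              # "companyp"
    local.context.Workload,    # "analytics"
    local.context.Application, # "etl"
    var.name,                  # "raw"
  ])
  # => "companyp-analytics-etl-raw"
}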

Benefits:

  • Predictable: Resources can be referenced before creation
  • Discoverable: Name reveals environment, workload, and purpose
  • Compliant: Meets organizational naming standards
  • Unique: Prevents naming collisions across teams

7. Circular Dependency Resolution Strategies

The Challenge

Terraform’s dependency graph must be acyclic, but real-world infrastructure often has circular references:

  • Lambda needs IAM role ARN
  • IAM role policy needs Lambda ARN for trust policy
  • KMS key policy needs Lambda role ARN
  • Lambda needs KMS key ARN for environment variables

Strategy 1: Predictive Naming

Example: Redshift Lookup

# Can't use module.redshift[item].name because it creates a cycle
# CYCLO ERROR! comment in code
"redshift_data" = {
  for item in keys(var.redshift_databases) :
  item => join("-", [
    local.prefix,
    local.context.Workload,
    local.context.Application,
    item
  ])
}

Instead of referencing the module output (which creates a dependency), predict the name using the same naming convention.

Strategy 2: Two-Phase Deployment

From DEPLOY.md:

“First Terraform apply will fail on a few dependencies. Re-run to finalize.”

Some circular dependencies are resolved by applying twice:

  1. First apply creates base resources
  2. Some resources fail due to missing dependencies
  3. Second apply completes configuration
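In its simplest form this can be scripted as two consecutive applies (illustrative only; a real pipeline would gate the retry more carefully):

terraform apply -auto-approve || true   # first pass: a few resources fail on missing dependencies
terraform apply -auto-approve           # second pass: finalizes the remaining resources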

Strategy 3: Selective KMS Key Users

kms_data_key_users = compact(concat(
  ["arn:aws:iam::${account_id}:role/${var.role_prefix}-${local.prefix}-DpAdminRole"],
  [local.operatorrole_arn],
  local.transfer_roles,
  local.workflow_roles,
  # These would create cycles - commented out:
  # [for job in var.glue_jobs : "arn:aws:iam::..."],
  # [for function in var.lambda_functions : "arn:aws:iam::..."],
))

KMS key policies include predictable roles (admin, operator) but NOT Lambda/Glue roles to avoid cycles.

Strategy 4: Data Source Lookups (Cross-Workload)

When project workloads need resources from the default workload:

local.default_deploy = fileexists("${path.module}/managed_by_dp_default_s3_code.tf")

data "aws_kms_key" "kms_infrastructure" {
  count  = local.default_deploy ? 0 : 1
  key_id = "alias/${local.prefix}-${local.context.Workload}-kms-infra"
}

kms_infrastructure_key_arn = coalesce(
  module.kms_infrastructure.key_arn,         # If default deploy
  data.aws_kms_key.kms_infrastructure[0].arn # If project deploy
)

Project workloads use data sources to look up infrastructure key by predictable alias.


8. Real-World Example: Data Pipeline Workload

Scenario

Build a data pipeline that:

  1. Ingests raw CSV files from external S3 bucket
  2. Processes files with Lambda function
  3. Transforms data with Glue ETL job
  4. Stores in Redshift for analytics
  5. Shares Glue catalog with data governance account

Configuration (tfvars)

# Define S3 buckets
s3_buckets = {
  raw = {
    name   = "raw"
    backup = true
    lifecycle_rules = [{
      id              = "archive_old"
      transition_days = 90
      storage_class   = "GLACIER"
    }]
  }
  processed = {
    name                       = "processed"
    enable_intelligent_tiering = true
  }
}

# Upload Lambda code
s3_source_files = {
  processor_code = {
    source = "lambda_processor.zip"
    target = "lambda_functions/processor/code.zip"
  }
  glue_script = {
    source = "transform.py"
    target = "glue_jobs/transform/script.py"
  }
}

# Define secrets
secrets = {
  redshift_creds = {
    name = "redshift-credentials"
    secret_string = {
      username = "admin"
      password = "changeme" # Should use AWS Secrets Manager UI to set
    }
  }
}

# Define Glue database
glue_database = {
  analytics = {
    name                   = "analytics"
    bucket                 = "s3:processed"
    enable_lakeformation   = true
    share_cross_account_ro = ["datagovernance"]
  }
}

# Define Lambda processor
lambda_functions = {
  csv_processor = {
    name        = "csv-processor"
    description = "Processes incoming CSV files"
    handler     = "index.handler"
    runtime     = "python3.13"
    memory      = 2048
    timeout     = 900

    s3_sourcefile = "s3_file:processor_code"

    environment = {
      RAW_BUCKET       = "s3:raw"
      PROCESSED_BUCKET = "s3:processed"
      GLUE_DATABASE    = "gluedb:analytics"
    }
    permissions = {
      s3_read     = ["raw"]
      s3_write    = ["processed"]
      glue_update = ["analytics"]
    }

    # S3 trigger
    event_source_mapping = [{
      event_source_arn = "s3:raw"
      events           = ["s3:ObjectCreated:*"]
      filter_prefix    = "incoming/"
      filter_suffix    = ".csv"
    }]
  }
}

# Define Glue ETL job
glue_jobs = {
  transform = {
    name              = "data-transform"
    glue_version      = "4.0"
    worker_type       = "G.1X"
    number_of_workers = 5

    script_location = "s3_file:glue_script"

    arguments = {
      "--DATABASE"        = "gluedb:analytics"
      "--INPUT_BUCKET"    = "s3:processed"
      "--REDSHIFT_SECRET" = "secret:redshift_creds"
    }
    permissions = {
      s3_read     = ["processed"]
      glue_update = ["analytics"]
      secret_read = ["redshift_creds"]
      redshift    = ["analytics_cluster"]
    }

    # Scheduled trigger
    trigger_type = "SCHEDULED"
    schedule     = "cron(0 2 * * ? *)" # Daily at 2 AM
  }
}

# Define Redshift cluster
redshift_databases = {
  analytics_cluster = {
    name            = "analytics"
    node_type       = "dc2.large"
    number_of_nodes = 2
    master_username = "admin"
    secret_name     = "secret:redshift_creds"
    permissions = {
      glue_read = ["analytics"]
      s3_read   = ["processed"]
    }
  }
}

What Gets Created (40+ AWS Resources)

Infrastructure:

  • KMS data key for encryption
  • VPC security groups for Lambda/Glue
  • IAM roles (5): Lambda role, Glue role, Redshift role, Lake Formation role, Admin role
  • IAM policies (5): Auto-generated least-privilege policies
  • Permission boundaries (2): For Lambda and Glue roles

Storage:

  • S3 bucket: companyp-analytics-pipeline-raw
  • S3 bucket: companyp-analytics-pipeline-processed
  • S3 bucket policies (2)
  • S3 lifecycle rules
  • S3 intelligent tiering configuration

Compute:

  • Lambda function: companyp-analytics-pipeline-csv-processor
  • Lambda log group with 30-day retention
  • S3 event notification trigger
  • Glue job: companyp-analytics-pipeline-data-transform
  • Glue security configuration
  • Glue CloudWatch log group

Data Catalog:

  • Glue database: companyp-analytics-pipeline-analytics
  • Lake Formation permissions
  • Lake Formation resource link (cross-account share)
  • RAM resource share (for cross-account access)

Database:

  • Redshift cluster: companyp-analytics-pipeline-analytics
  • Redshift subnet group
  • Redshift parameter group
  • Redshift security group
  • Secrets Manager secret: companyp-analytics-pipeline-redshift-credentials
  • Secret rotation configuration

Monitoring:

  • CloudWatch alarms (6): Lambda errors, Glue job failures, S3 metrics
  • CloudWatch log groups (3)
  • EventBridge rule for Glue job schedule

All with:

  • Consistent naming
  • Full encryption (KMS)
  • Least-privilege IAM policies
  • Organizational tags
  • VPC isolation
  • CloudWatch logging

Total Configuration: ~150 lines of tfvars
Generated Terraform Code: ~2,000+ lines (via building blocks)
Boilerplate Reduction: ~93%

┌──────────────────────────────────┐
│ S3 Buckets │
│ ┌────────┐ ┌────────┐ │
│ │ raw │ │processed│ │
│ └───┬────┘ └────▲───┘ │
└──────┼────────────────┼──────────┘
│ │
S3 Event │ │
Trigger │ │ Writes
│ │
┌──────▼────────────────┴──────────┐
│ Lambda │
│ ┌─────────────────────┐ │
│ │ csv-processor │ │
│ └──────────┬──────────┘ │
└─────────────┼────────────────────┘
│ Updates
┌─────────────────────────▼───────────────────────┐
│ Glue │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Database: │◄───│ ETL Job: │ │
│ │ analytics │ │ transform │ │
│ └────────▲─────────┘ └────┬─────────────┘ │
└───────────┼──────────────────┼─────────────────┘
│ │
│ Queries │ Loads
│ │
┌───────────┴──────────────────▼─────────────────┐
│ Redshift │
│ ┌─────────────────────┐ │
│ │ Cluster: analytics │ │
│ └──────────┬──────────┘ │
└─────────────┼────────────────────────────────────┘
│ Reads
┌─────────────▼────────────────────────────────────┐
│ Secrets Manager │
│ ┌─────────────────────────┐ │
│ │ redshift-credentials │ │
│ └─────────────────────────┘ │
└──────────────────────────────────────────────────┘

TL;DR - Section 8: Real-world data pipeline example shows how 150 lines of tfvars configuration generates 40+ AWS resources (S3, Lambda, Glue, Redshift, KMS, IAM, CloudWatch). Smart lookups connect resources (s3:raw, secret:db_creds), building blocks auto-generate IAM policies, context system applies consistent naming/tagging, and KMS keys encrypt everything automatically. Achieves 93% boilerplate reduction vs traditional Terraform.


9. Cross-Account Architecture

Use Case: Multi-Account Data Mesh

Scenario: Analytics workload in Production account needs to:

  • Read S3 data from Development account
  • Query Glue tables from Staging account
  • Use KMS keys from Shared Services account

Configuration

Step 1: Define Cross-Account Aliases

cross_accounts = {
  dev     = "123456789012"
  staging = "234567890123"
  shared  = "345678901234"
}

Step 2: Define External S3 Buckets

lookup_ids = {
  xa_s3_bucket = {
    dev_raw           = "dev-shared-raw-data"
    staging_processed = "staging-shared-processed"
  }
}

Step 3: Use Cross-Account Lookups

lambda_functions = {
  cross_account_reader = {
    name = "reader"
    permissions = {
      # Read from external S3 buckets
      s3_read = ["dev_raw", "staging_processed"]
      # Query Glue tables in staging account
      glue_read = ["acct_staging_glue_tables"]
      # Use KMS keys in shared account
      kms = ["acct_shared_kms_all_keys"]
    }
  }
}

Generated IAM Policy

{
  "Statement": [
    {
      "Sid": "S3ReadCrossAccount",
      "Effect": "Allow",
      "Action": ["s3:GetObject*", "s3:GetBucket*", "s3:List*"],
      "Resource": [
        "arn:aws:s3:::dev-shared-raw-data",
        "arn:aws:s3:::dev-shared-raw-data/*",
        "arn:aws:s3:::staging-shared-processed",
        "arn:aws:s3:::staging-shared-processed/*"
      ]
    },
    {
      "Sid": "GlueReadCrossAccount",
      "Effect": "Allow",
      "Action": ["glue:GetTable", "glue:GetTables", "glue:GetDatabase"],
      "Resource": "arn:aws:glue:*:234567890123:table/*"
    },
    {
      "Sid": "KMSCrossAccount",
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:DescribeKey"],
      "Resource": "arn:aws:kms:eu-central-1:345678901234:key/*"
    }
  ]
}

Benefits:

  • Developers don’t need to know account IDs
  • Cross-account permissions follow same pattern as same-account
  • Centralized account alias management
  • Type-safe (Terraform validates references at plan time)
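One plausible way the acct_<alias>_* aliases could be derived (hypothetical local names; the actual framework may structure this differently) is from the cross_accounts map, merged into the per-type permission lookups:

locals {
  # Account-wide ARNs keyed by alias, so "acct_staging_glue_tables" or
  # "acct_shared_kms_all_keys" resolve just like same-account names
  xa_lookup_arns = {
    glue_read = merge([
      for alias, id in var.cross_accounts :
      { "acct_${alias}_glue_tables" = "arn:aws:glue:*:${id}:table/*" }
    ]...)
    kms = merge([
      for alias, id in var.cross_accounts :
      { "acct_${alias}_kms_all_keys" = "arn:aws:kms:*:${id}:key/*" }
    ]...)
  }
}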

10. Deployment Workflow

Repository Structure

Blueprint Repository (Central):

terraform-platform-blueprint/
├── tf-common/ # Shared foundation
├── tf-default/ # Account-level resources
├── tf-project/ # Application resources
├── examples/
│ ├── full_test/ # Complete example
│ └── simple_example/ # Minimal example
└── tools/
└── repo_updater.py # Syncs blueprint to user repos

User Repository (Team-Owned):

team-analytics/
├── terraform/
│ ├── dev/
│ │ ├── tags.tf # Team owns
│ │ ├── _default.auto.tfvars # Team owns
│ │ ├── _project.auto.tfvars # Team owns
│ │ ├── managed_by_dp_common_*.tf # Synced from blueprint
│ │ ├── managed_by_dp_default_*.tf # Synced from blueprint
│ │ └── managed_by_dp_project_*.tf # Synced from blueprint
│ ├── staging/
│ └── production/
└── .github/
└── workflows/
└── terraform.yml

Workflow Steps

Step 1: Team Creates Configuration

Teams edit only their own files:

  • tags.tf - Defines environment, workload, application
  • _default.auto.tfvars - Account-level config (if first workload)
  • _project.auto.tfvars - Application resources

Step 2: Platform Team Updates Blueprint

When blueprint code needs updating:

# In blueprint repo
cd tools
python repo_updater.py --target ../../../team-analytics/terraform/dev

This syncs all managed_by_dp_*.tf files from blueprint to team repo.

Step 3: Team Commits and Pushes

git add .
git commit -m "feat: add data processing pipeline"
git push origin feature/data-pipeline

Step 4: Terraform Cloud Runs

GitHub Action triggers Terraform Cloud:

  1. Workspace detects VCS change
  2. Runs terraform plan
  3. Shows plan in pull request comment
  4. Team reviews and approves
  5. Merges PR
  6. Terraform Cloud runs terraform apply

Step 5: Resources Created

All AWS resources created with:

  • Standardized naming
  • Automatic IAM policies
  • Full encryption
  • Organizational tags
  • CloudWatch monitoring

No Preprocessing Required

This workflow uses standard Terraform:

  • No build step before terraform plan
  • No code generation at runtime
  • No wrapper scripts
  • Native .tfvars files
  • Standard state management
  • Compatible with Terraform Cloud, Enterprise, or OSS
Platform Blueprint repo_updater.py Team Repos Terraform Application
Team Repo (50+) Cloud Team
│ │ │ │ │ │
│ Update │ │ │ │ │
│ building │ │ │ │ │
│ blocks │ │ │ │ │
├───────────>│ │ │ │ │
│ │ │ │ │ │
│ git commit │ │ │ │ │
│ & push │ │ │ │ │
├───────────>│ │ │ │ │
│ │ │ │ │ │
│ Run │ │ │ │ │
│ --update- │ │ │ │ │
│ all-teams │ │ │ │ │
├────────────┼─────────────>│ │ │ │
│ │ │ Generate 50 PRs│ │ │
│ │ │ (update │ │ │
│ │ │ managed_by_dp) │ │ │
│ │ ├───────────────>│ │ │
│ │ │ │ PR triggers│ │
│ │ │ │ terraform │ │
│ │ │ │ plan │ │
│ │ │ ├──────────>│ │
│ │ │ │ │ │
│ │ │ │ Post plan │ │
│ │ │ │ as PR │ │
│ │ │ │ comment │ │
│ │ │ │<──────────┤ │
│ │ │ │ │ │
│ │ │ │ │ Review plan│
│ │ │ │<──────────────────────┤
│ │ │ │ │ │
│ │ │ │ Approve & │ │
│ │ │ │ merge PR │ │
│ │ │ │<──────────────────────┤
│ │ │ │ │ │
│ │ │ │ Merge │ │
│ │ │ │ triggers │ │
│ │ │ │ terraform │ │
│ │ │ │ apply │ │
│ │ │ ├──────────>│ │
│ │ │ │ │ │
│ │ │ │ Deploy │ │
│ │ │ │ updated │ │
│ │ │ │ resources │ │
│ │ │ │ │ │

11. Comparison with Other Approaches

vs. Standard Terraform

| Aspect | Standard Terraform | This Framework |
| --- | --- | --- |
| ARN Management | Manual ARN strings | Smart lookups (s3:bucket) |
| IAM Policies | Write JSON/HCL policy documents | Auto-generated from permissions map |
| Naming | Manually ensure consistency | Automatic standardized naming |
| Standards | Manually enforce | Building blocks enforce automatically |
| Cross-references | Direct resource dependencies | Lookup tables (reduces coupling) |
| Boilerplate | High (1000+ lines typical) | Low (150 lines typical), ~85% reduction |
| Learning Curve | Steep (requires AWS expertise) | Moderate (config-focused) |

vs. Terragrunt

| Aspect | Terragrunt | This Framework |
| --- | --- | --- |
| Preprocessing | Required (terragrunt run) | None (native Terraform) |
| State Management | Separate tool | Native Terraform |
| Compatibility | Wrapper tool required | Standard terraform CLI |
| DRY Approach | File includes & remote state | Lookup tables & modules |
| Complexity | Additional tool layer | Pure Terraform |
| IDE Support | Limited (custom syntax) | Full (standard HCL) |

vs. Terraspace

| Aspect | Terraspace | This Framework |
| --- | --- | --- |
| Language | Ruby DSL + ERB templates | Pure HCL |
| Preprocessing | Required (terraspace build) | None |
| Runtime | Ruby interpreter needed | Native Terraform only |
| Configuration | ERB templating | Native tfvars |
| Tooling | Additional CLI wrapper | Standard Terraform CLI |
| Learning Curve | Learn Ruby + Terraspace | Learn framework conventions |

vs. Terraform CDK

| Aspect | Terraform CDK | This Framework |
| --- | --- | --- |
| Language | TypeScript/Python/Java/C#/Go | Pure HCL |
| Compilation | Required (cdktf synth) | None |
| Runtime | Node.js/Python runtime | Native Terraform only |
| Configuration | Imperative code | Declarative tfvars |
| State Inspection | Via generated JSON | Native Terraform state |
| IDE Support | Language-specific | Terraform-specific |

Key Advantages of This Approach

  1. No External Dependencies: Pure Terraform, no additional tools
  2. Native Workflows: Works with Terraform Cloud, Enterprise, OSS
  3. Type Safety: Terraform validates references at plan time
  4. Version Control: Standard .tfvars files, readable diffs
  5. IDE Support: Full support from Terraform plugins
  6. Learning Curve: Lower (no new language/tool to learn)
  7. Portability: Standard Terraform state, no lock-in
  8. Debugging: Standard Terraform error messages and plan output
┌─────────────────────────┐
│ Terraform Approaches │
└────────────┬────────────┘
┌───────────┬───────────┼───────────┬───────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐
│ Standard │ │Terragrunt│ │Terraspace│ │Terraform │ │ This │
│ Terraform │ │ │ │ │ │ CDK │ │ Framework │
└─────┬──────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └──────┬───────┘
│ │ │ │ │
│Manual ARNs │Wrapper │Ruby DSL │TypeScript/ │Pure HCL
│High │tool │ERB │Python │Smart
│boilerplate │Preprocessing│templates │Compilation │lookups
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐
│ 1000+ │ │terragrunt│ │terraspace│ │ cdktf │ │ 150 │
│ lines/ │ │ run │ │ build │ │ synth │ │ lines/ │
│ workload │ │ required │ │ required │ │ required │ │ workload │
│ │ │ │ │ │ │ │ │ ✓ │
└─────────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────────┘

TL;DR - Section 11: This framework beats alternatives by using pure Terraform with zero preprocessing. Standard Terraform requires manual ARN management (1000+ lines). Terragrunt/Terraspace/CDK add preprocessing layers (wrapper tools, Ruby runtime, Node.js compilation). This approach achieves 85% boilerplate reduction through smart lookups and building blocks while maintaining full Terraform Cloud compatibility and native workflows.


12. Lessons Learned and Best Practices

What Worked Well

1. Colon Syntax is Intuitive

Developers adopted s3:bucket_name syntax immediately. It reads like natural configuration.

2. Building Blocks Enforce Standards

Opinionated modules ensure consistency without policing. Teams can’t accidentally create non-compliant resources.

3. Separation of Concerns

Platform team manages managed_by_dp_*.tf files, teams manage *.tfvars files. Clear ownership boundaries.

4. Lookup Tables Reduce Coupling

Resources don’t directly reference each other, reducing cascade changes when refactoring.

5. Predictive Naming Solves Most Circular Dependencies

Most cross-resource references can use naming conventions instead of module outputs.

Challenges and Solutions

Challenge 1: Circular Dependencies

Some resource relationships create cycles that Terraform can’t resolve.

Solutions:

  • Use predictive naming instead of module outputs
  • Two-phase deployment (apply twice)
  • Selective resource inclusion in policies
  • Data sources for cross-workload lookups

Challenge 2: Lookup Complexity

Lookup tables can become large and hard to maintain.

Solutions:

  • Organized into logical groups (lookup_perm_lambda, lookup_id_base)
  • Inline comments documenting purpose
  • Automated generation via for expressions
  • Cross-account lookups separated into _xa maps

Challenge 3: Building Block Versioning

Updating building block versions across many teams is coordination-heavy.

Solutions:

  • Semantic versioning with ~> constraints
  • Deprecation warnings for old versions
  • Automated testing of building block changes
  • Communication channel for breaking changes

Challenge 4: Developer Onboarding

New developers need to learn lookup syntax and conventions.

Solutions:

  • Comprehensive examples in blueprint repo
  • Detailed README with common patterns
  • IntelliSense/autocomplete via Terraform language server
  • Helper scripts to validate tfvars before commit
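For example, a lightweight pre-commit check could be as small as the following (illustrative; any wrapper script serves the same purpose):

terraform init -backend=false -input=false   # no state access needed for validation
terraform fmt -check -recursive              # formatting of *.tf and *.tfvars files
terraform validate                           # catches syntax and type errors before a plan runs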

Best Practices

1. Use Descriptive Resource Keys

# Good
s3_buckets = {
  raw_customer_data   = { ... }
  processed_analytics = { ... }
}

# Bad
s3_buckets = {
  bucket1 = { ... }
  bucket2 = { ... }
}

2. Group Related Resources

# Process: S3 → Lambda → Glue → Redshift
s3_buckets = { raw = {...}, processed = {...} }
lambda_functions = { processor = {...} }
glue_jobs = { transform = {...} }
redshift_databases = { analytics = {...} }

3. Use Comments to Document Intent

# Data pipeline for customer analytics
# Flow: External API → raw bucket → Lambda → processed bucket → Glue → Redshift
lambda_functions = {
api_ingestion = { ... }
}

4. Leverage Type Inference

# Instead of:
permissions = {
s3_read = ["s3_read:raw"]
}
# Prefer (type inferred from key):
permissions = {
s3_read = ["raw"]
}

5. Test in Lower Environments First

dev → staging → production

Use identical tfvars across environments, only changing tags.tf (environment name).

6. Version Pin Building Blocks

# Use pessimistic constraint
source = "app.terraform.io/org/buildingblock-lambda/aws"
version = "~> 3.2.0" # Allows 3.2.x, not 3.3.0

7. Document Cross-Account Access

# Cross-account: Read from Data Lake account
cross_accounts = {
datalake = "123456789012" # Managed by Data Lake team
}

13. Impact and Metrics

Development Velocity Improvements

Before This Framework:

  • ~1000 lines of Terraform per workload
  • 2-3 weeks to onboard new team
  • 5+ days to add new resource type
  • Frequent IAM permission errors
  • Inconsistent naming across teams
  • Manual policy review process

After This Framework:

  • ~150 lines of tfvars per workload (85% reduction)
  • 2-3 days to onboard new team
  • 1 day to add new resource type
  • Rare IAM errors (auto-generated policies)
  • Consistent naming (automatic)
  • Automated policy compliance

Code Quality Improvements

Reduction in Boilerplate:

Traditional approach (S3 + Lambda with IAM):

# ~250 lines for: S3 bucket, IAM role, IAM policy document,
# Lambda function, CloudWatch log group, etc.

This framework (same resources):

# ~30 lines of tfvars
s3_buckets = { data = { name = "data" } }

lambda_functions = {
  processor = {
    name        = "processor"
    permissions = { s3_read = ["data"] }
  }
}

Boilerplate Reduction: ~88%

Governance and Compliance

Automatic Enforcement:

  • 100% of resources use standardized naming
  • 100% of resources encrypted with KMS
  • 100% of resources tagged per policy
  • 100% of IAM policies include permission boundaries
  • 100% of Lambda functions in VPC
  • 0 manual policy reviews required
Before Framework After Framework
┌────────────────────────────┐ ┌────────────────────────────┐
│ │ │ │
│ • 1000+ lines Terraform │─────>│ • 150 lines tfvars │
│ │ │ (85% reduction) │
│ │ │ │
└────────────────────────────┘ └────────────────────────────┘
┌────────────────────────────┐ ┌────────────────────────────┐
│ │ │ │
│ • 2-3 weeks onboarding │─────>│ • 2-3 days onboarding │
│ │ │ (5x faster) │
│ │ │ │
└────────────────────────────┘ └────────────────────────────┘
┌────────────────────────────┐ ┌────────────────────────────┐
│ │ │ │
│ • Manual IAM policies │─────>│ • Auto-generated IAM │
│ │ │ (Rare errors) │
│ │ │ │
└────────────────────────────┘ └────────────────────────────┘
┌────────────────────────────┐ ┌────────────────────────────┐
│ │ │ │
│ • Inconsistent naming │─────>│ • 100% consistent │
│ │ │ (Automatic compliance) │
│ │ │ │
└────────────────────────────┘ └────────────────────────────┘

TL;DR - Section 13: Framework delivers measurable improvements: 85% boilerplate reduction (1000→150 lines), 5x faster team onboarding (weeks→days), rare IAM errors (auto-generated policies), and 100% compliance (automatic naming, tagging, encryption, permission boundaries). Every resource is encrypted with KMS, tagged per policy, and uses least-privilege IAM—all enforced by building blocks with zero manual reviews.


14. Future Enhancements

Planned Features

1. Multi-Region Support

Enable workloads spanning multiple AWS regions:

regions = ["eu-central-1", "us-east-1"]
s3_buckets = {
  replicated_data = {
    name                = "data"
    replication_regions = ["us-east-1"]
  }
}

2. Enhanced Lookup Syntax

Support nested lookups:

environment = {
  BUCKET_PATH  = "s3:mybucket:/path/prefix"
  TABLE_COLUMN = "dynamodb:mytable:attribute:id"
}
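
One way this could be resolved with the existing native approach is to keep the first two segments as the lookup key and treat everything after the second colon as a literal suffix. A rough sketch against the lookup maps from Appendix A (not yet implemented):

locals {
  # Hypothetical resolution of "s3:mybucket:/path/prefix"
  raw   = "s3:mybucket:/path/prefix"
  parts = split(":", local.raw)
  # Resolve "s3" + "mybucket" via the existing ID lookup, re-attach the remaining suffix
  resolved = "${local.lookup_id_base[local.parts[0]][local.parts[1]]}${join(":", slice(local.parts, 2, length(local.parts)))}"
}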

3. Building Block Customization

Allow team-specific overrides while maintaining compliance:

s3_buckets = {
  special = {
    name = "special"
    override_defaults = {
      versioning_enabled = false # Team takes responsibility
    }
  }
}
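
Inside a building block this could be implemented as a simple merge, so teams override only the keys they explicitly accept responsibility for. A sketch with an assumed variable name:

variable "override_defaults" {
  type    = map(any)
  default = {} # empty means: platform defaults apply unchanged
}

locals {
  platform_defaults  = { versioning_enabled = true }
  # Team overrides win only for keys they set; everything else stays compliant
  effective_settings = merge(local.platform_defaults, var.override_defaults)
}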

4. Cost Estimation

Integrate with AWS Pricing API to estimate costs before apply:

# In plan output:
# Estimated monthly cost: $1,234.56
# - Lambda: $123.45
# - S3: $456.78
# - Redshift: $654.33

5. Dependency Visualization

Generate visual dependency graphs from lookup tables:

S3:raw → Lambda:processor → S3:processed → Glue:transform → Redshift:analytics

Potential Improvements

1. Resolve Two-Phase Deployment

Investigate Terraform’s -target flag or module dependencies to eliminate the “apply twice” requirement.
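
If the -target route pans out, the first pass could be scoped to just the resources the lookup maps reference, followed by a single full apply. A rough sketch with illustrative module addresses:

# First pass: create only the resources the lookups depend on
terraform apply -target=module.s3_buckets -target=module.secrets
# Second pass: full apply, now that lookups resolve
terraform apply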

2. Building Block Catalog

Create searchable catalog of building blocks with examples:

  • Searchable by AWS service
  • Filterable by capability (encryption, backups, monitoring)
  • Includes terraform-docs generated documentation
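
The documentation part is easy to automate today; terraform-docs can generate the reference for each building block (paths are illustrative):

terraform-docs markdown table ./modules/buildingblock-s3 > ./modules/buildingblock-s3/README.md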

3. Policy Simulation

Pre-validate IAM policies using AWS IAM Policy Simulator before apply:

terraform plan | policy-simulator --validate

4. Drift Detection

Automated detection of resources changed or created outside Terraform:

terraform-drift-detector --alert slack://channel

15. Conclusion

Summary

We’ve built a Native Terraform IaC Framework that achieves the developer experience of high-level abstractions while maintaining 100% compatibility with standard Terraform workflows. The key innovations are:

  1. Smart Lookup Syntax: Colon-separated references (s3:bucket, lambda:function) resolved via native Terraform expressions
  2. Building Block Abstraction: Opinionated modules that enforce standards and generate IAM policies automatically
  3. Zero Preprocessing: Pure Terraform - works with Terraform Cloud, CLI, and all standard tooling
  4. Clear Separation: Platform team manages code, application teams manage configuration
  5. Context Propagation: Naming and tagging enforced automatically via context system

Why This Matters

For Platform Engineers:

  • Enforce organizational standards without restricting teams
  • Reduce support burden (teams self-service)
  • Centralized updates via building blocks
  • Scalable to hundreds of workloads

For Application Teams:

  • Write configuration, not code
  • No AWS expertise required
  • Fast onboarding (days, not weeks)
  • Focus on business logic, not infrastructure

For Organizations:

  • Consistent security posture
  • Automated compliance
  • Cost visibility via standardized tagging
  • Reduced risk (guardrails prevent misconfigurations)

Key Takeaways

  1. Native Terraform is Powerful: With creative use of locals and lookups, you can build sophisticated abstractions without preprocessing

  2. Configuration Over Code: Separating what (tfvars) from how (modules) reduces complexity

  3. Building Blocks Scale: Opinionated modules enable governance at scale

  4. Developer Experience Matters: Investment in ergonomics pays dividends in velocity and adoption

  5. Standards Enable Freedom: Guardrails paradoxically enable teams to move faster

┌─────────────────────────────────┐
│ Native Terraform Framework │
└──────────────┬──────────────────┘
┌───────────────────────┼───────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Smart Lookups │ │ Building Blocks │ │ Separation of │
│ │ │ │ │ Code & Config │
└───────┬───────┘ └────────┬─────────┘ └────────┬─────────┘
│ │ │
│ ┌───────▼────────┐ │
│ │Context │ │
│ │Propagation │ │
│ └───────┬────────┘ │
│ │ │
└────────────────────┼───────────────────────┘
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌────────────────┐ ┌─────────────────┐ ┌──────────────────┐
│ 85% Boilerplate│ │ Zero │ │ Automated │
│ Reduction │ │ Preprocessing │ │ Updates at Scale │
└────────┬───────┘ └────────┬────────┘ └─────────┬────────┘
│ │ │
│ ┌───────▼────────┐ │
│ │ 100% │ │
│ │ Compliance │ │
│ └───────┬────────┘ │
│ │ │
└───────────────────┼──────────────────────┘
┌──────────────────────┐
│ 50+ Teams Can │
│ Self-Service │
│ Infrastructure │
└──────────────────────┘

TL;DR - Conclusion: This native Terraform framework proves that developer-friendly IaC doesn’t require preprocessing or external tools. By combining smart lookups (s3:bucket), opinionated building blocks, configuration/code separation, and context propagation, we achieve 85% boilerplate reduction while maintaining full Terraform Cloud compatibility. Platform teams scale updates via automated PRs, application teams self-service via simple tfvars, and organizations get automatic compliance. Native Terraform can be elegant, scalable, and secure.


16. Getting Started Guide

For teams interested in adopting this approach:

Step 1: Assess Your Needs

Good fit if:

  • Multiple teams deploying similar infrastructure
  • Need to enforce organizational standards
  • Want to reduce the AWS expertise required of application teams
  • High volume of infrastructure deployments

Not a good fit if:

  • Small team (1-2 people) with custom requirements
  • Infrastructure is highly heterogeneous
  • Team prefers low abstraction level

Step 2: Start Small

Begin with a pilot:

  1. Choose one AWS service (e.g., S3)
  2. Build an opinionated building block module
  3. Create lookup mechanism for that service
  4. Test with one team
  5. Iterate based on feedback

Step 3: Build Your Building Blocks

For each AWS service:

  1. Define organizational standards (naming, tagging, encryption)
  2. Create Terraform module enforcing standards
  3. Add permission generation logic
  4. Version and publish to private registry
  5. Write documentation and examples
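
As a starting point, a building block can be little more than a wrapper that enforces naming, tagging, and encryption. A minimal, hypothetical S3 example (module layout and variable names are assumptions, following the conventions used elsewhere in this post):

# Minimal building block sketch (hypothetical)
variable "prefix" {
  type = string # naming prefix injected by the platform wrapper
}
variable "context" {
  type = map(string) # org-wide tags (owner, cost center, environment)
}
variable "name" {
  type = string
}
variable "kms_key_arn" {
  type = string # platform-managed KMS key
}

locals {
  bucket_name = "${var.prefix}-${var.name}" # standardized naming, enforced
}

resource "aws_s3_bucket" "this" {
  bucket = local.bucket_name
  tags   = var.context
}

# Encryption is not optional: every bucket uses the platform KMS key
resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.kms_key_arn
    }
  }
}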

Step 4: Create Lookup System

  1. Define lookup syntax (e.g., type:name)
  2. Create lookup locals maps
  3. Add resolution logic to building blocks
  4. Test cross-resource references
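
A first iteration of the lookup system can be a single locals block mapping type to name to attribute (the full version is in Appendix A):

locals {
  # Minimal lookup map: "s3:<key>" resolves to the matching bucket's ID
  lookup_ids = {
    s3 = { for k in keys(var.s3_buckets) : k => module.s3_buckets[k].id }
  }
}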

Step 5: Document and Socialize

  1. Write comprehensive README
  2. Create example projects
  3. Run training sessions
  4. Set up support channel
  5. Gather feedback and iterate

Step 6: Scale

  1. Add more building blocks incrementally
  2. Onboard teams progressively
  3. Monitor usage and pain points
  4. Continuously improve based on feedback

Appendix: Code Samples

A. Lookup Table Implementation

File: tf-project/terraform/managed_by_dp_project_lookup.tf

locals {
  # Build base lookup maps for ARNs (used in IAM policies)
  lookup_arn_base = merge(var.lookup_arns, {
    "kms" = {
      "kms_data"  = local.kms_data_key_arn
      "kms_infra" = local.kms_infrastructure_key_arn
    }
    "s3_read"       = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].arn }
    "s3_write"      = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].arn }
    "gluejob"       = { for item in keys(var.glue_jobs) : item => module.glue_jobs[item].arn }
    "gluedb"        = { for item in keys(var.glue_database) : item => module.glue_databases[item].name }
    "secret_read"   = { for item in keys(var.secrets) : item => module.secrets[item].arn }
    "dynamodb_read" = { for item in keys(var.dynamodb_databases) : item => module.dynamodb[item].arn }
  })

  # Build base lookup maps for IDs (used in environment variables)
  lookup_id_base = merge(var.lookup_ids, {
    "s3"       = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].id }
    "secret"   = { for item in keys(var.secrets) : item => module.secrets[item].id }
    "dynamodb" = { for item in keys(var.dynamodb_databases) : item => module.dynamodb[item].name }
    "athena"   = { for item in keys(var.athena_workgroups) : item => module.athena[item].name }
  })

  # Specialized lookup for Lambda permissions
  lookup_perm_lambda = merge(
    local.lookup_arn_base,
    local.lookup_perm_lambda_xa, # Cross-account additions
    {
      "sqs_read" = { for item in keys(var.sqs_queues) : item => module.sqs[item].queue_arn }
      "sqs_send" = { for item in keys(var.sqs_queues) : item => module.sqs[item].queue_arn }
      "sns_pub"  = { for item in keys(var.sns_topics) : item => module.sns[item].topic_arn }
    }
  )

  # Specialized lookup for Lambda environment variables
  lookup_id_lambda = merge(
    local.lookup_id_base,
    {
      "sqs" = { for item in keys(var.sqs_queues) : item => module.sqs[item].queue_url }
      "sns" = { for item in keys(var.sns_topics) : item => module.sns[item].topic_arn }
    }
  )
}

B. Lambda Building Block Usage

File: tf-project/terraform/managed_by_dp_project_lambda.tf

module "lambda" {
source = "app.terraform.io/org/buildingblock-lambda/aws"
version = "3.2.0"
for_each = var.lambda_functions
# Standard fields
prefix = local.prefix
context = local.context
name = try(each.value.name, each.key)
# Environment variables with smart lookup
environments = {
for type, item in try(each.value.environment, {}) : type =>
try(
# Try to resolve as "type:name" lookup
local.lookup_id_lambda[split(":", item)[0]][split(":", item)[1]],
item # Fallback to literal value
)
}
# Permissions with smart lookup and automatic policy generation
permissions = {
for type, items in try(each.value.permissions, {}) : type => [
for item in items :
(
# Check if it's namespaced format "type:name"
length(split(":", item)) == 2
? try(
local.lookup_perm_lambda[split(":", item)[0]][split(":", item)[1]],
item
)
: try(
# Infer type from permission category key
local.lookup_perm_lambda[type][item],
item
)
)
]
}
# Create IAM role and policy automatically
create_policy = true
# Injected infrastructure details
kms_key_arn = local.kms_data_key_arn
subnet_ids = local.subnet_ids
vpc_id = local.vpc_id
# User-provided configuration
handler = each.value.handler
runtime = each.value.runtime
memory = try(each.value.memory, 512)
timeout = try(each.value.timeout, 300)
description = try(each.value.description, "")
# Resolve S3 source file location
s3_bucket = local.code_bucket
s3_key = split(":", each.value.s3_sourcefile)[0] == "s3_file"
? try(
local.s3_target_path[split(":", each.value.s3_sourcefile)[1]],
each.value.s3_sourcefile
)
: each.value.s3_sourcefile
}

C. Example Workload Configuration

File: examples/full_test/_project.auto.tfvars

# S3 Buckets
s3_buckets = {
  raw_data = {
    name   = "raw"
    backup = true
    lifecycle_rules = [{
      id              = "archive_old_data"
      transition_days = 90
      storage_class   = "GLACIER"
    }]
  }
  processed_data = {
    name                            = "processed"
    enable_intelligent_tiering      = true
    enable_eventbridge_notification = true
  }
}

# Upload code artifacts
s3_source_files = {
  processor_code = {
    source = "lambda_processor.zip"
    target = "lambda_functions/processor/code.zip"
  }
  transform_script = {
    source = "glue_transform.py"
    target = "glue_jobs/transform/script.py"
  }
}

# Secrets
secrets = {
  database_creds = {
    name = "db-credentials"
    secret_string = {
      username = "admin"
      password = "" # Set via AWS Console
    }
  }
}

# Glue Database
glue_database = {
  analytics = {
    name                   = "analytics"
    bucket                 = "s3:processed_data"
    enable_lakeformation   = true
    share_cross_account_ro = ["datagovernance"]
  }
}

# Lambda Function
lambda_functions = {
  data_processor = {
    name          = "processor"
    description   = "Processes incoming data files"
    handler       = "index.handler"
    runtime       = "python3.13"
    memory        = 2048
    timeout       = 900
    in_vpc        = true
    s3_sourcefile = "s3_file:processor_code"
    environment = {
      RAW_BUCKET       = "s3:raw_data"
      PROCESSED_BUCKET = "s3:processed_data"
      GLUE_DATABASE    = "gluedb:analytics"
      DB_SECRET        = "secret:database_creds"
      LOG_LEVEL        = "INFO"
    }
    permissions = {
      s3_read     = ["raw_data"]
      s3_write    = ["processed_data"]
      glue_update = ["analytics"]
      secret_read = ["database_creds"]
    }
    event_source_mapping = [{
      event_source_arn = "s3:raw_data"
      events           = ["s3:ObjectCreated:*"]
      filter_prefix    = "incoming/"
      filter_suffix    = ".csv"
    }]
  }
}

# Glue ETL Job
glue_jobs = {
  data_transform = {
    name              = "transform"
    description       = "Transforms processed data"
    glue_version      = "4.0"
    worker_type       = "G.1X"
    number_of_workers = 5
    max_retries       = 2
    timeout           = 120
    script_location   = "s3_file:transform_script"
    arguments = {
      "--job-language"                     = "python"
      "--enable-metrics"                   = "true"
      "--enable-continuous-cloudwatch-log" = "true"
      "--DATABASE"                         = "gluedb:analytics"
      "--INPUT_BUCKET"                     = "s3:processed_data"
      "--DB_SECRET"                        = "secret:database_creds"
    }
    permissions = {
      s3_read     = ["processed_data"]
      glue_update = ["analytics"]
      secret_read = ["database_creds"]
    }
    trigger_type = "SCHEDULED"
    schedule     = "cron(0 2 * * ? *)" # Daily at 2 AM UTC
  }
}

Final Thoughts

This framework demonstrates that native Terraform can be elegant and developer-friendly without sacrificing power or flexibility. By creatively combining Terraform’s built-in features (for expressions, the try() and split() functions, and locals), we’ve built a system that:

  • Feels like configuration (simple tfvars files)
  • Works like Terraform (native tooling, no preprocessing)
  • Scales like a platform (hundreds of workloads, multiple teams)
  • Governs like policy (automatic enforcement, no manual reviews)

The journey from verbose, error-prone Terraform code to concise, validated configuration files represents a significant step forward in Infrastructure as Code maturity. Most importantly, it’s achieved through native Terraform capabilities, ensuring long-term compatibility and eliminating external dependencies.

As organizations scale their cloud infrastructure, frameworks like this become essential for maintaining velocity, consistency, and security. The patterns demonstrated here can be adapted to any cloud provider, resource type, or organizational requirement; the principles of smart lookups, building block abstraction, and configuration separation are universally applicable.

The future of Infrastructure as Code is declarative, native, and developer-friendly. This framework is a blueprint for getting there.


Acknowledgments

This framework was built by collaborative iteration between platform engineers and application teams, learning from real-world challenges and continuously refining the developer experience. Special recognition to the teams who adopted early versions, provided feedback, and helped shape the patterns documented here.