
Advanced Terraform: Patterns for Teams at Scale

by Gary Worthington, More Than Monkeys

Terraform feels simple at first. You write a few .tf files, run terraform apply, and you’ve got infrastructure. But once you introduce multiple teams, multiple AWS accounts, and pipelines deploying continuously, the complexity creeps in.

At that point, the challenge isn’t just “how do I declare resources?”. It’s “how do I make Terraform predictable, safe, and maintainable when dozens of people and pipelines are touching it at once?”

This post goes through advanced Terraform patterns I use in production, with the what, the why, and the traps to avoid.

1. Providers at Scale: Aliases and Cross-Account Deployments

(edited: thanks to mbarr in the comments for pointing out aws v6 provider allows for region args for certain resources)

A single provider block is fine for simple setups, but most real systems span multiple regions or even multiple AWS accounts. For example, you might run an application in eu-west-1 while sending logs to a centralised bucket in us-east-1, or keep networking in a shared services account while apps run elsewhere.

In AWS provider v6, many resources now accept a top-level region argument. That means you can target multiple regions within the same account using a single provider, without introducing extra complexity:

provider "aws" {
  region = "eu-west-1" # default
}

resource "aws_s3_bucket" "home" {
  bucket = "my-logs-eu-west-1"
}

resource "aws_s3_bucket" "dr" {
  bucket = "my-logs-eu-central-1"
  region = "eu-central-1"
}

This reduces the need for provider duplication when your only concern is geography.
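As a minimal sketch of the same idea, assuming AWS provider v6 and a resource type that supports the region argument, a single resource block can fan out across regions with for_each. The bucket name prefix and the set of regions are illustrative only.

```hcl
# Hypothetical example: one bucket per region from a single default provider.
resource "aws_s3_bucket" "audit" {
  for_each = toset(["eu-west-1", "eu-central-1"])

  bucket = "my-audit-logs-${each.key}"
  region = each.key # per-resource region override (AWS provider v6)
}
```

Each bucket is addressed by its region, for example aws_s3_bucket.audit["eu-central-1"], so adding a region later does not disturb the existing ones.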

But aliases are still essential. An alias is a named variant of a provider configuration. You define it with alias = "name", and then reference it on resources or when passing providers into modules. Aliases let you create multiple independent configurations of the same provider—for example, one per account, or one with a different authentication method.

provider "aws" {
  alias  = "prod"
  region = "eu-west-1"

  assume_role {
    role_arn = "arn:aws:iam::222222222222:role/prod"
  }
}

resource "aws_s3_bucket" "logs_prod" {
  provider = aws.prod
  bucket   = "centralised-logs"
}

When working at scale:

  • Use the per-resource region argument when targeting multiple regions in the same account.
  • Use aliases for cross-account deployments, global services (IAM, Route 53, CloudFront), or whenever you need different provider-level settings (default tags, custom endpoints, retry configs).
  • Always pass providers explicitly into modules (providers = { aws = aws, aws.prod = aws.prod }) to avoid child modules accidentally picking up the wrong one.

Aliases make intent explicit: you can see at a glance which account or context a resource belongs to. Without them, it’s easy to create resources in the wrong place, a mistake that often isn’t caught until production.
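The explicit provider passing mentioned in the list above could look roughly like this. The module path ./modules/logging and the module name are hypothetical. The child module declares which aliased configurations it expects via configuration_aliases, and the root module hands them over by name.

```hcl
# modules/logging/versions.tf (hypothetical child module)
terraform {
  required_providers {
    aws = {
      source                = "hashicorp/aws"
      configuration_aliases = [aws.prod] # the caller must supply aws.prod
    }
  }
}

# Root module: pass both configurations explicitly
module "logging" {
  source = "./modules/logging"

  providers = {
    aws      = aws
    aws.prod = aws.prod
  }
}
```

Inside the child module, resources then opt in with provider = aws.prod exactly as in the root, and Terraform reports an error if the caller forgets to supply the alias.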

2. Secrets and Sensitive Data

(edited: thanks to mbarr in the comments for pointing out that as of version 1.10, terraform introduced ephemeral values.)

Managing secrets has always been one of Terraform’s awkward edges. By default, arguments like password or secret_string get stored in both plan and state files. That makes rotation and compliance harder than it should be.

Terraform 1.10 introduced ephemeral values: ephemeral input variables, ephemeral outputs, and ephemeral resources, all of which exist only during a run. Terraform 1.11 extended this with write-only arguments on managed resources. These accept a value at apply time but never persist it into state. Together, these features give us a way to inject and consume secrets without leaving them lying around in plain text.

Pattern: generate → store (optional) → consume (write-only)

# 1) Create a temporary password (ephemeral resource)
ephemeral "random_password" "db" {
  length           = 20
  override_special = "!#$%&*()-_=+[]{}<>:?"
}

# 2) (Optional) Persist it securely without exposing state
resource "aws_secretsmanager_secret" "db_pw" {
  name = "db_password"
}

resource "aws_secretsmanager_secret_version" "db_pw" {
  secret_id                = aws_secretsmanager_secret.db_pw.id
  secret_string_wo         = ephemeral.random_password.db.result # write-only
  secret_string_wo_version = 1
}

# 3) Consume via write-only arguments on the target resource
resource "aws_db_instance" "db" {
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "appuser"

  # write-only args: used during apply, never stored in plan/state
  password_wo         = ephemeral.random_password.db.result
  password_wo_version = 1
}

Key points:

  • Ephemeral blocks generate values that live only during the run.
  • Write-only arguments prevent secrets from ever appearing in plan or state.
  • Providers (like AWS) expose specific write-only fields such as password_wo or secret_string_wo. Not every resource supports them yet, so check provider docs.

In short, use ephemeral to create or fetch sensitive values, and use write-only arguments to consume them safely. This removes one of the main reasons secrets have historically leaked into Terraform state files.
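The same consume step also works when the secret is supplied from outside rather than generated. Here is a minimal sketch, assuming Terraform 1.11+ and an AWS provider version that exposes password_wo. The variable names and the resource name db_from_input are illustrative.

```hcl
# Supplied at run time, e.g. via the TF_VAR_db_password environment variable.
variable "db_password" {
  type      = string
  sensitive = true
  ephemeral = true # never written to plan or state
}

# Bump this number to tell the provider to send a new password.
variable "db_password_version" {
  type    = number
  default = 1
}

resource "aws_db_instance" "db_from_input" {
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "appuser"

  password_wo         = var.db_password
  password_wo_version = var.db_password_version
}
```

Because Terraform never stores the write-only value, it cannot diff it. The _wo_version argument is what signals a rotation, so incrementing it is the step that pushes a new password to the database.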

3. Modules at Scale

When you manage multiple similar resources, you want to avoid brittle addressing. Using count gives you array-style indexes, and if the order changes, Terraform thinks resources must be destroyed and recreated.

To help with this, use for_each, which uses map keys for stable addressing.

locals {
  buckets = {
    analytics = { versioning = true }
    archive   = { versioning = false }
  }
}

resource "aws_s3_bucket" "b" {
  for_each = local.buckets
  bucket   = "company-${each.key}"
}

  • Adding or removing entries doesn’t renumber the whole set.
  • Keys act as stable IDs, making refactors safer.

I’ve seen teams wipe production buckets because they added a new entry at the top of a count list. for_each removes that risk.
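The versioning flag in local.buckets is not consumed by the example above. One way to use it, sketched here on the assumption that versioning is managed with the separate aws_s3_bucket_versioning resource, is to filter the same map so the keys line up across both resources.

```hcl
# Enable versioning only for entries whose versioning flag is true.
resource "aws_s3_bucket_versioning" "b" {
  for_each = { for name, cfg in local.buckets : name => cfg if cfg.versioning }

  bucket = aws_s3_bucket.b[each.key].id

  versioning_configuration {
    status = "Enabled"
  }
}
```

Because both resources share the map keys, aws_s3_bucket_versioning.b["analytics"] always refers to aws_s3_bucket.b["analytics"], whatever order the entries appear in.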

4. Conditional Resources

Not every environment needs every resource. A bastion host may only be needed in production. A costlier logging setup might be disabled in dev. Without conditional logic, you either duplicate code or risk mistakes by commenting things out.

Use for_each (or count) conditionals to create resources only when required.

resource "aws_instance" "bastion" {
  for_each      = var.environment == "prod" ? { "prod" = true } : {}
  ami           = "ami-0abcd1234"
  instance_type = "t3.micro"
}

Why this matters:

  • This keeps your stacks consistent across environments.
  • The plan output will clearly show whether the resource exists.
  • Avoids drift between environments caused by manual hacks.

As mentioned earlier, I prefer for_each over count so the addressing stays stable when you later extend the config.
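A conditional resource may have zero instances, so anything that references it has to cope with the empty case. A small sketch, assuming you want to expose the bastion’s private IP when it exists (the output name is illustrative):

```hcl
# one() returns the single element, or null when the collection is empty,
# so this output is valid in every environment.
output "bastion_private_ip" {
  value = one([for b in aws_instance.bastion : b.private_ip])
}
```

In prod this yields the IP address. Everywhere else it is null, and downstream code can branch on that instead of failing on a missing index.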

5. Dynamic Blocks

Terraform resources often have nested arguments (like security group rules). Copy-pasting them quickly becomes unmaintainable, especially when rules differ by environment or input.

I use dynamic blocks to generate nested configuration from variables.

variable "allowed_ports" {
  type    = list(number)
  default = [22, 80, 443]
}

resource "aws_security_group" "web" {
  name = "web-sg"

  dynamic "ingress" {
    for_each = var.allowed_ports
    content {
      from_port   = ingress.value
      to_port     = ingress.value
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  }
}

Why this matters:

  • Rules become data-driven.
  • Adding or removing ports is a variable change, not a code change.
  • Keeps security group definitions DRY and consistent.
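When rules differ by more than the port, the same technique extends to a list of objects. This is a sketch only: the variable shape, the CIDR ranges, and the resource name web_rules are assumptions chosen for illustration.

```hcl
variable "ingress_rules" {
  type = list(object({
    port        = number
    cidr_blocks = list(string)
    description = string
  }))
  default = [
    { port = 443, cidr_blocks = ["0.0.0.0/0"], description = "Public HTTPS" },
    { port = 22, cidr_blocks = ["10.0.0.0/8"], description = "SSH from internal ranges" },
  ]
}

resource "aws_security_group" "web_rules" {
  name = "web-sg-rules"

  dynamic "ingress" {
    for_each = var.ingress_rules
    iterator = rule # rename the iterator so the body reads naturally

    content {
      description = rule.value.description
      from_port   = rule.value.port
      to_port     = rule.value.port
      protocol    = "tcp"
      cidr_blocks = rule.value.cidr_blocks
    }
  }
}
```

Each environment can then supply its own rule list through a tfvars file while the resource definition stays untouched.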

6. Data Sources and Cross-Stack Composition

In larger infrastructures, stacks depend on each other. Your application stack needs VPC IDs from your networking stack. Hard-coding them is brittle, but duplicating resource definitions causes drift.

I use remote state data sources to consume outputs from other stacks.

data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "tf-states"
    key    = "networking/prod.tfstate"
    region = "eu-west-1"
  }
}

module "app" {
  source  = "../modules/app"
  vpc_id  = data.terraform_remote_state.networking.outputs.vpc_id
  subnets = data.terraform_remote_state.networking.outputs.private_subnets
}

Why this matters:

  • Keeps stacks loosely coupled.
  • Networking can evolve independently from the app stack.
  • Consumers only depend on published outputs, not internal details.
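For the consumer above to work, the networking stack has to publish those values as root-level outputs. A minimal sketch of that side, assuming the stack defines resources named aws_vpc.main and aws_subnet.private (both names are hypothetical):

```hcl
# In the networking stack: the published contract other stacks rely on.
output "vpc_id" {
  description = "ID of the shared VPC"
  value       = aws_vpc.main.id
}

output "private_subnets" {
  description = "IDs of the private subnets"
  value       = [for s in aws_subnet.private : s.id]
}
```

Only root module outputs are visible through terraform_remote_state, so anything not declared here stays an internal detail of the networking stack.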

Pro tip: don’t overuse remote state. For very widely used values, consider publishing them into SSM Parameter Store or a config repo so consumers aren’t tightly bound to Terraform internals.
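The Parameter Store alternative might look like the sketch below. The parameter path, the aws_vpc.main reference, and the module name app_via_ssm are assumptions. The two halves live in different stacks and are shown together only for brevity.

```hcl
# Networking stack: publish the value.
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/platform/networking/prod/vpc_id"
  type  = "String"
  value = aws_vpc.main.id
}

# Application stack: consume it without reading the other stack's state.
data "aws_ssm_parameter" "vpc_id" {
  name = "/platform/networking/prod/vpc_id"
}

module "app_via_ssm" {
  source = "../modules/app"

  # The data source marks value as sensitive; a VPC ID is not secret,
  # so unwrap it to keep plan output readable.
  vpc_id = nonsensitive(data.aws_ssm_parameter.vpc_id.value)
}
```

Consumers then need read access to the parameter only, not to the networking state bucket, and non-Terraform tooling can read the same value.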

7. Lifecycle Rules and Safety Nets

Some resources are too important to accidentally destroy (think production S3 buckets). Others need special handling during replacement (like launch templates).

To help protect these resources, I use lifecycle rules to enforce intent.

resource "aws_s3_bucket" "data" {
  bucket = "prod-data"

  lifecycle {
    prevent_destroy = true
    ignore_changes  = [policy]
  }
}

  • prevent_destroy forces you to consciously remove the guard before destroying.
  • ignore_changes prevents Terraform from overwriting fields managed elsewhere (like policies written by security tooling).
  • create_before_destroy is essential for resources where downtime is unacceptable.

Lifecycle rules are your last line of defence against human error. They make destructive actions explicit, not accidental.
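create_before_destroy is not shown in the example above. Here is a small sketch using a security group, where name_prefix lets the replacement coexist with the old resource during the swap. The variable and resource names are illustrative.

```hcl
variable "vpc_id" {
  type = string
}

resource "aws_security_group" "app" {
  # name_prefix (rather than a fixed name) avoids a naming clash while
  # the old and new groups exist side by side.
  name_prefix = "app-"
  vpc_id      = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }
}
```

Without the lifecycle block, Terraform’s default is to destroy first and create second, which leaves a gap whenever a change forces replacement.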

8. Testing Terraform

Terraform plans can look fine, but break when applied, especially when you use modules at scale. Without testing, every change risks production.

Treat Terraform like any other codebase: validate, lint, and test.

Check formatting, then validate syntax and schemas:

terraform fmt -check
terraform validate

Lint for best practices (using tflint):

tflint

Integration tests with Terratest:

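package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/require"
)

func TestVpcModule(t *testing.T) {
	opts := &terraform.Options{
		TerraformDir: "../stacks/networking",
		Vars:         map[string]interface{}{"environment": "test"},
	}
	// Always tear down, even if the apply or the assertions fail.
	defer terraform.Destroy(t, opts)
	terraform.InitAndApply(t, opts)

	vpcID := terraform.Output(t, opts, "vpc_id")
	require.NotEmpty(t, vpcID)
}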

Automated checks catch mistakes before they hit production. The same discipline we use for app code applies to infra.
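For cheaper checks that stay in HCL, Terraform’s built-in test framework (terraform test, available from Terraform 1.6) is an alternative to Terratest. This is a different tool from the one shown above. The sketch below assumes the bucket configuration from the for_each section, a test file placed next to it, and Terraform 1.7+ for mock_provider. The file name is hypothetical.

```hcl
# buckets.tftest.hcl (hypothetical file name)

# Mock the provider so the test needs no AWS credentials.
mock_provider "aws" {}

run "bucket_names_are_prefixed" {
  command = plan

  assert {
    condition     = aws_s3_bucket.b["analytics"].bucket == "company-analytics"
    error_message = "The analytics bucket should be named company-analytics."
  }
}
```

Running terraform test in that directory executes every run block. Plan-only tests like this are fast enough to run on each pull request, leaving the slower apply-and-destroy tests for a nightly or pre-release pipeline.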

9. Handling Drift

People change things in the AWS console. CI jobs fail mid-apply. Over time, state drifts from reality.

Get to know your repair tools.

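  • terraform plan refreshes state by default, so it surfaces drift on the resources and attributes Terraform manages. It won’t see resources created entirely outside Terraform.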
  • terraform apply -refresh-only updates state to match real infra without applying changes.
  • terraform import brings existing resources into Terraform.
  • moved blocks (Terraform 1.1+) record renames or refactors safely.

moved {
  from = aws_iam_role.app
  to   = aws_iam_role.app_role
}

Drift is inevitable. The difference between junior and expert teams is whether they fix drift by clicking around, or by reconciling state and code properly.
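Importing can also be expressed in configuration rather than as a one-off CLI command. A minimal sketch using an import block (Terraform 1.5+), assuming a bucket that was created by hand; the bucket and resource names are hypothetical.

```hcl
# Adopt an existing, manually created bucket into state on the next apply.
import {
  to = aws_s3_bucket.legacy
  id = "legacy-bucket-name"
}

resource "aws_s3_bucket" "legacy" {
  bucket = "legacy-bucket-name"
}
```

Because the import is declared in code, it shows up in the plan and goes through the same review as any other change, which is harder to guarantee when someone runs terraform import from a laptop.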

10. Policy as Code

As your org grows, you need guardrails. How do you stop a junior engineer from opening SSH to the world, or spinning up huge instances in dev?

Enforce rules with policy-as-code.

Example with Conftest + OPA

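package terraform.aws.security

import rego.v1

# input is the plan JSON produced by terraform show -json
deny contains msg if {
  rc := input.resource_changes[_]
  rc.type == "aws_security_group_rule"
  rc.change.after.cidr_blocks[_] == "0.0.0.0/0"
  rc.change.after.from_port == 22
  msg := "SSH open to the world is not allowed"
}

Run this against the JSON form of a saved plan in CI (terraform plan -out=tfplan, then terraform show -json tfplan > tfplan.json, then conftest test --namespace terraform.aws.security tfplan.json) to block unsafe changes before apply. Note that terraform plan -json on its own emits a stream of log events rather than the plan document the policy inspects.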

Policy-as-code scales governance without slowing down delivery. Instead of relying on human review, rules are enforced automatically across every plan.
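Some guardrails can also live in the configuration itself, as a complement to external policy rather than a replacement for it. This sketch uses a precondition (Terraform 1.2+) to address the “huge instances in dev” case. The list of allowed instance types and the resource name are assumptions.

```hcl
variable "environment" {
  type = string
}

variable "instance_type" {
  type = string
}

resource "aws_instance" "worker" {
  ami           = "ami-0abcd1234"
  instance_type = var.instance_type

  lifecycle {
    # Fails the plan if a non-prod environment asks for a large instance.
    precondition {
      condition     = var.environment == "prod" || contains(["t3.micro", "t3.small"], var.instance_type)
      error_message = "Only t3.micro and t3.small are allowed outside prod."
    }
  }
}
```

The trade-off is that in-config checks only protect the modules that include them, whereas a policy run in CI applies to every plan regardless of who wrote the code.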

Final Thoughts

Terraform’s real power isn’t in creating a bucket or an EC2 instance. It’s in managing infrastructure safely at scale.

  • Use aliases for clarity across accounts.
  • Treat secrets and state as first-class sensitive data.
  • Prefer for_each for stability.
  • Use dynamic blocks and remote state wisely to keep stacks DRY but decoupled.
  • Add lifecycle rules and policy as code to protect production.
  • Always test and lint your infra like any other codebase.

Get these patterns right, and Terraform becomes boring, and that’s exactly what you want when it’s managing production.

Gary Worthington is a software engineer, delivery consultant, and agile coach who helps teams move fast, learn faster, and scale when it matters. He writes about modern engineering, product thinking, and helping teams ship things that matter.

Through his consultancy, More Than Monkeys, Gary helps startups and scaleups improve how they build software — from tech strategy and agile delivery to product validation and team development.

Visit morethanmonkeys.co.uk to learn how we can help you build better, faster.

Follow Gary on LinkedIn for practical insights into engineering leadership, agile delivery, and team performance