Terraform: The Complete Guide from First Resource to Team IaC
Master Terraform from zero to team-scale: providers, state, modules, workspaces, remote backends, CI/CD pipelines, and how IaC fits into an agile sprint workflow.
Think about the difference between a restaurant that writes its recipes down and one where the head chef keeps everything in their head. The second restaurant is held hostage by a single person. The moment that person is sick, on vacation, or simply gone, the kitchen stops working. Nobody else can reproduce what they make. The knowledge is not in the system; it is in one person's memory.
Most infrastructure teams start like that second restaurant. Servers provisioned through web consoles, configurations applied by hand, the exact steps known only to whoever was at the keyboard that day. It works until something breaks, until someone leaves, until you need a second copy of the environment, until an audit asks you to prove what changed and when.
Terraform is the discipline of writing the recipe down. Not as documentation that gets stale, but as executable code that provisions the exact infrastructure you describe, every time, on any cloud, with a record of every change.
What Terraform Actually Does
Before touching code, the mental model matters. Terraform does three things:
It declares desired state. You describe the infrastructure you want in HCL (HashiCorp Configuration Language): a server with 2 CPUs, a firewall with specific rules, a DNS record pointing at that server's IP. You do not tell Terraform how to create these resources; you declare what they should look like when done.
It computes a plan. Terraform compares your declared state against the current real state of your infrastructure and computes the minimal set of changes needed to reach the desired state. Create this, modify that, destroy the other thing. You review the plan before anything is applied.
It applies changes and records them. When you approve the plan, Terraform executes the API calls to create, modify, or destroy resources. After each run, it records the resulting state in a state file. That file is what Terraform uses to compare against on the next run.
The workflow is always the same: write → terraform plan → review → terraform apply. No exceptions in production.
Who Uses Terraform and How
Terraform serves different roles depending on your position in the team. Understanding this prevents the common mistake of treating it as a purely senior-engineer tool.
Junior developers use Terraform in read mode first. They run terraform plan to understand what changes a PR will make. They read existing modules to understand how the environment is structured. They add environment variables, update DNS records, resize resources within existing modules. The goal is comfort with the workflow before writing new infrastructure.
Mid-level engineers write new resources within an existing structure, create module calls, update provider versions, and manage workspace-specific configurations. They review plans before production applies and participate in incident response when infrastructure changes cause issues.
Senior DevOps engineers design the module hierarchy, set remote backend configuration, define the team workflow (branch policies, CI/CD integration, review requirements), manage provider upgrades, and own the state. They write the recipes that others follow.
Tech leads and architects decide which clouds and services to use, make buy-vs-build decisions on modules (write your own vs. use the Terraform Registry), and set the standards for naming conventions, tagging, and resource lifecycle policies.
Level 1: Your First Terraform Project
Start with the smallest possible thing that is still real: a single server with a firewall.
Installation
# on macOS
brew tap hashicorp/tap
brew install hashicorp/tap/terraform
# on Linux (Debian/Ubuntu)
wget -O - https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform
# verify
terraform version
Project Structure
my-first-infra/
├── main.tf              # resources
├── variables.tf         # input variable declarations
├── outputs.tf           # output value declarations
└── terraform.tfvars     # actual variable values (never commit secrets here)
Provider Configuration
A provider is a plugin that knows how to talk to a specific API: AWS, GCP, Hetzner, Cloudflare, GitHub, and so on. You declare which providers you need and Terraform downloads them.
# main.tf
terraform {
required_version = ">= 1.7.0"
required_providers {
hcloud = {
source = "hetznercloud/hcloud"
version = "~> 1.45" # ~> means >= 1.45.0, < 2.0.0
}
}
}
provider "hcloud" {
token = var.hcloud_token
}
Your First Resource
# main.tf (continued)
resource "hcloud_server" "web" {
name = "web-${var.environment}"
server_type = "cx22" # 2 vCPU, 4GB RAM
image = "ubuntu-24.04"
location = "nbg1" # Nuremberg
ssh_keys = [hcloud_ssh_key.default.id]
labels = {
environment = var.environment
managed-by = "terraform"
}
}
resource "hcloud_ssh_key" "default" {
name = "deploy-key"
public_key = file("~/.ssh/id_ed25519.pub")
}
resource "hcloud_firewall" "web" {
name = "web-${var.environment}"
rule {
direction = "in"
protocol = "tcp"
port = "22"
source_ips = ["0.0.0.0/0", "::/0"]
}
rule {
direction = "in"
protocol = "tcp"
port = "443"
source_ips = ["0.0.0.0/0", "::/0"]
}
}
resource "hcloud_firewall_attachment" "web" {
firewall_id = hcloud_firewall.web.id
server_ids = [hcloud_server.web.id]
}
Variables
# variables.tf
variable "hcloud_token" {
type = string
sensitive = true
description = "Hetzner Cloud API token"
}
variable "environment" {
type = string
description = "staging or production"
validation {
condition = contains(["staging", "production"], var.environment)
error_message = "environment must be staging or production"
}
}
# terraform.tfvars: values for local use
# DO NOT commit this file. Add to .gitignore.
hcloud_token = "your-api-token-here"
environment = "staging"
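That warning can be enforced mechanically. A minimal .gitignore for this layout might look like the following; ignoring terraform.tfvars by name (rather than all *.tfvars) leaves room for committed per-environment variable files later:

```gitignore
# .gitignore: keep secrets and local state out of the repository
terraform.tfvars     # holds the API token
*.tfstate
*.tfstate.backup
.terraform/          # provider binaries, downloaded at init
```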
Outputs
# outputs.tf
output "server_ip" {
value = hcloud_server.web.ipv4_address
description = "Public IPv4 address of the web server"
}
output "server_id" {
value = hcloud_server.web.id
}
The First Run
# initialize: downloads providers, sets up backend
terraform init
# preview what will be created
terraform plan
# apply the plan
terraform apply
# after apply, inspect the state
terraform show
# view specific outputs
terraform output server_ip
# when you are done with the resource
terraform destroy
The first terraform plan output teaches you to read Terraform. Resources prefixed with + will be created. Those with ~ will be modified in place. Those with -/+ will be destroyed and recreated (because the change cannot be made in place). Those with - will be destroyed. Reviewing this output carefully is the single most important habit in Terraform.
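A truncated plan for the server above might look like this (illustrative and heavily abbreviated; real output lists every attribute):

```
Terraform will perform the following actions:

  # hcloud_server.web will be created
  + resource "hcloud_server" "web" {
      + name        = "web-staging"
      + server_type = "cx22"
      ...
    }

Plan: 3 to add, 0 to change, 0 to destroy.
```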
Level 2: State Management
State is what makes Terraform different from a script. It is also what makes it dangerous when misunderstood.
The state file (terraform.tfstate) records the mapping between your HCL declarations and the real resource IDs in the provider. Without it, Terraform cannot know that hcloud_server.web corresponds to server ID 12345678 in Hetzner. It would try to create a new one on every run.
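If state is ever lost while the resources still exist, they can be re-adopted instead of recreated. Since Terraform 1.5 this can be done declaratively with an import block (the ID here is illustrative):

```hcl
# adopt an existing server into state instead of creating a duplicate
import {
  to = hcloud_server.web
  id = "12345678" # the provider's resource ID
}
```

The next terraform plan shows the pending import. On older versions, `terraform import hcloud_server.web 12345678` does the same imperatively.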
What State Contains
{
"version": 4,
"terraform_version": "1.7.3",
"resources": [
{
"mode": "managed",
"type": "hcloud_server",
"name": "web",
"provider": "provider[\"registry.terraform.io/hetznercloud/hcloud\"]",
"instances": [
{
"schema_version": 0,
"attributes": {
"id": "12345678",
"name": "web-staging",
"ipv4_address": "1.2.3.4",
"status": "running"
// ... all resource attributes
}
}
]
}
]
}
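Rather than reading the JSON directly, the state can be inspected with dedicated subcommands:

```shell
terraform state list                     # all resource addresses in state
terraform state show hcloud_server.web   # full attributes of one resource
terraform state rm hcloud_server.web     # forget a resource without destroying it
```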
The Problem with Local State
Local state (terraform.tfstate on your disk) fails immediately when:
- Two engineers run `terraform apply` simultaneously: state conflict, one overwrites the other
- Your laptop breaks: state is gone, Terraform cannot reconcile with real infrastructure
- You need to run Terraform from CI/CD: no access to your local file
The solution is a remote backend with state locking.
Remote Backend: S3 + DynamoDB
terraform {
backend "s3" {
bucket = "your-org-terraform-state"
key = "projects/my-app/staging/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-state-locks" # for locking
}
}
Create the S3 bucket and DynamoDB table once (manually or with a bootstrap Terraform config):
# bootstrap/main.tf: run this once to create the state backend
resource "aws_s3_bucket" "terraform_state" {
bucket = "your-org-terraform-state"
}
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled" # versioning gives you state history and rollback
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
}
}
resource "aws_dynamodb_table" "terraform_locks" {
name = "terraform-state-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
}
With this backend, when anyone runs terraform apply:
1. Terraform acquires a lock in DynamoDB, so no other process can apply simultaneously
2. It reads the latest state from S3
3. It applies the plan
4. It writes the new state to S3
5. It releases the lock
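When an existing project switches from local state to this backend, init migrates the state file for you:

```shell
# after adding the backend block, copy existing local state into S3
terraform init -migrate-state
```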
Terraform Cloud / HCP Terraform (Alternative)
HashiCorpβs managed backend handles remote state, locking, and plan execution with a web UI and API. Free tier covers most small teams. Configure with:
terraform {
cloud {
organization = "your-org"
workspaces {
name = "my-app-staging"
}
}
}
Level 3: Modules
A module is a reusable unit of infrastructure. It is the difference between copying and pasting the same server + firewall + DNS record configuration four times for four environments versus defining it once and calling it four times with different parameters.
When to Write a Module
Write a module when you have two or more similar resource groups that differ only in their inputs. A single resource is not a module. Three identical server setups with different names are.
Module Structure
modules/
server/
main.tf # the resources
variables.tf # inputs
outputs.tf # what callers can reference
versions.tf # required provider versions
database/
main.tf
variables.tf
outputs.tf
versions.tf
environments/
staging/
main.tf # calls modules
variables.tf
outputs.tf
backend.tf
production/
main.tf
variables.tf
outputs.tf
backend.tf
Writing a Module
# modules/server/variables.tf
variable "name" {
type = string
description = "Server name"
}
variable "server_type" {
type = string
default = "cx22"
description = "Hetzner server type"
}
variable "environment" {
type = string
}
variable "allowed_ports" {
type = list(number)
default = [80, 443]
description = "Inbound TCP ports to allow"
}
# modules/server/main.tf
resource "hcloud_server" "this" {
name = var.name
server_type = var.server_type
image = "ubuntu-24.04"
location = "nbg1"
labels = {
environment = var.environment
managed-by = "terraform"
}
}
resource "hcloud_firewall" "this" {
name = "${var.name}-firewall"
dynamic "rule" {
for_each = var.allowed_ports
content {
direction = "in"
protocol = "tcp"
port = tostring(rule.value)
source_ips = ["0.0.0.0/0", "::/0"]
}
}
}
resource "hcloud_firewall_attachment" "this" {
firewall_id = hcloud_firewall.this.id
server_ids = [hcloud_server.this.id]
}
# modules/server/outputs.tf
output "ipv4_address" {
value = hcloud_server.this.ipv4_address
}
output "server_id" {
value = hcloud_server.this.id
}
Calling the Module
# environments/staging/main.tf
module "web" {
source = "../../modules/server"
name = "web-staging"
environment = "staging"
server_type = "cx22"
allowed_ports = [80, 443]
}
module "api" {
source = "../../modules/server"
name = "api-staging"
environment = "staging"
server_type = "cx32"
allowed_ports = [8080]
}
# reference module outputs
output "web_ip" {
value = module.web.ipv4_address
}
Using Registry Modules
The Terraform Registry has verified modules for common patterns: VPCs, EKS clusters, RDS instances. Use them instead of reinventing.
# use the official AWS VPC module
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.5.1" # always pin the version
name = "main-vpc"
cidr = "10.0.0.0/16"
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
enable_nat_gateway = true
single_nat_gateway = true # cost optimization: use one NAT gateway
tags = {
Environment = var.environment
ManagedBy = "terraform"
}
}
Level 4: Multi-Environment Workflows
The most common mistake in Terraform is treating staging and production as the same configuration with different variable values. They are not. They can diverge significantly in resource types, sizes, replica counts, and external integrations. The right model separates them structurally.
Separate Directories Per Environment
environments/
staging/
backend.tf # backend "s3" { key = "staging/terraform.tfstate" }
main.tf # module calls with staging-specific inputs
variables.tf
terraform.tfvars
production/
backend.tf # backend "s3" { key = "production/terraform.tfstate" }
main.tf # module calls with production-specific inputs
variables.tf
terraform.tfvars
This means separate state files, separate plan/apply cycles, and no risk of accidentally destroying production while targeting staging.
Workspaces (When to Use Them and When Not To)
Terraform workspaces allow multiple state files within a single backend configuration. They look appealing as a "one config, multiple environments" solution:
terraform workspace new staging
terraform workspace new production
terraform workspace select staging
terraform apply
# conditional resource sizing based on workspace
resource "hcloud_server" "web" {
# ...
server_type = terraform.workspace == "production" ? "cx32" : "cx22"
}
Workspaces work well for truly identical environments that differ only in scale (ephemeral feature branch environments, for example). They work poorly for staging vs production because the conditional logic accumulates into unreadable configuration. Use separate directories for long-lived, architecturally different environments. Use workspaces for short-lived, structurally identical ones.
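A feature-branch flow under that model might look like this (workspace and file names are illustrative; `-or-create` requires Terraform 1.4+):

```shell
terraform workspace select -or-create "feature-login"
terraform apply -var-file="feature.tfvars"

# tear down when the branch merges
terraform destroy -var-file="feature.tfvars"
terraform workspace select default
terraform workspace delete "feature-login"
```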
Variable Files Per Environment
When using a shared configuration with tfvars files:
# apply with environment-specific variables
terraform apply -var-file="staging.tfvars"
terraform apply -var-file="production.tfvars"
# staging.tfvars
environment = "staging"
server_type = "cx22"
replica_count = 1
enable_backups = false
database_size = "small"
# production.tfvars
environment = "production"
server_type = "cx32"
replica_count = 3
enable_backups = true
database_size = "large"
Level 5: Advanced HCL Patterns
for_each and count
count creates multiple identical copies of a resource. for_each creates one resource per item in a map or set, with each resource independently addressable.
# count: creates 3 servers: worker[0], worker[1], worker[2]
resource "hcloud_server" "worker" {
count = var.worker_count
name = "worker-${count.index}"
server_type = "cx22"
image = "ubuntu-24.04"
}
# for_each: creates servers with meaningful addresses: services["web"], services["api"]
resource "hcloud_server" "services" {
for_each = {
web = { type = "cx22", location = "nbg1" }
api = { type = "cx32", location = "fsn1" }
}
name = each.key
server_type = each.value.type
location = each.value.location
image = "ubuntu-24.04"
}
# reference: hcloud_server.services["web"].ipv4_address
Prefer for_each over count for anything that is not a simple quantity. With count, removing an element from the middle renumbers all subsequent elements, which causes Terraform to destroy and recreate them. With for_each, each element has a stable key.
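The same point applies to plain lists of names: converting the list to a set gives each instance a stable address. A sketch:

```hcl
# removing "worker-b" later destroys only that one server,
# instead of renumbering and recreating the rest
resource "hcloud_server" "worker" {
  for_each    = toset(["worker-a", "worker-b", "worker-c"])
  name        = each.key
  server_type = "cx22"
  image       = "ubuntu-24.04"
}
```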
Dynamic Blocks
variable "security_group_rules" {
type = list(object({
port = number
protocol = string
cidr = string
}))
}
resource "hcloud_firewall" "main" {
name = "main"
dynamic "rule" {
for_each = var.security_group_rules
content {
direction = "in"
protocol = rule.value.protocol
port = tostring(rule.value.port)
source_ips = [rule.value.cidr]
}
}
}
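The caller then supplies the rules as plain data, for example in a tfvars file (values illustrative):

```hcl
# terraform.tfvars
security_group_rules = [
  { port = 22,  protocol = "tcp", cidr = "10.0.0.0/8" }, # SSH from internal network only
  { port = 443, protocol = "tcp", cidr = "0.0.0.0/0" },  # HTTPS from anywhere
]
```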
Local Values
Locals reduce repetition and make expressions readable:
locals {
common_tags = {
environment = var.environment
managed-by = "terraform"
project = "my-app"
team = "platform"
}
server_name = "${var.project}-${var.environment}-${var.region}"
is_production = var.environment == "production"
}
resource "hcloud_server" "web" {
name = local.server_name
labels = local.common_tags
}
Data Sources
Data sources read existing infrastructure that Terraform did not create. They let you reference resources managed elsewhere: by another team, another Terraform config, or manually.
# read an existing SSH key by name (created by another team)
data "hcloud_ssh_key" "ops_team" {
name = "ops-team-2025"
}
# use it in a new resource
resource "hcloud_server" "web" {
ssh_keys = [data.hcloud_ssh_key.ops_team.id]
}
# read the latest Ubuntu image ID dynamically
data "hcloud_image" "ubuntu_24" {
name = "ubuntu-24.04"
type = "system"
}
resource "hcloud_server" "web" {
image = data.hcloud_image.ubuntu_24.id # always the latest ubuntu-24.04
}
depends_on
Terraform builds a dependency graph from references between resources. Explicit depends_on is only needed when a dependency exists that Terraform cannot infer from resource references:
resource "hcloud_server" "app" {
# ...
depends_on = [hcloud_firewall_attachment.base]
# needed because app must start after firewall is attached,
# but the app resource has no direct reference to the firewall
}
Use depends_on sparingly. If you find yourself needing it often, it usually means the resource references are not structured correctly.
Level 6: Terraform in Agile Teams
The Terraform Sprint Cycle
In a sprint-based team, infrastructure changes follow a rhythm that must integrate with the feature development cycle without becoming a bottleneck.
The two failure modes are: infrastructure changes blocking feature work (infra team is slow, PR reviews take days) and feature work breaking infrastructure (developers merge infra changes without review).
The solution is treating infrastructure as code with the same review standards as application code, but with one additional step: the plan output is part of the PR review.
Sprint planning: infrastructure tasks created alongside feature tasks
        ↓
Developer creates infra branch → opens PR → CI runs terraform plan
        ↓
PR includes plan output in comments → reviewer reads plan, not just code
        ↓
Merge to main → CI applies to staging automatically
        ↓
Sprint review → staging is the demo environment
        ↓
Release → CI applies to production with manual approval gate
GitHub Actions: Automated Plan on PR
# .github/workflows/terraform.yaml
name: Terraform
on:
pull_request:
paths:
- "infrastructure/**"
push:
branches: [main]
paths:
- "infrastructure/**"
env:
TF_VERSION: "1.7.3"
AWS_REGION: "us-east-1"
jobs:
plan:
name: Plan
runs-on: ubuntu-latest
if: github.event_name == 'pull_request'
permissions:
contents: read
pull-requests: write
id-token: write # for OIDC auth to AWS
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Configure AWS credentials (OIDC)
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789:role/TerraformCIReadOnly
aws-region: ${{ env.AWS_REGION }}
- name: Terraform Init
working-directory: infrastructure/environments/staging
run: terraform init
- name: Terraform Plan
id: plan
working-directory: infrastructure/environments/staging
run: |
terraform plan -no-color -out=tfplan 2>&1 | tee plan.txt
echo "exitcode=${PIPESTATUS[0]}" >> $GITHUB_OUTPUT
- name: Post Plan to PR
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const plan = fs.readFileSync('infrastructure/environments/staging/plan.txt', 'utf8');
const truncated = plan.length > 60000 ? plan.slice(0, 60000) + '\n\n... truncated ...' : plan;
const body = `## Terraform Plan: staging\n\`\`\`\n${truncated}\n\`\`\``;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body
});
apply-staging:
name: Apply to Staging
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
environment: staging
permissions:
contents: read
id-token: write
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789:role/TerraformCIApply
aws-region: ${{ env.AWS_REGION }}
- name: Terraform Init
working-directory: infrastructure/environments/staging
run: terraform init
- name: Terraform Apply
working-directory: infrastructure/environments/staging
run: terraform apply -auto-approve
apply-production:
name: Apply to Production
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
needs: [apply-staging]
environment: production # requires manual approval in GitHub Environments
permissions:
contents: read
id-token: write
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789:role/TerraformCIApply
aws-region: ${{ env.AWS_REGION }}
- name: Terraform Init
working-directory: infrastructure/environments/production
run: terraform init
- name: Terraform Apply
working-directory: infrastructure/environments/production
run: terraform apply -auto-approve
OIDC Authentication (No Long-Lived Credentials)
The CI pipeline above uses OIDC to assume an AWS IAM role instead of storing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as secrets. This is the correct approach. Long-lived keys get rotated infrequently, leak into logs, and have broad permissions. OIDC tokens are short-lived and scoped to the specific GitHub repository and branch.
// IAM Trust Policy for the TerraformCIReadOnly role
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::123456789:oidc-provider/token.actions.githubusercontent.com"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
},
"StringLike": {
"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:pull_request"
}
}
}
]
}
Two roles, two trust policies, two permission sets:
- `TerraformCIReadOnly`: triggered on pull_request, can only read state and run plans (no write permissions to infrastructure)
- `TerraformCIApply`: triggered on push to main, has write permissions, only trusted for the main branch
Team Review Conventions
The PR checklist for infrastructure changes:
## Infrastructure PR Checklist
- [ ] `terraform fmt` was run and the code is formatted
- [ ] `terraform validate` passes
- [ ] Plan output is attached and reviewed
- [ ] Plan shows no unexpected destroys
- [ ] Variable changes are documented in the PR description
- [ ] Secrets are not in the plan output
- [ ] The change has been verified in a local or branch environment first
- [ ] Rollback plan is documented if the change is destructive
Handling Destructive Changes
Terraform will sometimes propose to destroy and recreate a resource when an in-place update is not possible. Common examples: changing a server's base image (requires a new VM), renaming a resource whose name cannot be updated in place, and modifying certain immutable database parameters.
Never let a destructive change to a production database or server merge silently. Protect critical resources:
resource "hcloud_server" "web" {
# ...
lifecycle {
prevent_destroy = true # terraform apply will error if a destroy is planned
ignore_changes = [
labels, # ignore label changes managed externally
]
create_before_destroy = true # create the replacement before destroying the old one
}
}
create_before_destroy is essential for servers behind a load balancer: it ensures the new server is healthy before the old one is removed, preventing downtime.
Level 7: Security and Compliance
Least Privilege for CI
The CI role should have exactly the permissions it needs. For a Hetzner + Cloudflare stack, generate a read-only API token for plans and a write token for applies; never use the full admin token.
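One way to wire this up without putting tokens in files is environment variables; the variable names here are illustrative:

```shell
# plans run with the read-only token
export TF_VAR_hcloud_token="$HCLOUD_READ_TOKEN"
terraform plan

# applies run with the write token
export TF_VAR_hcloud_token="$HCLOUD_WRITE_TOKEN"
terraform apply
```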
For AWS, scope the IAM policy to the specific resources Terraform manages:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:Describe*",
"ec2:CreateInstance",
"ec2:TerminateInstances",
"ec2:ModifyInstanceAttribute",
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket",
"dynamodb:GetItem",
"dynamodb:PutItem",
"dynamodb:DeleteItem"
],
"Resource": [
"arn:aws:ec2:*:*:instance/*",
"arn:aws:s3:::your-org-terraform-state/*",
"arn:aws:dynamodb:*:*:table/terraform-state-locks"
]
}
]
}
Static Analysis with tfsec and checkov
Run security analysis on your Terraform code before applying:
# tfsec: scans for security misconfigurations
brew install tfsec
tfsec .
# checkov: broader policy-as-code, covers Terraform + Kubernetes + Dockerfiles
pip install checkov
checkov -d .
Add both to the CI plan job:
- name: Run tfsec
uses: aquasecurity/tfsec-action@v1.0.0
with:
working_directory: infrastructure/
- name: Run checkov
uses: bridgecrewio/checkov-action@v12
with:
directory: infrastructure/
framework: terraform
Drift Detection
Drift happens when someone modifies infrastructure outside of Terraform: through a web console, a manual CLI command, or an automated process. The state file no longer matches reality.
Detect drift regularly:
# show drift without proposing configuration changes
# (terraform refresh is deprecated in favor of this)
terraform plan -refresh-only
# if a normal plan shows changes when no HCL changed, drift has occurred
terraform plan
In CI, run a scheduled plan on production to detect drift:
# .github/workflows/drift-detection.yaml
on:
schedule:
- cron: "0 9 * * 1-5" # Monday to Friday at 9 AM
jobs:
detect-drift:
runs-on: ubuntu-latest
steps:
# ... init steps ...
- name: Detect Drift
run: |
terraform plan -detailed-exitcode
# exit code 0 = no changes (no drift)
# exit code 1 = error
# exit code 2 = changes detected (drift)
Level 8: Scaling to Multiple Teams
When multiple teams own different parts of the infrastructure, sharing a single Terraform configuration creates ownership conflicts, blast radius problems, and slow CI pipelines.
State Splitting by Domain
Split state along ownership boundaries, not just environments:
state files:
  networking/staging/terraform.tfstate        # VPCs, subnets, DNS zones
  networking/production/terraform.tfstate
  platform/staging/terraform.tfstate          # K3s clusters, databases
  platform/production/terraform.tfstate
  app-team-a/staging/terraform.tfstate        # team A's resources
  app-team-a/production/terraform.tfstate
  app-team-b/staging/terraform.tfstate
  app-team-b/production/terraform.tfstate
Teams apply their own state independently. The networking team owns the VPCs. The platform team owns the clusters. App teams own their own deployments within the cluster.
Remote State References
When one configuration needs values from another, use terraform_remote_state instead of duplicating outputs or hardcoding IDs:
# platform team reads VPC IDs from networking team's state
data "terraform_remote_state" "networking" {
backend = "s3"
config = {
bucket = "your-org-terraform-state"
key = "networking/staging/terraform.tfstate"
region = "us-east-1"
}
}
resource "aws_eks_cluster" "main" {
vpc_config {
subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
}
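This only works if the producing configuration exports the value; remote state consumers can read declared outputs only. On the networking side it might look like this (the resource name is an assumption):

```hcl
# networking config: expose subnet IDs for downstream states
output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}
```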
Terragrunt for DRY Multi-Environment Configs
When separate directories lead to too much repeated backend and provider configuration, terragrunt adds a thin wrapper that eliminates the boilerplate:
# terragrunt.hcl: root config
remote_state {
backend = "s3"
config = {
bucket = "your-org-terraform-state"
key = "${path_relative_to_include()}/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-state-locks"
encrypt = true
}
}
generate "provider" {
path = "provider.tf"
if_exists = "overwrite_terragrunt"
contents = <<EOF
provider "aws" {
region = "us-east-1"
}
EOF
}
# environments/staging/terragrunt.hcl
include "root" {
path = find_in_parent_folders()
}
inputs = {
environment = "staging"
instance_type = "t3.small"
}
With Terragrunt, terragrunt run-all plan runs plans across all environments in dependency order.
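The dependency order comes from explicit dependency blocks between units; a sketch with illustrative paths:

```hcl
# environments/staging/platform/terragrunt.hcl
dependency "networking" {
  config_path = "../networking"
}

inputs = {
  subnet_ids = dependency.networking.outputs.private_subnet_ids
}
```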
The Infrastructure Contract
Terraform is not just a deployment tool. It is a contract between your team and your infrastructure. Everything in that contract is explicit, versioned, and reviewable. Everything outside it is a liability.
The discipline that matters is not knowing every Terraform command; it is never making a change to production infrastructure that is not represented in the code. Not once. The moment you log into a console and click something, you have broken the contract. The state drifts. The next plan is wrong. The next apply is unpredictable.
The habit of reaching for a pull request instead of a web console is what separates infrastructure that is maintainable from infrastructure that is held hostage by whoever provisioned it last.
You do not own your infrastructure if you cannot recreate it from code. Version control is not just for applications; it is for everything that can break in production.