Terraform: The Complete Guide from First Resource to Team IaC


Master Terraform from zero to team-scale: providers, state, modules, workspaces, remote backends, CI/CD pipelines, and how IaC fits into an agile sprint workflow.

By Omar Flores

Think about the difference between a restaurant that writes its recipes down and one where the head chef keeps everything in their head. The second restaurant is held hostage by a single person. The moment that person is sick, on vacation, or simply gone, the kitchen stops working. Nobody else can reproduce what they make. The knowledge is not in the system; it is in one person's memory.

Most infrastructure teams start like that second restaurant. Servers provisioned through web consoles, configurations applied by hand, the exact steps known only to whoever was at the keyboard that day. It works until something breaks, until someone leaves, until you need a second copy of the environment, until an audit asks you to prove what changed and when.

Terraform is the discipline of writing the recipe down. Not as documentation that gets stale, but as executable code that provisions the exact infrastructure you describe, every time, on any cloud, with a record of every change.


What Terraform Actually Does

Before touching code, the mental model matters. Terraform does three things:

It declares desired state. You describe the infrastructure you want in HCL (HashiCorp Configuration Language): a server with 2 CPUs, a firewall with specific rules, a DNS record pointing at that server's IP. You do not tell Terraform how to create these resources; you declare what they should look like when done.

It computes a plan. Terraform compares your declared state against the current real state of your infrastructure and computes the minimal set of changes needed to reach the desired state. Create this, modify that, destroy the other thing. You review the plan before anything is applied.

It applies changes and records them. When you approve the plan, Terraform executes the API calls to create, modify, or destroy resources. After each run, it records the resulting state in a state file. That file is what Terraform uses to compare against on the next run.

The workflow is always the same: write → terraform plan → review → terraform apply. No exceptions in production.


Who Uses Terraform and How

Terraform serves different roles depending on your position in the team. Understanding this prevents the common mistake of treating it as a purely senior-engineer tool.

Junior developers use Terraform in read mode first. They run terraform plan to understand what changes a PR will make. They read existing modules to understand how the environment is structured. They add environment variables, update DNS records, resize resources within existing modules. The goal is comfort with the workflow before writing new infrastructure.

Mid-level engineers write new resources within an existing structure, create module calls, update provider versions, and manage workspace-specific configurations. They review plans before production applies and participate in incident response when infrastructure changes cause issues.

Senior DevOps engineers design the module hierarchy, set remote backend configuration, define the team workflow (branch policies, CI/CD integration, review requirements), manage provider upgrades, and own the state. They write the recipes that others follow.

Tech leads and architects define which clouds and which services, make buy-vs-build decisions on modules (write your own vs use the Terraform Registry), set the standards for naming conventions, tagging, and resource lifecycle policies.


Level 1: Your First Terraform Project

Start with the smallest possible thing that is still real: a single server with a firewall.

Installation

# on macOS
brew tap hashicorp/tap
brew install hashicorp/tap/terraform

# on Linux (Debian/Ubuntu)
wget -O - https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform

# verify
terraform version

Project Structure

my-first-infra/
├── main.tf          # resources
├── variables.tf     # input variable declarations
├── outputs.tf       # output value declarations
└── terraform.tfvars # actual variable values (never commit secrets here)
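A .gitignore that matches this structure keeps state and secret-bearing files out of version control. A minimal sketch (whether *.tfvars is ignored depends on whether yours holds secrets; here it does):

```gitignore
# local state and provider caches
.terraform/
terraform.tfstate
terraform.tfstate.backup
crash.log

# variable files that may contain secrets
*.tfvars
*.tfvars.json

# note: .terraform.lock.hcl SHOULD be committed, so it is not listed here
```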

Provider Configuration

A provider is a plugin that knows how to talk to a specific API: AWS, GCP, Hetzner, Cloudflare, GitHub, and so on. You declare which providers you need and Terraform downloads them.

# main.tf
terraform {
  required_version = ">= 1.7.0"

  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = "~> 1.45"   # ~> means >= 1.45.0, < 2.0.0
    }
  }
}

provider "hcloud" {
  token = var.hcloud_token
}

Your First Resource

# main.tf (continued)
resource "hcloud_server" "web" {
  name        = "web-${var.environment}"
  server_type = "cx22"            # 2 vCPU, 4GB RAM
  image       = "ubuntu-24.04"
  location    = "nbg1"            # Nuremberg
  ssh_keys    = [hcloud_ssh_key.default.id]

  labels = {
    environment = var.environment
    managed-by  = "terraform"
  }
}

resource "hcloud_ssh_key" "default" {
  name       = "deploy-key"
  public_key = file(pathexpand("~/.ssh/id_ed25519.pub"))   # file() does not expand ~, pathexpand() does
}

resource "hcloud_firewall" "web" {
  name = "web-${var.environment}"

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "22"
    source_ips = ["0.0.0.0/0", "::/0"]
  }

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "443"
    source_ips = ["0.0.0.0/0", "::/0"]
  }
}

resource "hcloud_firewall_attachment" "web" {
  firewall_id = hcloud_firewall.web.id
  server_ids  = [hcloud_server.web.id]
}

Variables

# variables.tf
variable "hcloud_token" {
  type        = string
  sensitive   = true
  description = "Hetzner Cloud API token"
}

variable "environment" {
  type        = string
  description = "staging or production"

  validation {
    condition     = contains(["staging", "production"], var.environment)
    error_message = "environment must be staging or production"
  }
}

# terraform.tfvars: values for local use
# DO NOT commit this file. Add to .gitignore.
hcloud_token = "your-api-token-here"
environment  = "staging"
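For secrets specifically, an alternative that keeps the token out of any file is Terraform's TF_VAR_ environment variable convention. A sketch; the secret-manager command is a placeholder for whatever your team actually uses:

```shell
# Terraform maps TF_VAR_<name> to variable "<name>" automatically
export TF_VAR_hcloud_token="$(your-secret-manager get hetzner-api-token)"   # hypothetical command
terraform plan   # no token on disk, nothing to accidentally commit
```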

Outputs

# outputs.tf
output "server_ip" {
  value       = hcloud_server.web.ipv4_address
  description = "Public IPv4 address of the web server"
}

output "server_id" {
  value = hcloud_server.web.id
}

The First Run

# initialize β€” downloads providers, sets up backend
terraform init

# preview what will be created
terraform plan

# apply the plan
terraform apply

# after apply, inspect the state
terraform show

# view specific outputs
terraform output server_ip

# when you are done with the resource
terraform destroy

The first terraform plan output teaches you to read Terraform. Resources prefixed with + will be created. Those with ~ will be modified in place. Those with -/+ will be destroyed and recreated (because the change cannot be made in place). Those with - will be destroyed. Reviewing this output carefully is the single most important habit in Terraform.
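To make the symbols concrete, here is an abbreviated, illustrative plan summary (exact wording varies by Terraform version and provider):

```shell
terraform plan
# ...
#   + hcloud_firewall.web    will be created
#   ~ hcloud_server.web      will be updated in-place
# -/+ hcloud_server.api      must be replaced (change cannot be made in place)
#   - hcloud_ssh_key.old     will be destroyed
#
# Plan: 2 to add, 1 to change, 2 to destroy.
```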


Level 2: State Management

State is what makes Terraform different from a script. It is also what makes it dangerous when misunderstood.

The state file (terraform.tfstate) records the mapping between your HCL declarations and the real resource IDs in the provider. Without it, Terraform cannot know that hcloud_server.web corresponds to server ID 12345678 in Hetzner. It would try to create a new one on every run.

What State Contains

{
  "version": 4,
  "terraform_version": "1.7.3",
  "resources": [
    {
      "mode": "managed",
      "type": "hcloud_server",
      "name": "web",
      "provider": "provider[\"registry.terraform.io/hetznercloud/hcloud\"]",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "id": "12345678",
            "name": "web-staging",
            "ipv4_address": "1.2.3.4",
            "status": "running"
            // ... all resource attributes
          }
        }
      ]
    }
  ]
}
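State can be inspected and adjusted with dedicated subcommands rather than editing the JSON by hand; the addresses and ID below are the ones from this example:

```shell
terraform state list                        # every resource address in state
terraform state show hcloud_server.web      # attributes of one resource

# re-address a resource after refactoring it into a module (no destroy/recreate)
terraform state mv hcloud_server.web module.web.hcloud_server.this

# adopt a manually created server into state
terraform import hcloud_server.web 12345678
```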

The Problem with Local State

Local state (terraform.tfstate on your disk) fails immediately when:

  • Two engineers run terraform apply simultaneously: state conflict, one overwrites the other
  • Your laptop breaks: state is gone, Terraform cannot reconcile with real infrastructure
  • You need to run Terraform from CI/CD: no access to your local file

The solution is a remote backend with state locking.

Remote Backend: S3 + DynamoDB

terraform {
  backend "s3" {
    bucket         = "your-org-terraform-state"
    key            = "projects/my-app/staging/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"   # for locking
  }
}

Create the S3 bucket and DynamoDB table once (manually or with a bootstrap Terraform config):

# bootstrap/main.tf: run this once to create the state backend
resource "aws_s3_bucket" "terraform_state" {
  bucket = "your-org-terraform-state"
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"   # versioning gives you state history and rollback
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

With this backend, when anyone runs terraform apply:

  1. Terraform acquires a lock in DynamoDB, so no other process can apply simultaneously
  2. It reads the latest state from S3
  3. It applies the plan
  4. It writes the new state to S3
  5. It releases the lock

Terraform Cloud / HCP Terraform (Alternative)

HashiCorp's managed backend handles remote state, locking, and plan execution with a web UI and API. The free tier covers most small teams. Configure with:

terraform {
  cloud {
    organization = "your-org"
    workspaces {
      name = "my-app-staging"
    }
  }
}

Level 3: Modules

A module is a reusable unit of infrastructure. It is the difference between copying and pasting the same server + firewall + DNS record configuration four times for four environments versus defining it once and calling it four times with different parameters.

When to Write a Module

Write a module when you have two or more similar resource groups that differ only in their inputs. A single resource is not a module. Three identical server setups with different names are.

Module Structure

modules/
  server/
    main.tf        # the resources
    variables.tf   # inputs
    outputs.tf     # what callers can reference
    versions.tf    # required provider versions
  database/
    main.tf
    variables.tf
    outputs.tf
    versions.tf
environments/
  staging/
    main.tf        # calls modules
    variables.tf
    outputs.tf
    backend.tf
  production/
    main.tf
    variables.tf
    outputs.tf
    backend.tf

Writing a Module

# modules/server/variables.tf
variable "name" {
  type        = string
  description = "Server name"
}

variable "server_type" {
  type        = string
  default     = "cx22"
  description = "Hetzner server type"
}

variable "environment" {
  type = string
}

variable "allowed_ports" {
  type        = list(number)
  default     = [80, 443]
  description = "Inbound TCP ports to allow"
}

# modules/server/main.tf
resource "hcloud_server" "this" {
  name        = var.name
  server_type = var.server_type
  image       = "ubuntu-24.04"
  location    = "nbg1"

  labels = {
    environment = var.environment
    managed-by  = "terraform"
  }
}

resource "hcloud_firewall" "this" {
  name = "${var.name}-firewall"

  dynamic "rule" {
    for_each = var.allowed_ports
    content {
      direction  = "in"
      protocol   = "tcp"
      port       = tostring(rule.value)
      source_ips = ["0.0.0.0/0", "::/0"]
    }
  }
}

resource "hcloud_firewall_attachment" "this" {
  firewall_id = hcloud_firewall.this.id
  server_ids  = [hcloud_server.this.id]
}

# modules/server/outputs.tf
output "ipv4_address" {
  value = hcloud_server.this.ipv4_address
}

output "server_id" {
  value = hcloud_server.this.id
}

Calling the Module

# environments/staging/main.tf
module "web" {
  source = "../../modules/server"

  name          = "web-staging"
  environment   = "staging"
  server_type   = "cx22"
  allowed_ports = [80, 443]
}

module "api" {
  source = "../../modules/server"

  name          = "api-staging"
  environment   = "staging"
  server_type   = "cx32"
  allowed_ports = [8080]
}

# reference module outputs
output "web_ip" {
  value = module.web.ipv4_address
}

Using Registry Modules

The Terraform Registry has verified modules for common patterns: VPCs, EKS clusters, RDS instances. Use them instead of reinventing.

# use the official AWS VPC module
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.5.1"   # always pin the version

  name = "main-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = true   # cost optimization: use one shared NAT gateway

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

Level 4: Multi-Environment Workflows

The most common mistake in Terraform is treating staging and production as the same configuration with different variable values. They are not. They can diverge significantly in resource types, sizes, replica counts, and external integrations. The right model separates them structurally.

Separate Directories Per Environment

environments/
  staging/
    backend.tf      # backend "s3" { key = "staging/terraform.tfstate" }
    main.tf         # module calls with staging-specific inputs
    variables.tf
    terraform.tfvars
  production/
    backend.tf      # backend "s3" { key = "production/terraform.tfstate" }
    main.tf         # module calls with production-specific inputs
    variables.tf
    terraform.tfvars

This means separate state files, separate plan/apply cycles, and no risk of accidentally destroying production while targeting staging.

Workspaces (When to Use Them and When Not To)

Terraform workspaces allow multiple state files within a single backend configuration. They look appealing as a "one config, multiple environments" solution:

terraform workspace new staging
terraform workspace new production
terraform workspace select staging
terraform apply

# conditional resource sizing based on workspace
resource "hcloud_server" "web" {
  # ...
  server_type = terraform.workspace == "production" ? "cx32" : "cx22"
}

Workspaces work well for truly identical environments that differ only in scale, such as ephemeral feature-branch environments. They work poorly for staging vs production, because the conditional logic accumulates into unreadable configuration. Use separate directories for long-lived, architecturally different environments. Use workspaces for short-lived, structurally identical ones.
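The short-lived case looks like this in practice; the workspace name below is illustrative:

```shell
# spin up an ephemeral review environment
terraform workspace new review-my-feature
terraform apply -auto-approve

# tear it down when the branch is merged
terraform destroy -auto-approve
terraform workspace select default
terraform workspace delete review-my-feature
```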

Variable Files Per Environment

When using a shared configuration with tfvars files:

# apply with environment-specific variables
terraform apply -var-file="staging.tfvars"
terraform apply -var-file="production.tfvars"

# staging.tfvars
environment  = "staging"
server_type  = "cx22"
replica_count = 1
enable_backups = false
database_size  = "small"

# production.tfvars
environment  = "production"
server_type  = "cx32"
replica_count = 3
enable_backups = true
database_size  = "large"

Level 5: Advanced HCL Patterns

for_each and count

count creates multiple identical copies of a resource. for_each creates one resource per item in a map or set, with each resource independently addressable.

# count: creates 3 servers addressed worker[0], worker[1], worker[2]
resource "hcloud_server" "worker" {
  count       = var.worker_count
  name        = "worker-${count.index}"
  server_type = "cx22"
  image       = "ubuntu-24.04"
}

# for_each: creates servers with meaningful addresses: services["web"], services["api"]
resource "hcloud_server" "services" {
  for_each = {
    web = { type = "cx22", location = "nbg1" }
    api = { type = "cx32", location = "fsn1" }
  }

  name        = each.key
  server_type = each.value.type
  location    = each.value.location
  image       = "ubuntu-24.04"
}

# reference: hcloud_server.services["web"].ipv4_address

Prefer for_each over count for anything that is not a simple quantity. With count, removing an element from the middle renumbers all subsequent elements, which causes Terraform to destroy and recreate them. With for_each, each element has a stable key.
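Rewriting the worker example with for_each over a set keeps addresses stable, so dropping one name later destroys only that one server:

```hcl
# addresses become hcloud_server.worker["worker-a"], ["worker-b"], ["worker-c"]
resource "hcloud_server" "worker" {
  for_each    = toset(["worker-a", "worker-b", "worker-c"])
  name        = each.key
  server_type = "cx22"
  image       = "ubuntu-24.04"
}
```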

Dynamic Blocks

variable "security_group_rules" {
  type = list(object({
    port     = number
    protocol = string
    cidr     = string
  }))
}

resource "hcloud_firewall" "main" {
  name = "main"

  dynamic "rule" {
    for_each = var.security_group_rules
    content {
      direction  = "in"
      protocol   = rule.value.protocol
      port       = tostring(rule.value.port)
      source_ips = [rule.value.cidr]
    }
  }
}

Local Values

Locals reduce repetition and make expressions readable:

locals {
  common_tags = {
    environment = var.environment
    managed-by  = "terraform"
    project     = "my-app"
    team        = "platform"
  }

  server_name = "${var.project}-${var.environment}-${var.region}"

  is_production = var.environment == "production"
}

resource "hcloud_server" "web" {
  name   = local.server_name
  labels = local.common_tags
}

Data Sources

Data sources read existing infrastructure that Terraform did not create. They let you reference resources managed elsewhere: by another team, by another Terraform configuration, or by hand.

# read an existing SSH key by name (created by another team)
data "hcloud_ssh_key" "ops_team" {
  name = "ops-team-2025"
}

# use it in a new resource
resource "hcloud_server" "web" {
  ssh_keys = [data.hcloud_ssh_key.ops_team.id]
}

# read the latest Ubuntu image ID dynamically
data "hcloud_image" "ubuntu_24" {
  name = "ubuntu-24.04"
  type = "system"
}

resource "hcloud_server" "web" {
  image = data.hcloud_image.ubuntu_24.id   # resolves the current ubuntu-24.04 image at plan time
}

depends_on

Terraform builds a dependency graph from references between resources. Explicit depends_on is only needed when a dependency exists that Terraform cannot infer from resource references:

resource "hcloud_server" "app" {
  # ...
  depends_on = [hcloud_firewall_attachment.base]
  # needed because app must start after firewall is attached,
  # but the app resource has no direct reference to the firewall
}

Use depends_on sparingly. If you find yourself needing it often, it usually means the resource references are not structured correctly.


Level 6: Terraform in Agile Teams

The Terraform Sprint Cycle

In a sprint-based team, infrastructure changes follow a rhythm that must integrate with the feature development cycle without becoming a bottleneck.

The two failure modes are: infrastructure changes blocking feature work (infra team is slow, PR reviews take days) and feature work breaking infrastructure (developers merge infra changes without review).

The solution is treating infrastructure as code with the same review standards as application code, but with one additional step: the plan output is part of the PR review.

Sprint planning → infrastructure tasks created alongside feature tasks
      ↓
Developer creates infra branch → opens PR → CI runs terraform plan
      ↓
PR includes plan output in comments → reviewer reads plan, not just code
      ↓
Merge to main → CI applies to staging automatically
      ↓
Sprint review → staging is the demo environment
      ↓
Release → CI applies to production with manual approval gate

GitHub Actions: Automated Plan on PR

# .github/workflows/terraform.yaml
name: Terraform

on:
  pull_request:
    paths:
      - "infrastructure/**"
  push:
    branches: [main]
    paths:
      - "infrastructure/**"

env:
  TF_VERSION: "1.7.3"
  AWS_REGION: "us-east-1"

jobs:
  plan:
    name: Plan
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    permissions:
      contents: read
      pull-requests: write
      id-token: write   # for OIDC auth to AWS

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/TerraformCIReadOnly
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Init
        working-directory: infrastructure/environments/staging
        run: terraform init

      - name: Terraform Plan
        id: plan
        working-directory: infrastructure/environments/staging
        run: |
          terraform plan -no-color -out=tfplan 2>&1 | tee plan.txt
          echo "exitcode=${PIPESTATUS[0]}" >> $GITHUB_OUTPUT

      - name: Post Plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('infrastructure/environments/staging/plan.txt', 'utf8');
            const truncated = plan.length > 60000 ? plan.slice(0, 60000) + '\n\n... truncated ...' : plan;
            const body = `## Terraform Plan β€” staging\n\`\`\`\n${truncated}\n\`\`\``;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });

  apply-staging:
    name: Apply to Staging
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    environment: staging
    permissions:
      contents: read
      id-token: write

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/TerraformCIApply
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Init
        working-directory: infrastructure/environments/staging
        run: terraform init

      - name: Terraform Apply
        working-directory: infrastructure/environments/staging
        run: terraform apply -auto-approve

  apply-production:
    name: Apply to Production
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    needs: [apply-staging]
    environment: production   # requires manual approval in GitHub Environments
    permissions:
      contents: read
      id-token: write

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/TerraformCIApply
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Init
        working-directory: infrastructure/environments/production
        run: terraform init

      - name: Terraform Apply
        working-directory: infrastructure/environments/production
        run: terraform apply -auto-approve

OIDC Authentication (No Long-Lived Credentials)

The CI pipeline above uses OIDC to assume an AWS IAM role instead of storing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as secrets. This is the correct approach. Long-lived keys get rotated infrequently, leak into logs, and have broad permissions. OIDC tokens are short-lived and scoped to the specific GitHub repository and branch.

// IAM Trust Policy for the TerraformCIReadOnly role
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:pull_request"
        }
      }
    }
  ]
}

Two roles, two trust policies, two permission sets:

  • TerraformCIReadOnly: assumed on pull_request, can only read state and run plans (no write permissions to infrastructure)
  • TerraformCIApply: assumed on push to main, has write permissions, only trusted for the main branch
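The OIDC provider itself can also be managed with Terraform. A sketch using the AWS provider; the thumbprint is GitHub's published value at the time of writing and should be verified against current documentation:

```hcl
resource "aws_iam_openid_connect_provider" "github" {
  url            = "https://token.actions.githubusercontent.com"
  client_id_list = ["sts.amazonaws.com"]

  # assumption: verify this thumbprint before use
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}
```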

Team Review Conventions

The PR checklist for infrastructure changes:

## Infrastructure PR Checklist

- [ ] `terraform fmt` was run and the code is formatted
- [ ] `terraform validate` passes
- [ ] Plan output is attached and reviewed
- [ ] Plan shows no unexpected destroys
- [ ] Variable changes are documented in the PR description
- [ ] Secrets are not in the plan output
- [ ] The change has been verified in a local or branch environment first
- [ ] Rollback plan is documented if the change is destructive

Handling Destructive Changes

Terraform will sometimes propose to destroy and recreate a resource when an in-place update is not possible. Common examples: renaming a server (name is immutable), changing an image type (requires new VM), modifying certain database parameters.

Never let a destructive change to a production database or server merge silently. Protect critical resources:

resource "hcloud_server" "web" {
  # ...

  lifecycle {
    prevent_destroy = true   # terraform apply will error if a destroy is planned

    ignore_changes = [
      labels,    # ignore label changes managed externally
    ]

    create_before_destroy = true   # create the replacement before destroying the old one
  }
}

create_before_destroy is essential for servers behind a load balancer: it ensures the new server is healthy before the old one is removed, preventing downtime.


Level 7: Security and Compliance

Least Privilege for CI

The CI role should have exactly the permissions it needs. For a Hetzner + Cloudflare stack, generate a read-only API token for plans and a write token for applies; never give CI the full admin token.

For AWS, scope the IAM policy to the specific resources Terraform manages:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "ec2:CreateInstance",
        "ec2:TerminateInstances",
        "ec2:ModifyInstanceAttribute",
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": [
        "arn:aws:ec2:*:*:instance/*",
        "arn:aws:s3:::your-org-terraform-state/*",
        "arn:aws:dynamodb:*:*:table/terraform-state-locks"
      ]
    }
  ]
}

Static Analysis with tfsec and checkov

Run security analysis on your Terraform code before applying:

# tfsec: scans for security misconfigurations
brew install tfsec
tfsec .

# checkov: broader policy-as-code, covers Terraform + Kubernetes + Dockerfiles
pip install checkov
checkov -d .

Add both to the CI plan job:

- name: Run tfsec
  uses: aquasecurity/tfsec-action@v1.0.0
  with:
    working_directory: infrastructure/

- name: Run checkov
  uses: bridgecrewio/checkov-action@v12
  with:
    directory: infrastructure/
    framework: terraform

Drift Detection

Drift happens when someone modifies infrastructure outside of Terraform: through a web console, a manual CLI command, or an automated process. The state file no longer matches reality.

Detect drift regularly:

# compare state against real infrastructure
terraform plan -refresh-only   # the old `terraform refresh` is deprecated in favor of this
terraform plan                 # if plan shows changes when no HCL changed, drift has occurred

In CI, run a scheduled plan on production to detect drift:

# .github/workflows/drift-detection.yaml
on:
  schedule:
    - cron: "0 9 * * 1-5"   # Monday to Friday at 9 AM

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      # ... init steps ...
      - name: Detect Drift
        run: |
          terraform plan -detailed-exitcode
          # exit code 0 = no changes (no drift)
          # exit code 1 = error
          # exit code 2 = changes detected (drift)

Level 8: Scaling to Multiple Teams

When multiple teams own different parts of the infrastructure, sharing a single Terraform configuration creates ownership conflicts, blast radius problems, and slow CI pipelines.

State Splitting by Domain

Split state along ownership boundaries, not just environments:

state files:
  networking/staging/terraform.tfstate      ← VPCs, subnets, DNS zones
  networking/production/terraform.tfstate
  platform/staging/terraform.tfstate        ← K3s clusters, databases
  platform/production/terraform.tfstate
  app-team-a/staging/terraform.tfstate      ← team A's resources
  app-team-a/production/terraform.tfstate
  app-team-b/staging/terraform.tfstate
  app-team-b/production/terraform.tfstate

Teams apply their own state independently. The networking team owns the VPCs. The platform team owns the clusters. App teams own their own deployments within the cluster.

Remote State References

When one configuration needs values from another, use terraform_remote_state instead of duplicating outputs or hardcoding IDs:

# platform team reads VPC IDs from networking team's state
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "your-org-terraform-state"
    key    = "networking/staging/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_eks_cluster" "main" {
  vpc_config {
    subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
  }
}

Terragrunt for DRY Multi-Environment Configs

When separate directories lead to too much repeated backend and provider configuration, terragrunt adds a thin wrapper that eliminates the boilerplate:

# terragrunt.hcl (root config)
remote_state {
  backend = "s3"
  config = {
    bucket         = "your-org-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}

# environments/staging/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

inputs = {
  environment   = "staging"
  instance_type = "t3.small"
}

With Terragrunt, terragrunt run-all plan runs plans across all environments in dependency order.
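The dependency order comes from explicit dependency blocks between units. A sketch, where the paths and output name are assumptions:

```hcl
# environments/staging/platform/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

# run-all plans networking before platform because of this block
dependency "networking" {
  config_path = "../networking"
}

inputs = {
  private_subnet_ids = dependency.networking.outputs.private_subnet_ids
}
```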


The Infrastructure Contract

Terraform is not just a deployment tool. It is a contract between your team and your infrastructure. Everything in that contract is explicit, versioned, and reviewable. Everything outside it is a liability.

The discipline that matters is not knowing every Terraform command; it is never making a change to production infrastructure that is not represented in the code. Not once. The moment you log into a console and click something, you have broken the contract. The state drifts. The next plan is wrong. The next apply is unpredictable.

The habit of reaching for a pull request instead of a web console is what separates infrastructure that is maintainable from infrastructure that is held hostage by whoever provisioned it last.

You do not own your infrastructure if you cannot recreate it from code. Version control is not just for applications; it is for everything that can break in production.

Tags

#devops #guide #best-practices #ci-cd #tutorial #senior #backend