Tutorial · September 15, 2025 · 8 min read

Terraform Best Practices: Lessons from Someone Who Has Absolutely Never Deleted Production

Three years of Terraform lessons — anchored by a production incident that may or may not have happened, featuring an RDS database that may or may not have been destroyed. Hypothetically.

#Terraform #IaC #BestPractices #Cloud



Infrastructure as code — building systems that build themselves



Let me tell you about the worst afternoon of my professional life.


Or rather — let me tell you about a hypothetical afternoon. One that may or may not have involved a production RDS instance, a wrong Terraform workspace, and the slow, dawning horror of watching a destroy plan execute while my soul quietly left my body.


Did it happen to me? I'm not saying it did. I'm also not saying it didn't. What I will say is: I have been extraordinarily careful with terraform apply for the past three years, and there is a very specific reason for that. Whether that reason is lived trauma or an abundance of professional caution is something we may never resolve.


What I can tell you is this: somewhere in the multiverse, a version of me typed "yes" into a terminal, watched Terraform destroy a production RDS instance with 18 months of user data, and spent the next 4 hours recovering from a snapshot that was 6 hours old. Maybe that person was me. Maybe I was simply too chicken to ever let it get that far.


Either way — in case this ever happens to you, hypothetically — here is everything I know about never letting it happen again.


Directory Structure: Monorepo with Clear Separation


The first thing that enables everything else is a clean directory structure. After trying several approaches, this is what I settled on:


text
infrastructure/
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── eks/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── rds/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
└── environments/
    ├── staging/
    │   ├── main.tf
    │   ├── variables.tf
    │   ├── terraform.tfvars
    │   └── backend.tf
    └── production/
        ├── main.tf
        ├── variables.tf
        ├── terraform.tfvars
        └── backend.tf

Modules live in modules/ and contain no environment-specific configuration. Each environment directory under environments/ calls those modules with environment-specific values. This separation means that to deploy something to staging, you work in environments/staging/. To deploy to production, you work in environments/production/. There is no workspace switching. The directory you are in is the environment you are affecting. This alone would have prevented my entirely hypothetical disaster.
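To make that concrete, here is a sketch of what an environment's root module might look like (the module inputs and variable names here are illustrative, not drop-in values):

```hcl
# environments/production/main.tf (sketch; variable names are hypothetical)
module "vpc" {
  source      = "../../modules/vpc"
  environment = var.environment
  cidr_block  = var.vpc_cidr
}

module "rds" {
  source              = "../../modules/rds"
  identifier          = "app-${var.environment}"
  instance_class      = var.rds_instance_class
  multi_az            = var.rds_multi_az
  deletion_protection = var.rds_deletion_protection

  # Cross-module wiring happens here, through module outputs
  subnet_ids = module.vpc.private_subnet_ids
}
```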


Remote State: S3 Backend with DynamoDB Locking


Local state files are a team antipattern. The moment two people try to run terraform apply at the same time against local state, you have a corrupted state file and a bad time.


Store your state remotely. In AWS, the standard setup is an S3 bucket for state storage and a DynamoDB table for state locking.


hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "environments/production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
    kms_key_id     = "arn:aws:kms:us-east-1:123456789012:key/your-kms-key-id"
  }

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  required_version = ">= 1.6.0"
}

The DynamoDB table needs a single attribute: LockID (String), set as the partition key. When Terraform runs, it acquires a lock in this table. If another process tries to run at the same time, it sees the lock and waits or exits. This prevents concurrent apply operations from corrupting state.
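The lock table itself is typically created once as a bootstrap step, outside the state it protects, since the backend must exist before terraform init can use it. A minimal sketch:

```hcl
# Bootstrap-only resource: create this before pointing any backend at it.
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST" # locking traffic is tiny; no provisioned capacity needed
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```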


State management rules I enforce strictly:

  • Enable versioning on the S3 bucket so you can restore a previous state file if something goes wrong
  • Enable server-side encryption on the bucket
  • Never edit the state file manually (use terraform state mv, terraform state rm for surgery)
  • Set up automated S3 replication to a second bucket in a different region for disaster recovery

Module Design: One Module Per Logical Unit


    A good module is like a good function: it does one thing, has clearly defined inputs and outputs, and hides its internal complexity.


    One module per logical infrastructure unit. The VPC module handles the VPC, subnets, route tables, NAT gateways, and internet gateway. The EKS module handles the cluster, node groups, and IAM roles. The RDS module handles the database, subnet groups, and parameter groups. Nothing leaks across module boundaries.


    hcl
    # modules/rds/variables.tf
    variable "identifier" {
      description = "Unique identifier for the RDS instance"
      type        = string
    }
    
    variable "engine_version" {
      description = "PostgreSQL engine version"
      type        = string
      default     = "15.4"
    }
    
    variable "instance_class" {
      description = "RDS instance type"
      type        = string
    }
    
    variable "allocated_storage" {
      description = "Initial storage in GB"
      type        = number
      default     = 100
    }
    
    variable "multi_az" {
      description = "Enable Multi-AZ for high availability"
      type        = bool
      default     = true
    }
    
    variable "deletion_protection" {
      description = "Prevent accidental deletion"
      type        = bool
      default     = true
    }

    Note that deletion_protection defaults to true. Someone added that after an incident. Very professionally. Any module that can destroy data should have deletion protection on by default, with the caller explicitly setting it to false only in non-production environments.
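For context, here is an abbreviated sketch of how the module itself might consume these variables (networking and credential arguments are omitted for brevity):

```hcl
# modules/rds/main.tf (abbreviated sketch)
resource "aws_db_instance" "main" {
  identifier          = var.identifier
  engine              = "postgres"
  engine_version      = var.engine_version
  instance_class      = var.instance_class
  allocated_storage   = var.allocated_storage
  multi_az            = var.multi_az
  deletion_protection = var.deletion_protection

  # A final snapshot is the last line of defense if protection is
  # ever disabled and the instance gets destroyed anyway.
  skip_final_snapshot       = false
  final_snapshot_identifier = "${var.identifier}-final"
}
```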


    Module versioning: if you publish modules to a private Terraform registry or reference them via Git tags, pin to a specific version. Do not use HEAD or latest. Breaking changes in a module should require a deliberate version bump in the caller.


    Variable Hierarchy: Never Hardcode Anything


    Every value that differs between environments belongs in a variable. Every value that differs between runs belongs in a variable. Nothing is hardcoded in main.tf.


    hcl
    # environments/production/terraform.tfvars
    aws_region            = "us-east-1"
    environment           = "production"
    eks_cluster_version   = "1.29"
    eks_node_instance_types = ["m6i.xlarge", "m6i.2xlarge"]
    rds_instance_class    = "db.r6g.xlarge"
    rds_multi_az          = true
    rds_deletion_protection = true

    hcl
    # environments/staging/terraform.tfvars
    aws_region            = "us-east-1"
    environment           = "staging"
    eks_cluster_version   = "1.29"
    eks_node_instance_types = ["t3.large"]
    rds_instance_class    = "db.t3.medium"
    rds_multi_az          = false
    rds_deletion_protection = false

    For secrets (database passwords, API keys), do not put them in tfvars files. Reference them from AWS Secrets Manager or SSM Parameter Store using data sources:


    hcl
    data "aws_ssm_parameter" "db_password" {
      name            = "/production/rds/master-password"
      with_decryption = true
    }

    Workspace Strategy vs Directory Separation


    Terraform workspaces allow a single configuration to manage multiple state files. You switch between them with terraform workspace select production. This is what the hypothetical version of me was incorrectly using when the hypothetical disaster hypothetically occurred.


    My current recommendation: use directory separation for major environment boundaries (production, staging, dev), not workspaces. The cognitive overhead of remembering which workspace you are in is too high, and the consequences of being in the wrong workspace are severe.


    Use workspaces for minor variations within an environment — for example, if you need to spin up a temporary clone of staging for a load test. In that case, workspaces with a clearly named prefix (loadtest-week23) make sense. But production vs staging should be different directories.


    If you do use workspaces, add this to every main.tf that manages production resources:


    hcl
    locals {
      is_production = terraform.workspace == "production"
    }
    
    resource "aws_db_instance" "main" {
      # ...
      deletion_protection = local.is_production
      multi_az            = local.is_production
    }

    At minimum, make the consequences of being in the wrong workspace visible in the plan output.
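Terraform 1.4 and later allow a harder stop: a precondition that fails the plan outright when the workspace does not match what the configuration expects. A sketch (the variable name is illustrative):

```hcl
variable "expected_environment" {
  type    = string
  default = "production"
}

# terraform_data is a built-in resource, so no extra provider is needed.
resource "terraform_data" "workspace_guard" {
  lifecycle {
    precondition {
      condition     = terraform.workspace == var.expected_environment
      error_message = "Terraform workspace does not match the environment this directory manages."
    }
  }
}
```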


    Plan Before Every Apply: Make It Mandatory in CI


    terraform plan should run before every terraform apply. Not sometimes. Always. And the plan output should be reviewed — not just "yes it passed," but actually read.


    In CI, this means: run terraform plan on every pull request that touches infrastructure, post the plan output as a PR comment, and only run terraform apply after the PR is merged to main.


    Here is the GitHub Actions workflow I use:


    yaml
    name: Terraform
    
    on:
      pull_request:
        paths:
          - 'infrastructure/**'
      push:
        branches:
          - main
        paths:
          - 'infrastructure/**'
    
    jobs:
      terraform:
        name: Terraform Plan / Apply
        runs-on: ubuntu-latest
        permissions:
          id-token: write
          contents: read
          pull-requests: write
    
        steps:
          - uses: actions/checkout@v4
    
          - name: Configure AWS Credentials
            uses: aws-actions/configure-aws-credentials@v4
            with:
              role-to-assume: arn:aws:iam::123456789012:role/TerraformCIRole
              aws-region: us-east-1
    
          - name: Setup Terraform
            uses: hashicorp/setup-terraform@v3
            with:
              terraform_version: 1.6.6
    
          - name: Terraform Init
            run: terraform init
            working-directory: infrastructure/environments/production
    
          - name: Terraform Validate
            run: terraform validate
            working-directory: infrastructure/environments/production
    
          - name: Terraform Plan
            id: plan
            run: terraform plan -out=tfplan -no-color
            working-directory: infrastructure/environments/production
    
          - name: Post Plan to PR
            if: github.event_name == 'pull_request'
            uses: actions/github-script@v7
            with:
              script: |
                const output = `#### Terraform Plan
                \`\`\`
                ${{ steps.plan.outputs.stdout }}
                \`\`\``;
                github.rest.issues.createComment({
                  issue_number: context.issue.number,
                  owner: context.repo.owner,
                  repo: context.repo.repo,
                  body: output
                })
    
          - name: Terraform Apply
            if: github.event_name == 'push' && github.ref == 'refs/heads/main'
            run: terraform apply tfplan
            working-directory: infrastructure/environments/production

    The IAM role used by CI should have only the permissions needed to manage your specific resources. Not AdministratorAccess. Use OIDC-based federation so there are no long-lived access keys to rotate or leak.
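As a sketch of that federation (the account ID and repository name are placeholders), the CI role's trust policy can restrict role assumption to workflows running on this repository's main branch:

```hcl
# Illustrative OIDC trust policy for the CI role.
data "aws_iam_policy_document" "ci_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = ["arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only workflows from this repo's main branch may assume the role.
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:mycompany/infrastructure:ref:refs/heads/main"]
    }
  }
}
```

Attach this as the role's assume-role policy; the permissions policy itself should then be scoped to the specific services Terraform manages.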


    Drift Detection


    Real infrastructure drifts. Someone makes a manual change in the console "just this once." A vendor updates a managed resource. An automated process modifies a tag. Drift is normal. The problem is when drift accumulates silently.


    Run terraform plan -detailed-exitcode on a schedule (nightly or weekly) and alert when exit code is 2 (changes detected). This is your drift detector.


    bash
    terraform plan -detailed-exitcode
    # Exit code 0: no changes
    # Exit code 1: error
    # Exit code 2: changes detected

    Wire this into your alerting. A "drift detected" alert at 2am is much better than discovering that your "infrastructure as code" no longer matches reality when you try to recreate an environment.
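A scheduled workflow is one way to wire this up. This sketch reuses the same OIDC role as the main pipeline and fails the job (triggering whatever failure alerting you already have) when drift is found:

```yaml
# Illustrative nightly drift check; role ARN and paths are placeholders.
name: Drift Detection

on:
  schedule:
    - cron: '0 2 * * *' # 02:00 UTC nightly

jobs:
  drift:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/TerraformCIRole
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3

      - run: terraform init
        working-directory: infrastructure/environments/production

      - name: Detect drift
        working-directory: infrastructure/environments/production
        run: |
          set +e
          terraform plan -detailed-exitcode -no-color
          code=$?
          if [ "$code" -eq 2 ]; then
            echo "::error::Infrastructure drift detected"
          fi
          exit "$code"
```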


    Common Mistakes I See Constantly


    count vs for_each: Use for_each for maps and sets of strings. count creates indexed resources (aws_iam_user.users[0], aws_iam_user.users[1]). If you insert an item at the beginning of the list, all indexes shift and Terraform wants to recreate everything. for_each creates resources keyed by map key or set value, so insertions do not cause unnecessary recreation.


    hcl
    # Wrong for dynamic sets — index shifting causes chaos
    resource "aws_iam_user" "users" {
      count = length(var.user_names)
      name  = var.user_names[count.index]
    }
    
    # Right — keyed by name, stable through insertions
    resource "aws_iam_user" "users" {
      for_each = toset(var.user_names)
      name     = each.value
    }

    Not using data sources: If you need to reference something that exists outside your current configuration — an AMI ID, a Route53 zone, another team's VPC — use a data source. Do not hardcode the ID. Hardcoded IDs break when you change regions, run in a new account, or when the underlying resource is recreated.
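A typical example is looking up the latest Amazon Linux 2023 AMI rather than hardcoding an ID (the filter pattern here is illustrative):

```hcl
data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

# Referenced as: ami = data.aws_ami.al2023.id
```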


    Circular dependencies: Terraform builds a dependency graph from your resource references. If resource A references resource B and resource B references resource A, you get a circular dependency error. The fix is usually to extract the shared dependency into a separate resource or use depends_on explicitly to break the cycle.
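The canonical case is two security groups that allow traffic from each other via inline rules, so each references the other. Extracting the rules into standalone resources breaks the cycle: both groups are created first, and the rules attach afterward. A sketch with illustrative names:

```hcl
resource "aws_security_group" "app" {
  vpc_id = var.vpc_id
}

resource "aws_security_group" "db" {
  vpc_id = var.vpc_id
}

# Standalone rule: depends on both groups, but neither group depends on it.
resource "aws_security_group_rule" "db_from_app" {
  type                     = "ingress"
  security_group_id        = aws_security_group.db.id
  source_security_group_id = aws_security_group.app.id
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
}
```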


    The Import and Replace Commands


    When you take over an existing AWS account or need to bring manually created resources under Terraform management, use terraform import. It pulls the existing resource into your state file so Terraform knows about it.


    bash
    # Import an existing S3 bucket into Terraform state
    terraform import aws_s3_bucket.uploads my-existing-bucket-name
    
    # Import an existing RDS instance
    terraform import aws_db_instance.main mydb-production
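Terraform 1.5 and later also support declarative import blocks, which go through the normal plan/apply review instead of mutating state directly:

```hcl
# Declarative equivalent of the S3 import above.
import {
  to = aws_s3_bucket.uploads
  id = "my-existing-bucket-name"
}
```

Running terraform plan -generate-config-out=generated.tf will even write starter HCL for imported resources that have no configuration yet.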

    When you need to force-replace a resource (for example, an EC2 instance that is misbehaving and needs to be freshly provisioned), use terraform apply -replace instead of the old terraform taint:


    bash
    terraform apply -replace="aws_instance.web_server"

    This marks a specific resource for replacement in the next plan/apply cycle without destroying and recreating everything else.


    What I Would Do Differently. Hypothetically.


    After the RDS incident that definitely did or did not happen, I adopted several non-negotiable rules — the kind of rules a person arrives at either through hard experience or through being extremely, almost suspiciously, cautious from day one:


  • Every production database resource has deletion_protection = true
  • No one runs terraform apply locally against production. Ever. CI/CD only.
  • Every apply requires a reviewed plan. The plan is saved to a file and the apply references that file — so what you approved is exactly what gets applied. No surprises.
  • State backups are automated and tested monthly (actually restore from them periodically, or they don't count).
  • Every production apply sends a Slack notification with the plan summary. The team should know when infrastructure changes. Always.

    Whether these rules came from 4 hours of downtime and a very uncomfortable conversation with a CTO, or from pure foresight and good instincts — I'll leave that to your imagination.


    Measure twice. Apply once. And for the love of everything, check your workspace before you type "yes."