How to Build a Safe Terraform Apply Workflow on AWS: Approval Gates, Plan Review, and Rollback

Infrastructure 10 min read

Quick summary: One bad `terraform apply` can delete your database, destroy your application load balancer, or lock your team out of AWS. This guide covers the approval gates, plan review processes, and safety tools that prevent infrastructure disasters.


Somewhere, right now, someone ran terraform apply -auto-approve in a production Terraform configuration and didn’t realize it would destroy a database with customer data.

It happens. And it happens because teams optimize for speed without considering the cost of a mistake.

Terraform makes infrastructure changes easy—maybe too easy. A developer can run terraform apply locally and reshape your entire production environment in seconds, without review, without approval, without anyone knowing it happened.

This guide covers how to build safe apply workflows that are fast enough for real work while being careful enough that you sleep at night.

The Cost of a Bad Apply

Let’s quantify what happens when Terraform goes wrong:

Real scenario 1: A developer refactors a resource name. Terraform doesn’t see a rename; it sees the old resource disappearing and a new one appearing. Without care, terraform apply destroys the old RDS database and creates a new one. Data loss. Recovery from backup takes 6 hours. The incident costs $200k+ in business impact.

Real scenario 2: A new engineer on the team runs terraform apply on a production branch without realizing they’re logged into the wrong AWS account. Resources are destroyed in the wrong environment. Time to recovery: 3 hours. Customer impact: 2 hours of downtime.

Real scenario 3: A team member makes a CLI typo in a variable value. The typo deploys to production. A security group rule is opened to the world. You don’t find out until the next day’s security audit.

The cost of prevention—adding an approval step, having someone review the plan, blocking -auto-approve in production—is measured in minutes. The cost of failure is measured in hours and thousands of dollars.

The 3-Gate Model: Plan → Review → Apply

A safe workflow has three gates:

Gate 1: Plan (What Will Change?)

terraform plan -out=tfplan

Output the plan to a file. Never rely on console-only output (which scrolls away and is hard to review).

The plan shows:

  # aws_db_instance.main will be destroyed
  - resource "aws_db_instance" "main" {
      ...
    }

  # aws_security_group.app will be updated in-place
  ~ resource "aws_security_group" "app" {
      ~ ingress {
          + cidr_blocks = ["0.0.0.0/0"]
            from_port   = 443
            to_port     = 443
        }
    }

A reviewer should read this and say “yes, this is what I expected” or “wait, why is the database being destroyed?”

Plan safety tips:

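  • Always output to a file (terraform apply tfplan then executes exactly the changes that were reviewed; console output can’t be applied)
  • Store the plan as a CI/CD artifact so there’s an audit trail (plan files can contain secrets in cleartext, so restrict access and keep them out of Git)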
  • If the plan is larger than 100 lines, display it in a tool that’s designed for reading (not a text scroll)
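
For long plans, a one-line summary of actions can help a reviewer decide where to look first. The sketch below counts the resource header comments in human-readable plan text, such as the output of terraform show -no-color tfplan. The function name summarize_plan and the sample resources are hypothetical, and matching on header wording is an assumption about the text format rather than a stable interface.

```bash
#!/usr/bin/env bash
# Count planned actions from human-readable plan text on stdin.
# Assumption: resource headers look like "# <address> will be created".
summarize_plan() {
  awk '
    /^ *# .* will be created$/          { create++ }
    /^ *# .* will be updated in-place$/ { update++ }
    /^ *# .* will be destroyed$/        { destroy++ }
    /^ *# .* must be replaced$/         { replace++ }
    END {
      printf "create=%d update=%d destroy=%d replace=%d\n",
             create, update, destroy, replace
    }
  '
}

# Example with inline sample text (hypothetical resources).
# In a pipeline: terraform show -no-color tfplan | summarize_plan
summarize_plan <<'EOF'
  # aws_db_instance.main will be destroyed
  # aws_security_group.app will be updated in-place
  # aws_s3_bucket.backup will be created
EOF
```

Any non-zero destroy or replace count is a signal to read those resources in full before approving.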

Gate 2: Review (Is This Actually Safe?)

A human reads the plan. Not the person who wrote the code, but someone else. Ideally someone senior.

A reviewer should ask:

  • “Are any critical resources being destroyed?” (databases, load balancers, security groups)
  • “Are any IAM permissions being changed?” (could break applications)
  • “Are any resource replacements happening?” (which means downtime)
  • “Does this match the ticket/PR description?”

The review happens before apply. The review blocks apply if something looks wrong.

Gate 3: Apply (Make It Happen)

Only after review approval does the apply happen. And it should happen:

  • In CI/CD, not on a developer’s laptop
  • With audit logging (who applied it, when, what changed)
  • With the exact plan that was reviewed (not a fresh plan that could be different)

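Terraform supports this with terraform apply tfplan, which executes exactly the changes recorded in the plan file. If the state has changed since the plan was created, Terraform rejects the plan as stale. Plan files are not signed, so protect them with access controls on your CI/CD artifacts.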
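
A checksum recorded at review time is one way to confirm that the file handed to terraform apply is the file that was reviewed. This is a minimal sketch, assuming sha256sum is available on the CI runner; the function names and the tfplan.sha256 path are hypothetical.

```bash
#!/usr/bin/env bash
# usage: record_plan_digest <plan-file> <digest-file>
# Run in the plan job, after the plan is written.
record_plan_digest() {
  sha256sum "$1" | awk '{ print $1 }' > "$2"
}

# usage: verify_plan_digest <plan-file> <digest-file>
# Run in the apply job; returns non-zero if the plan file changed.
verify_plan_digest() {
  local expected actual
  expected=$(cat "$2") || return 1
  actual=$(sha256sum "$1" | awk '{ print $1 }')
  if [[ "$expected" != "$actual" ]]; then
    echo "plan file does not match the reviewed digest" >&2
    return 1
  fi
}

# In the apply job (paths are hypothetical):
#   verify_plan_digest tfplan tfplan.sha256 && terraform apply tfplan
```

The digest is only as trustworthy as the place it is stored, so keep it somewhere the apply job can read but contributors cannot overwrite.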

What to Audit in a Terraform Plan

Not everything in a plan is dangerous, but some things are red flags.

Red Flag 1: Resource Destruction

  # aws_db_instance.main will be destroyed

Databases should never be destroyed by accident. If you see a database destruction, pause and understand why:

  • Is it a resource rename? (In which case, use a moved block or terraform state mv so Terraform tracks the rename instead of destroying the resource)
  • Is it a legitimate decommissioning? (In which case, require extra approvals)
  • Is it a mistake in the code change? (Fix and re-plan)

Red Flag 2: Resource Replacement

  # aws_db_instance.main must be replaced
  -/+ resource "aws_db_instance" "main" {

This is dangerous because it means downtime: by default the old resource is destroyed before the new one is created. For databases, it usually means data loss too, because the new instance starts empty unless it is restored from a snapshot.
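
Destroy and replace actions on critical resources can be flagged automatically before a human reads the plan. The sketch below scans plan text for both actions and filters for a list of critical resource types. The function names and the type list are assumptions chosen for illustration; extend the list to match what your team treats as critical.

```bash
#!/usr/bin/env bash
# Print the address of every resource that the plan destroys or replaces.
# Reads human-readable plan text on stdin.
destructive_changes() {
  awk '/^ *# .* (will be destroyed|must be replaced)$/ { print $2 }'
}

# Keep only addresses whose resource type is on the critical list.
# Returns non-zero when nothing critical is affected.
critical_destructive_changes() {
  destructive_changes |
    grep -E '(^|\.)(aws_db_instance|aws_lb|aws_security_group)\.'
}

# Example with inline sample text (hypothetical resources).
sample='  # aws_db_instance.main must be replaced
  # aws_s3_bucket.logs will be destroyed
  # aws_iam_role.app will be created'
if hits=$(critical_destructive_changes <<<"$sample"); then
  echo "needs extra approval: $hits"
fi
```

A check like this supplements the human review; it only catches the resource types someone remembered to list.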

Red Flag 3: Large Security Group Changes

  ~ resource "aws_security_group" "app" {
        ~ ingress {
              + cidr_blocks = ["0.0.0.0/0"]
            }
        }

Opening access to 0.0.0.0/0 (the entire internet) should be questioned. Is this intentional?
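
A simple text check can surface this red flag in CI. The sketch below lists plan lines that add a world-open CIDR, either IPv4 or IPv6; the function name is hypothetical and the match depends on the plan's text rendering, so treat it as a hint for reviewers rather than a guarantee.

```bash
#!/usr/bin/env bash
# Print (with line numbers) every added plan line that contains a
# world-open CIDR. Lines being removed ("-") are ignored.
# Returns non-zero when no such line exists.
find_world_open_additions() {
  grep -nE '^[[:space:]]*\+.*("0\.0\.0\.0/0"|"::/0")'
}

# Example with inline sample text.
find_world_open_additions <<'EOF'
  ~ resource "aws_security_group" "app" {
      ~ ingress {
          + cidr_blocks = ["0.0.0.0/0"]
        }
    }
EOF
```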

Red Flag 4: IAM Policy Changes

  ~ resource "aws_iam_role_policy" "app_role" {
        + "s3:*"
        - "s3:GetObject"
        - "s3:PutObject"
    }

Adding broad permissions (like s3:* instead of specific actions) is a security issue.

Red Flag 5: Encryption or Backup Settings Disabled

  ~ resource "aws_db_instance" "main" {
        ~ storage_encrypted = true -> false
        ~ backup_retention_period = 30 -> 0
    }

Disabling encryption or backups is almost never intentional. Question this.
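
The two changes shown above are easy to match in plan text. This sketch flags encryption being switched off and backup retention dropping to zero; the function name is hypothetical, and the patterns assume the "old -> new" rendering used in the example.

```bash
#!/usr/bin/env bash
# Print plan lines that disable encryption or set backup retention to 0.
# Returns non-zero when neither pattern is present.
find_disabled_protections() {
  grep -nE \
    -e 'storage_encrypted[[:space:]]*=[[:space:]]*true[[:space:]]*->[[:space:]]*false' \
    -e 'backup_retention_period[[:space:]]*=[[:space:]]*[0-9]+[[:space:]]*->[[:space:]]*0([^0-9]|$)'
}

# Example with inline sample text.
find_disabled_protections <<'EOF'
  ~ resource "aws_db_instance" "main" {
      ~ storage_encrypted       = true -> false
      ~ backup_retention_period = 30 -> 0
    }
EOF
```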

Green Flag: Additive Changes Only

  + resource "aws_s3_bucket" "backup" { ... }
  + resource "aws_iam_role" "service" { ... }

Creating new resources with no changes to existing ones is low risk. These plans can be approved quickly.
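
Terraform ends its plan output with a summary line such as "Plan: 2 to add, 0 to change, 0 to destroy.", which makes additive-only plans easy to detect. The following sketch, with a hypothetical function name, could route such plans to a lighter review path. A replacement counts as one add plus one destroy in that summary, so replacements do not pass.

```bash
#!/usr/bin/env bash
# Succeeds (exit 0) when the plan summary shows no changes and no destroys.
# Returns 1 when something changes or is destroyed, and 2 when no summary
# line is found on stdin.
is_additive_only() {
  local line re='([0-9]+) to change, ([0-9]+) to destroy'
  line=$(grep -E '^Plan: ' | head -n 1) || true
  [[ -n "$line" ]] || return 2
  [[ "$line" =~ $re ]] || return 2
  [[ "${BASH_REMATCH[1]}" -eq 0 && "${BASH_REMATCH[2]}" -eq 0 ]]
}

# Example:
if echo 'Plan: 2 to add, 0 to change, 0 to destroy.' | is_additive_only; then
  echo "additive only: eligible for quick review"
fi
```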

Blocking Dangerous Commands in CI/CD

Some commands should never run in production. Set up guards:

Block -auto-approve in Production

The -auto-approve flag skips the approval step entirely. It should only exist in dev.

In your CI/CD pipeline:

if [[ "$ENVIRONMENT" == "production" ]] && [[ "$TERRAFORM_ARGS" == *"-auto-approve"* ]]; then
  echo "❌ -auto-approve is forbidden in production"
  exit 1
fi

Block terraform destroy in Production

if [[ "$ENVIRONMENT" == "production" ]] && [[ "$COMMAND" == "destroy" ]]; then
  echo "❌ terraform destroy is forbidden in production. Use the separate approval process instead."
  exit 1
fi

If you need to destroy resources in production, require a separate approval process or don’t allow it through normal CI/CD.
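
The two guards above can be combined into one function that inspects the full argument list. This sketch also catches the -destroy flag on plan and apply, which a check on the subcommand alone would miss. The function name guard_terraform_args is hypothetical, and the wrapper that calls it is assumed to be the only way your pipeline invokes Terraform.

```bash
#!/usr/bin/env bash
# usage: guard_terraform_args <environment> <terraform args...>
# Returns non-zero if the arguments are not allowed in that environment.
guard_terraform_args() {
  local environment=$1
  shift
  # Only production is restricted in this sketch.
  [[ "$environment" == "production" ]] || return 0
  local arg
  for arg in "$@"; do
    case "$arg" in
      destroy|-destroy|--destroy)
        echo "destroy operations are forbidden in production" >&2
        return 1
        ;;
      -auto-approve|--auto-approve)
        echo "-auto-approve is forbidden in production" >&2
        return 1
        ;;
    esac
  done
}

# In a wrapper script:
#   guard_terraform_args "$ENVIRONMENT" "$@" || exit 1
#   terraform "$@"
```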

Limit -parallelism in Production

Terraform’s -parallelism flag controls how many resource operations run concurrently (the default is 10). High parallelism can cause issues, so set it explicitly per environment:

if [[ "$ENVIRONMENT" == "production" ]]; then
  terraform apply -parallelism=5 tfplan
else
  terraform apply -parallelism=10 tfplan
fi

Limiting parallelism means changes happen more slowly, giving you time to notice problems.

Per-Environment Policies: Auto-Approve for Dev, Manual Gate for Prod

Different environments have different risk profiles.

Environment | Approval Required | Auto-Approve OK | Parallelism | Policy
Dev         | No                | Yes             | 10+         | Speed matters; we accept risk
Staging     | Maybe             | No              | 5           | Simulate production, but still safe to experiment
Production  | Always            | No              | 3-5         | Every change is reviewed; destructive ops are blocked
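
These policies can live in one small lookup that every pipeline script shares, so the values are not scattered across jobs. This is a sketch with hypothetical function names; it picks 3 for production, the low end of the 3-5 range, and treats dev as the only environment where auto-approve is allowed.

```bash
#!/usr/bin/env bash
# usage: parallelism_for <environment>
# Prints the -parallelism value for the environment.
parallelism_for() {
  case "$1" in
    dev)        echo 10 ;;
    staging)    echo 5 ;;
    production) echo 3 ;;
    *)
      echo "unknown environment: $1" >&2
      return 1
      ;;
  esac
}

# usage: auto_approve_allowed <environment>
# Succeeds only for environments where skipping review is acceptable.
auto_approve_allowed() {
  [[ "$1" == "dev" ]]
}

# In an apply script:
#   terraform apply -parallelism="$(parallelism_for "$ENVIRONMENT")" tfplan
```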

Example CI/CD configuration:

# .github/workflows/terraform.yml

on: [push, pull_request]

env:
  TF_VAR_environment: ${{ github.ref == 'refs/heads/main' && 'production' || 'staging' }}

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - name: Terraform Plan
        run: |
          terraform init
          terraform plan -out=tfplan

      # Keep the exact plan for the apply job. Plan files can contain
      # secrets, so restrict who can download workflow artifacts.
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: tfplan

  apply:
    needs: plan
    # Pull requests only produce a plan for review; apply runs on push.
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    # Assumes GitHub environments named "production" and "staging", with
    # required reviewers configured on "production". The job pauses there
    # until a reviewer approves; staging runs without a manual gate.
    environment: ${{ github.ref == 'refs/heads/main' && 'production' || 'staging' }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - uses: actions/download-artifact@v4
        with:
          name: tfplan

      # Applying a saved plan never prompts, so -auto-approve is not needed.
      - name: Terraform Apply (Reviewed Plan Only)
        run: |
          terraform init
          terraform apply tfplan

AWS-Specific Risks and How to Mitigate Them

Some Terraform operations are particularly risky on AWS.

Risk 1: RDS Resource Replacement

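Some RDS changes are applied in place, while others force Terraform to replace the instance. Even in-place changes can cause downtime, depending on when they are applied:

resource "aws_db_instance" "main" {
  allocated_storage = 100  # Changed from 50
  skip_final_snapshot = false  # Safe: a final snapshot is taken before deletion
  apply_immediately = true  # Risky: applies now instead of in the maintenance window
}

If apply_immediately = true, the change happens now, not during your maintenance window. For changes that require a restart (an instance class change, for example), your database is unavailable while it restarts.

Mitigation: Review RDS changes extra carefully. Use apply_immediately = false in production.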

Risk 2: ElastiCache Node Replacement

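Changing node types in ElastiCache can cause the cache to be recreated, flushing all cached data. For Memcached clusters, the AWS provider re-creates the resource when node_type changes; for other engines, check whether the plan shows a replacement.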

resource "aws_elasticache_cluster" "main" {
  node_type = "cache.t3.micro"  # Changed from cache.t3.small
}

This is a cache replacement. Plan for cache misses and increased load on your database.

Risk 3: Security Group Rule Changes During Active Traffic

Removing a security group rule during active traffic can drop connections mid-stream.

resource "aws_security_group_rule" "app_ingress" {
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["10.0.0.0/8"]  # Removing this rule breaks connections
}

Mitigation: Make security group changes during maintenance windows, or apply them gradually (update code, apply change, verify, then roll forward).

Rollback Options When Apply Goes Wrong

If terraform apply causes problems, you have options.

Option 1: Terraform State Rollback

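If the apply left Terraform’s state file wrong or corrupted, you can use terraform state push to restore a previous state snapshot. This rewrites only Terraform’s record of the infrastructure; it does not change any real resources:

# Save current state
terraform state pull > current-state.json

# Restore previous state (from backup); -force is needed because the
# backup has an older serial than the current state
terraform state push -force previous-state.json

# Re-plan (shows what Terraform would change to match the code again)
terraform plan

This is a last resort. It’s not clean. To undo the infrastructure change itself, revert the code change and run it through the normal plan, review, and apply gates.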
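
Restoring a previous state only works if a snapshot was taken before the apply. The sketch below pulls a timestamped snapshot into a backup directory; the function name and directory layout are hypothetical. State files can contain secrets, so the backup location needs the same access controls as the state backend. If your backend already versions state (for example, an S3 bucket with versioning enabled), that may be enough on its own.

```bash
#!/usr/bin/env bash
# usage: backup_state <backup-dir>
# Pulls the current state into a timestamped file and prints its path.
backup_state() {
  local dir=$1 file
  mkdir -p "$dir" || return 1
  file="$dir/state-$(date -u +%Y%m%dT%H%M%SZ).json"
  # Write to a temp name first so a failed pull never leaves a
  # partial snapshot that looks valid.
  if terraform state pull > "$file.tmp"; then
    mv "$file.tmp" "$file" && echo "$file"
  else
    rm -f "$file.tmp"
    return 1
  fi
}

# Before applying (directory name is hypothetical):
#   snapshot=$(backup_state ./state-backups) && terraform apply tfplan
```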

Option 2: Destroy and Rebuild

For some resources, it’s faster to destroy and recreate:

terraform destroy -target=aws_instance.web
terraform apply -target=aws_instance.web

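This removes the corrupted resource and rebuilds it cleanly. On Terraform 0.15.2 and later, terraform apply -replace=aws_instance.web does the same in a single plan that can be reviewed first.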

Option 3: Manual AWS Console Changes

If Terraform is causing problems, make changes directly in the AWS console to stabilize, then fix Terraform code and re-apply:

  1. Manually fix the problem in AWS console
  2. Update Terraform code to match
  3. Run terraform import if necessary to bring it under Terraform management
  4. Run terraform plan to verify zero changes
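
Step 4 can be automated with the -detailed-exitcode flag of terraform plan, which exits with 0 when there are no changes, 1 on error, and 2 when changes are pending. The wrapper below is a sketch with a hypothetical function name.

```bash
#!/usr/bin/env bash
# Succeeds only when the plan shows no changes.
# Returns 2 when changes are pending and 1 when the plan itself fails.
verify_no_changes() {
  local code=0
  terraform plan -detailed-exitcode -input=false -no-color > /dev/null || code=$?
  case "$code" in
    0)
      echo "in sync: code matches infrastructure"
      ;;
    2)
      echo "drift: plan still shows changes" >&2
      return 2
      ;;
    *)
      echo "error: terraform plan failed" >&2
      return 1
      ;;
  esac
}

# After a manual fix and code update:
#   verify_no_changes || echo "keep reconciling before closing the incident"
```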

Tools for Safe Workflow Automation

Several tools specialize in safe Terraform workflows.

Atlantis

Atlantis is a self-hosted tool that runs terraform plan on pull requests and manages terraform apply approvals.

Workflow:

  1. Developer opens PR with infrastructure changes
  2. Atlantis runs terraform plan and posts the plan in the PR
  3. After a reviewer approves the PR, someone comments atlantis apply to trigger the apply
  4. Atlantis runs terraform apply with full audit logging

Benefits:

  • Plan output is visible in the PR
  • Developers don’t need their own AWS credentials to run apply
  • Full audit trail of who approved what

Spacelift

Spacelift is a SaaS platform (like Terraform Cloud) that adds approval workflows, policy enforcement, and drift detection.

Features:

  • Require approval before apply
  • Block dangerous operations (destroy, auto-approve)
  • Policy as Code (enforce naming conventions, required tags, etc.)
  • Drift detection and remediation

GitHub Actions with Required Approvals

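If you’re using GitHub, the built-in approval mechanism is environment protection rules: a job that targets an environment with required reviewers pauses until one of them approves. A notification step like the one below adds visibility, but it does not block the apply on its own:

- name: Create Approval Issue
  if: github.event_name == 'pull_request'
  uses: actions/github-script@v6
  with:
    script: |
      // Await the request so the step fails if the issue cannot be created
      await github.rest.issues.create({
        owner: context.repo.owner,
        repo: context.repo.repo,
        title: 'Approval Required: Infrastructure Changes',
        body: 'This PR modifies production infrastructure. Requires approval from @senior-infra-engineer'
      })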

Testing Your Safe Workflow

Before deploying to production, test your approval workflow in staging:

  1. Create a change in staging that would be dangerous (like increasing instance size)
  2. Verify the plan is created correctly
  3. Verify the approval requirement blocks apply
  4. Verify approval enables apply
  5. Verify the change applies correctly

If this process works in staging, you can trust it in production.

Conclusion: Safety Doesn’t Slow You Down

Teams often think safety and speed are opposites. In practice, they’re the same thing.

A team that adds 2 minutes of review time to each Terraform apply is slower per-change. But a team that loses 6 hours to a data deletion is much slower overall.

Start with the 3-gate model: plan, review, apply. Add approval requirements. Block dangerous commands. Test your rollback procedures. Measure cycle time and improve gradually.

Your goal: “We have never lost production data to a bad Terraform apply, and we never will.”

If building safe infrastructure practices feels like too much to tackle alone, FactualMinds helps teams implement governance frameworks that balance safety with speed. We’ve helped dozens of teams move from manual, error-prone infrastructure management to automated, auditable processes. Let’s talk about how to build safe Terraform workflows that your team can trust.

