AWS Infrastructure Drift Detection: How to Find and Fix Config Drift Before It Breaks Production

Infrastructure 9 min read

Quick summary: Infrastructure drift, when your actual AWS resources differ from what your IaC declares, causes silent failures and undermines disaster recovery. Learn how to detect drift systematically and fix it before it breaks production.


Your infrastructure code declares that your database should have automated backups. But last week, a database engineer disabled backups to speed up a test, and nobody re-enabled them. The code says one thing. AWS does another. This is infrastructure drift.

Drift is silent. It doesn’t trigger alerts. It doesn’t break deployments. It just sits there until something goes wrong—a data loss, a security incident, a failed failover—and you discover the problem wasn’t in your code, it was in the gap between your code and reality.

This guide covers how to detect drift systematically, what to do when you find it, and how to prevent it from happening again.

What Is Infrastructure Drift and Why It’s Dangerous

Infrastructure drift occurs when the actual state of your AWS resources diverges from what your Infrastructure as Code (IaC) declares. This can happen in multiple ways:

Drift types:

  1. Configuration drift — Someone changed a setting in the AWS console (security group rule, RDS backup window, S3 bucket encryption)
  2. Structural drift — A resource was created or deleted outside of IaC (manual resource provisioning)
  3. Compliance drift — Resources no longer meet your security or compliance policies (encryption disabled, public access enabled, outdated OS patch)
  4. Tag drift — Resources are missing tags required for cost allocation or compliance
  5. Version drift — Infrastructure was created with an older version of a tool and never updated

Why drift is dangerous:

  • Disaster recovery breaks — If you lose a resource and rebuild from IaC code, you won’t get the manually-configured properties
  • Compliance violations — You think you have encryption enabled because your code declares it, but the console shows it’s disabled
  • Debugging nightmares — Engineers spend hours investigating why production behaves differently than the staging environment, not realizing staging is drifted
  • Cost surprises — Someone increased instance sizes or storage manually, and nobody notices until the bill arrives
  • Security gaps — A security group rule was added manually to “temporarily” allow traffic, and it’s still there a year later

How Drift Happens

Understanding drift causes helps you prevent them.

Manual Console Changes

The most common cause: someone needs to fix something urgently, logs into the AWS console, makes the change, and “will update the code later.” They don’t.

Example: A database is slow, so someone increases the instance size from db.t3.medium to db.t3.large. A week later, someone checks the code and expects the instance to be medium. It’s not.

Emergency Patches

A security vulnerability is discovered in your RDS database. You apply the patch immediately via the console, with plans to update your IaC tomorrow. Tomorrow becomes next week, and the code still declares the old version.

Tool-Generated Resources

You use AWS SAM, CloudFormation, or a managed service that creates resources automatically. Your Terraform code doesn’t know about these resources, or it’s out of sync with what the tool actually created.

Permissions and Assumptions

You assume only IaC creates resources, so you don’t check. But a contractor spun up an EC2 instance for testing. A different team created an S3 bucket for backup storage. They’re not declared in IaC.
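One way to test that assumption is to compare what exists in the account with what Terraform state knows about. The following is a minimal sketch, assuming the state was exported with terraform show -json and that you already have a set of resource IDs gathered from AWS; the function names and the sample IDs are illustrative, not part of any tool.

```python
def managed_ids(module: dict) -> set:
    """Collect the `id` attribute of every managed resource in a module tree."""
    ids = set()
    for res in module.get("resources", []):
        # Skip data sources; Terraform reads them but does not manage them.
        if res.get("mode", "managed") == "managed" and "id" in res.get("values", {}):
            ids.add(res["values"]["id"])
    for child in module.get("child_modules", []):
        ids |= managed_ids(child)
    return ids


def unmanaged_ids(actual_ids: set, state: dict) -> set:
    """Return IDs that exist in the account but not in Terraform state."""
    root = state.get("values", {}).get("root_module", {})
    return set(actual_ids) - managed_ids(root)


# Hand-written sample in the shape of `terraform show -json` state output.
state = {"values": {"root_module": {
    "resources": [{"mode": "managed", "type": "aws_instance",
                   "values": {"id": "i-0aaa"}}],
    "child_modules": [{"resources": [{"mode": "managed", "type": "aws_s3_bucket",
                                      "values": {"id": "app-logs"}}]}],
}}}
actual = {"i-0aaa", "i-0bbb", "app-logs", "team-backups"}  # hypothetical inventory
print(sorted(unmanaged_ids(actual, state)))
```

Anything the function returns is a candidate for import, deletion, or an explicit exception.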

Time and Turnover

Your infrastructure code was written 6 months ago. Since then, AWS released new features. Your code still declares the old way of doing things. AWS now offers better defaults, and you’re missing them.

Tools for Drift Detection: The Terraform Approach

There are several ways to detect drift on AWS. The best approach depends on your infrastructure setup.

1. Terraform Plan as a Drift Detector

The simplest tool is the one you already have: terraform plan.

terraform plan -refresh=true

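Refreshing is Terraform’s default behavior, so -refresh=true only makes it explicit. During the plan, Terraform:

  1. Queries AWS for the current state of each resource
  2. Compares the actual state to your Terraform state file
  3. Compares your Terraform code to the refreshed state
  4. Shows what would change if you apply

If your code has not changed since the last apply but terraform plan still shows changes, you have drift. To see only what changed outside Terraform, without mixing in pending code changes, run terraform plan -refresh-only (Terraform 0.15.4 and later).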

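Example output (abridged):

Terraform will perform the following actions:

  # aws_security_group.api will be updated in-place
  ~ resource "aws_security_group" "api" {
      ~ ingress {
            cidr_blocks = [
              - "192.168.1.0/24",
            ]
            from_port  = 443
            to_port    = 443
          }
    }

The - marker means that applying would remove the 192.168.1.0/24 entry: it exists in AWS but not in your code, so someone added a security group rule outside Terraform.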
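Reading plan output by eye does not scale, so it can help to make it machine-readable. The sketch below assumes the plan was saved with terraform plan -out=tfplan and exported with terraform show -json tfplan, then lists every resource whose planned action is not a no-op. The function name and sample data are illustrative. On a branch where all code changes have already been applied, anything it returns points to drift.

```python
import json


def pending_changes(plan_json: str) -> list:
    """Return (address, actions) for each resource the plan would change."""
    plan = json.loads(plan_json)
    changes = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        # "no-op" and "read" entries do not modify real infrastructure.
        if actions not in (["no-op"], ["read"]):
            changes.append((rc["address"], actions))
    return changes


# Hand-written sample in the shape of `terraform show -json` plan output.
sample = json.dumps({
    "resource_changes": [
        {"address": "aws_security_group.api", "change": {"actions": ["update"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
    ]
})

for address, actions in pending_changes(sample):
    print(f"{address}: {'/'.join(actions)}")
```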

2. AWS Config for Continuous Drift Monitoring

AWS Config evaluates your resources against rules continuously. It detects configuration drift at the AWS layer.

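Example rule: Check that all EC2 instances have the required tags.

resource "aws_config_config_rule" "ec2_required_tags" {
  name = "ec2-required-tags"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }

  # Limit evaluation to EC2 instances; without a scope the rule checks
  # every resource type that REQUIRED_TAGS supports.
  scope {
    compliance_resource_types = ["AWS::EC2::Instance"]
  }

  input_parameters = jsonencode({
    tag1Key = "Environment"
    tag2Key = "Owner"
  })
}

REQUIRED_TAGS is a change-triggered managed rule: Config evaluates it whenever a recorded resource’s configuration changes, not on an hourly schedule, and reports which resources are non-compliant. The rule requires an active AWS Config configuration recorder in the account and region.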

Advantages:

  • Continuous monitoring (not just on-demand)
  • Works with any resource type that AWS Config records (Terraform-managed or not)
  • Integrates with Amazon EventBridge (formerly CloudWatch Events) and SNS for alerting

Disadvantages:

  • Limited to predefined rule types (though you can write custom rules)
  • Doesn’t know about IaC intent (it can’t tell you “your code says t3.medium but you’re running t3.large”)
  • Adds cost to your AWS bill
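To pull a rule’s findings into your own reporting, you can query Config for non-compliant resources. This is a minimal sketch, assuming a boto3 Config client is passed in (for example boto3.client("config")) and that the rule is named ec2-required-tags as above. The stub class is a stand-in so the example runs without AWS credentials; it is not part of boto3.

```python
def noncompliant_resources(config_client, rule_name: str) -> list:
    """Return (resource_type, resource_id) pairs that fail the given Config rule."""
    found = []
    kwargs = {"ConfigRuleName": rule_name, "ComplianceTypes": ["NON_COMPLIANT"]}
    while True:
        page = config_client.get_compliance_details_by_config_rule(**kwargs)
        for result in page.get("EvaluationResults", []):
            qualifier = result["EvaluationResultIdentifier"]["EvaluationResultQualifier"]
            found.append((qualifier["ResourceType"], qualifier["ResourceId"]))
        token = page.get("NextToken")
        if not token:
            return found
        kwargs["NextToken"] = token  # fetch the next page of results


class StubConfigClient:
    """Offline stand-in for boto3.client("config"); returns two canned pages."""

    def get_compliance_details_by_config_rule(self, **kwargs):
        def result(resource_id):
            return {"EvaluationResultIdentifier": {"EvaluationResultQualifier": {
                "ResourceType": "AWS::EC2::Instance", "ResourceId": resource_id}}}
        if "NextToken" not in kwargs:
            return {"EvaluationResults": [result("i-0aaa")], "NextToken": "page-2"}
        return {"EvaluationResults": [result("i-0bbb")]}


print(noncompliant_resources(StubConfigClient(), "ec2-required-tags"))
```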

3. CloudFormation Drift Detection

If you use CloudFormation (or AWS SAM, which generates CloudFormation), CloudFormation has built-in drift detection:

aws cloudformation detect-stack-drift --stack-name my-stack

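The command starts an asynchronous detection run and returns a StackDriftDetectionId. CloudFormation compares each supported resource in the stack to its definition in the template. Once the run completes, aws cloudformation describe-stack-resource-drifts --stack-name my-stack lists which resources have drifted.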

Works well if: You use CloudFormation exclusively.

Doesn’t work if: You mix CloudFormation and Terraform, or if some resources are created manually.
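Because detection is asynchronous, automating it means starting a run, polling until it finishes, and then reading the results. The sketch below assumes a boto3 CloudFormation client is passed in (for example boto3.client("cloudformation")). The stub class and its canned data are illustrative so the example runs offline, and only the first page of results is read.

```python
import time


def drifted_resources(cfn_client, stack_name: str,
                      poll_seconds: float = 5.0, max_polls: int = 60) -> list:
    """Run drift detection and return (logical_id, status) for drifted resources."""
    detection_id = cfn_client.detect_stack_drift(
        StackName=stack_name)["StackDriftDetectionId"]
    for _ in range(max_polls):
        status = cfn_client.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id)
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"drift detection for {stack_name} did not finish")
    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(status.get("DetectionStatusReason", "drift detection failed"))
    # First page only; follow NextToken for stacks with many drifted resources.
    drifts = cfn_client.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED"],
    )
    return [(d["LogicalResourceId"], d["StackResourceDriftStatus"])
            for d in drifts["StackResourceDrifts"]]


class StubCloudFormation:
    """Offline stand-in for boto3.client("cloudformation") (illustrative only)."""

    def __init__(self):
        self._polls = 0

    def detect_stack_drift(self, StackName):
        return {"StackDriftDetectionId": "example-id"}

    def describe_stack_drift_detection_status(self, StackDriftDetectionId):
        self._polls += 1
        done = self._polls >= 2  # report completion on the second poll
        return {"DetectionStatus": "DETECTION_COMPLETE" if done
                else "DETECTION_IN_PROGRESS"}

    def describe_stack_resource_drifts(self, StackName, StackResourceDriftStatusFilters):
        return {"StackResourceDrifts": [
            {"LogicalResourceId": "ApiSecurityGroup",
             "StackResourceDriftStatus": "MODIFIED"}]}


result = drifted_resources(StubCloudFormation(), "my-stack", poll_seconds=0)
print(result)
```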

4. Automated Drift Detection in CI/CD

Many teams run terraform plan on a schedule (daily or hourly) to detect drift continuously.

Example GitHub Actions workflow:

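name: Detect Infrastructure Drift

on:
  schedule:
    - cron: "0 2 * * *"  # 2 AM UTC daily

jobs:
  drift_detection:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          # Disable the wrapper script so the step sees terraform's raw exit code
          terraform_wrapper: false

      # AWS credentials must be configured before the steps below
      # (for example with aws-actions/configure-aws-credentials).

      - name: Initialize Terraform
        run: terraform init -input=false

      - name: Detect Drift
        # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending
        run: terraform plan -detailed-exitcode -input=false

      - name: Report Drift
        if: failure()
        run: |
          echo "Drift detected or plan failed; check the Detect Drift step output."
          # Send to Slack, PagerDuty, or email

By default, terraform plan exits with code 0 even when it finds changes. With -detailed-exitcode it exits with code 2 when changes are pending, so the step fails and the report step alerts your team.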
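If you run the check from your own scheduler instead of a CI workflow, the exit code is what separates drift from a broken run. This is a minimal sketch built on terraform plan -detailed-exitcode, which exits with 0 for no changes, 1 for an error, and 2 for pending changes. The status labels and function names are illustrative.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "changes_pending"}


def classify_plan_exit(code: int) -> str:
    """Map a plan exit code to a status label; unknown codes count as errors."""
    return PLAN_STATUS.get(code, "error")


def run_drift_check(workdir: str) -> str:
    """Run the plan in workdir and classify it (requires terraform on PATH)."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir, capture_output=True, text=True,
    )
    return classify_plan_exit(proc.returncode)
```

Treating "error" separately from "changes_pending" keeps expired credentials or a locked state file from being reported as drift.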

Triage: Which Drift Needs Immediate Fix?

Not all drift is equally urgent. Some drift is intentional. Some is benign. Some is a critical security issue.

Severity levels:

Level | Example | Action
Critical | Security group rule opened to 0.0.0.0/0, encryption disabled, public access enabled | Fix immediately
High | Instance type upgraded (cost impact), backup retention reduced | Fix within 1 week
Medium | Tags missing, non-critical settings changed | Fix within 1 sprint
Low | Non-impactful properties diverged (descriptions, comments) | Document and deprioritize

How to triage:

  1. Review the terraform plan output carefully
  2. Check AWS console to understand what changed and why
  3. Ask: “If this resource is destroyed and recreated from code, would it break anything?”
  4. Ask: “Does this violate security or compliance requirements?”
  5. Decide: fix the resource or update the code
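A first-pass severity can be assigned automatically before a human reviews the change. The sketch below mirrors the severity levels above with simple keyword rules. The keywords and attribute names are assumptions chosen for illustration, so replace them with the attributes your own policies care about.

```python
# Hypothetical keyword rules, ordered from most to least severe.
SEVERITY_RULES = [
    ("critical", ("0.0.0.0/0", "storage_encrypted", "publicly_accessible")),
    ("high", ("instance_class", "instance_type", "backup_retention_period")),
    ("medium", ("tags",)),
]


def classify_drift(changed_attributes: list) -> str:
    """Return the highest severity whose keywords appear in any changed attribute."""
    for severity, keywords in SEVERITY_RULES:
        if any(k in attr for attr in changed_attributes for k in keywords):
            return severity
    return "low"


print(classify_drift(["ingress.cidr_blocks: 0.0.0.0/0"]))
print(classify_drift(["tags.Owner", "instance_class"]))
print(classify_drift(["description"]))
```

Because the rules are checked in order, a change that touches both tags and instance size is reported at the higher level.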

Remediation Strategies: Import, Revert, or Update Code

When you find drift, you have three options.

Option 1: Update Your Code to Match Reality

If the drifted state is better than what your code declares, update the code.

Scenario: Someone increased the RDS instance from db.t3.medium to db.t3.large because it needed more capacity. The drift detection found it. The t3.large is correct and should stay.

Fix:

resource "aws_db_instance" "main" {
  instance_class = "db.t3.large"  # Was db.t3.medium
  # ... rest of config
}

Run terraform plan and verify zero changes. Commit and deploy.

Option 2: Revert the Resource to Match Code

If someone made a change that shouldn’t have been made, revert it.

Scenario: A database engineer disabled automated backups to speed up a test. The IaC declares backups as enabled. The backups should stay enabled.

Fix:

Option A: Revert manually in the AWS console (if it’s safe).

Option B: Use Terraform to revert:

terraform apply  # This will re-enable backups

This works if the change is safe and non-destructive. Review the plan first: if it shows a resource being replaced or destroyed, take a snapshot or backup before you apply.

Option 3: Import the Drifted State

If you want Terraform to manage something that was created manually, import it.

See our detailed guide on Terraform state management for the import workflow.

Preventing Drift: Immutable Infrastructure Patterns

The best drift is drift you never create.

Pattern 1: Destroy and Rebuild Instead of Modifying

Instead of modifying resources in place, destroy the old one and create a new one. This forces the change through code.

Example: Database version upgrade.

Instead of:

Click RDS console → Select database → Modify version → Apply

Do this:

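Update code to declare a new instance on the target version, restored from a snapshot → terraform apply creates it → cut traffic over → remove the old instance from code and apply again

This ensures the new version is declared in code. Changing engine_version on an existing aws_db_instance is normally applied as an in-place modification, so replacement has to be expressed explicitly as above. For stateful resources, always take a snapshot and read the plan before applying anything that destroys data.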

Pattern 2: Require Code Review Before Resource Changes

Establish a policy: no changes to resources without updating code first. Make this a code review requirement:

  1. Engineer identifies needed change
  2. Engineer updates IaC code
  3. Code is reviewed and merged
  4. Change is applied via CI/CD pipeline

This prevents the “I’ll update the code later” problem.

Pattern 3: Break-Glass Procedures for Emergencies

Sometimes you need to bypass this process. Define a break-glass procedure:

  1. Emergency change is made directly in console
  2. An incident ticket is created
  3. Someone with authority approves the change
  4. Code is updated within 24 hours (SLA)
  5. Change is reapplied via IaC

This allows for true emergencies while maintaining accountability.
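The 24-hour SLA in step 4 only helps if something checks it. Here is a minimal sketch of that check, assuming your incident tracker can supply the time of the console change and, once it exists, the time the matching code change was merged. The function name and timestamps are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

CODE_UPDATE_SLA = timedelta(hours=24)  # step 4 of the break-glass procedure


def sla_breached(changed_at: datetime,
                 code_updated_at: Optional[datetime],
                 now: datetime) -> bool:
    """True if the IaC update for a break-glass change missed the SLA."""
    deadline = changed_at + CODE_UPDATE_SLA
    if code_updated_at is not None:
        return code_updated_at > deadline
    # No code update yet: breached once the deadline has passed.
    return now > deadline


changed = datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc)  # example timestamp
print(sla_breached(changed, None, changed + timedelta(hours=30)))
print(sla_breached(changed, changed + timedelta(hours=3), changed + timedelta(hours=30)))
```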

Pattern 4: Immutable Instances

For EC2 instances, don’t modify them. Recreate them:

  • Instance needs a security patch? → Terminate and recreate from updated AMI
  • Instance needs a config change? → Update the config in your deployment tool (Ansible, Chef), rebuild AMI, terminate and recreate

This prevents configuration drift entirely because instances are never modified—they’re replaced.

Drift Detection as Part of Your Disaster Recovery Process

Drift detection isn’t just a nice-to-have. It’s essential for disaster recovery.

When you rebuild infrastructure from code (because a region failed, or you’re switching clouds), you’re trusting that your code is complete and current. If production has drifted from your code, the rebuild will be missing every change that was made by hand.

Test your recovery process:

  1. Run terraform plan to detect current drift
  2. Fix all drift
  3. (In a non-production environment) destroy all resources
  4. Run terraform apply to rebuild
  5. Verify the rebuilt infrastructure is identical to production

If this process fails, you’ve found a gap in your IaC. Fix it before a real disaster.
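Step 5 is easier to trust when the comparison is mechanical. The sketch below assumes you can export each environment as a map from resource address to the attributes you care about; how you gather that inventory is up to you. Attributes that legitimately differ between environments (IDs, ARNs, timestamps) should be left out. All names and sample values are illustrative.

```python
def inventory_diff(production: dict, rebuilt: dict) -> dict:
    """Compare two {resource_address: attributes} maps."""
    missing = sorted(production.keys() - rebuilt.keys())
    unexpected = sorted(rebuilt.keys() - production.keys())
    mismatched = sorted(
        addr for addr in production.keys() & rebuilt.keys()
        if production[addr] != rebuilt[addr]
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


# Hypothetical inventories for illustration.
production = {
    "aws_db_instance.main": {"instance_class": "db.t3.large",
                             "backup_retention_period": 7},
    "aws_security_group.api": {"ingress_ports": [443]},
}
rebuilt = {
    "aws_db_instance.main": {"instance_class": "db.t3.medium",
                             "backup_retention_period": 7},
}
print(inventory_diff(production, rebuilt))
```

An empty result in all three lists is the signal that the rebuild matches production for the attributes you chose to compare.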

Conclusion: Drift Is a Visibility Problem

Infrastructure drift isn’t a technology problem—it’s a visibility problem. Teams that don’t detect drift lose the ability to trust their IaC. Teams that detect drift continuously stay in control.

Start with terraform plan on a schedule. Add AWS Config for compliance monitoring. Establish policies around emergency changes. Test your disaster recovery process regularly.

If you’re managing complex AWS infrastructure and struggling with drift—or if you’re concerned your current disaster recovery wouldn’t actually work—we can help. At FactualMinds, we help teams establish infrastructure governance practices that catch drift early and prevent silent failures. Whether you’re building IaC from scratch or auditing an existing setup, we ensure your infrastructure is what it claims to be.

