Detecting and Remediating State Drift in Terraform

When you deploy infrastructure with Terraform, it records state that maps each resource instance to its configuration. When the configuration is updated, Terraform compares it against the current state and creates a plan to create, update or delete resources in response to the change.

Infrastructure is constantly changing, and manual changes are sometimes needed for quick fixes or tests. However, such changes build up quickly, and before you know it the resources have diverged significantly from their original definition.

In this blog post we'll explore different examples of drift, how to detect it and some approaches to reconciliation.

Detecting drift

The terraform plan command reads the current settings of each managed remote object and refreshes Terraform's view of the state. This is then compared with the configuration, and a set of actions is proposed. The command won't apply any changes, so you can use it to check whether any unexpected actions are planned.

For instance, if we have a load balancer deployed with Terraform and at some point Terraform plans to change some of its properties even though the definition hasn't changed, we can infer that drift has been introduced.

First example

Consider an Azure network security group (NSG) and a network security rule. Let's start with the initial definition and deploy it. Here we have a resource group to deploy our resources into, an NSG and a security rule.

resource "azurerm_resource_group" "example" {
  name     = "example-rg"
  location = "UK South"
}

resource "azurerm_network_security_group" "example" {
  name                = "example-nsg"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
}

resource "azurerm_network_security_rule" "example" {
  name                        = "ips-allowed"
  priority                    = 100
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "*"
  source_address_prefixes     = ["10.0.0.5", "10.0.0.6"]
  destination_address_prefix  = "*"
  resource_group_name         = azurerm_resource_group.example.name
  network_security_group_name = azurerm_network_security_group.example.name
}

Introduce drift

We'd like to test inbound connectivity from the IP 10.0.0.7, so we decide to manually add it to the security rule in the Azure portal. When we next run terraform plan, it shows us what it intends to do:

Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform since the last "terraform apply":

  # azurerm_network_security_group.example has been changed

// some code omitted 

Terraform will perform the following actions:

  # azurerm_network_security_rule.example will be updated in-place
  ~ resource "azurerm_network_security_rule" "example" {
        id                                         = "/subscriptions/6aff0bdf-7c12-4850-8f7e-7dd4633b4110/resourceGroups/ar-test/providers/Microsoft.Network/networkSecurityGroups/example-nsg/securityRules/ips-allowed"
        name                                       = "ips-allowed"
      ~ protocol                                   = "TCP" -> "Tcp"
      ~ source_address_prefixes                    = [
          - "10.0.0.7",
            # (2 unchanged elements hidden)
        ]
        # (13 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Because 10.0.0.7 isn't defined in the configuration, Terraform treats it as drift and plans to reconcile it by removing the address from the rule.

Resolve drift

In this case the resolution is simply to update the security rule definition to match the remote infrastructure. Let's add the 10.0.0.7 IP address to the source_address_prefixes argument and see what Terraform now plans:

resource "azurerm_network_security_rule" "example" {
  # ... other arguments unchanged ...
  source_address_prefixes     = ["10.0.0.5", "10.0.0.6", "10.0.0.7"]
  # ... other arguments unchanged ...
}

> terraform plan

No changes. Your infrastructure matches the configuration.

Brilliant, we've aligned the Terraform configuration with the infrastructure fairly smoothly.
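
If out-of-band edits to an attribute are expected and acceptable, one alternative is to tell Terraform to stop reconciling that attribute. The sketch below assumes the security rule from this example and uses the lifecycle meta-argument ignore_changes. Whether this fits depends on who should own the list of allowed IPs, so treat it as an option rather than a recommendation.

```hcl
resource "azurerm_network_security_rule" "example" {
  name                        = "ips-allowed"
  priority                    = 100
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "*"
  source_address_prefixes     = ["10.0.0.5", "10.0.0.6"]
  destination_address_prefix  = "*"
  resource_group_name         = azurerm_resource_group.example.name
  network_security_group_name = azurerm_network_security_group.example.name

  lifecycle {
    # Assumption: the address list may legitimately be edited outside of Terraform.
    # The value above is still used when the rule is created, but later
    # differences in this one attribute are ignored when planning updates.
    ignore_changes = [source_address_prefixes]
  }
}
```

The trade-off is that Terraform will no longer undo manual edits to that attribute, so the configuration stops being the single source of truth for it.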

Not all drift can be fixed by updating the definition, however. Sometimes resources need to be recreated to reconcile drift. Let's take a look at an example that explores this.

Second example

Consider an Azure virtual network (VNet). Again, let's define and deploy it, this time in the same example resource group.

resource "azurerm_virtual_network" "example" {
  name                = "example-vnet"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
}

Introduce drift

Let's say some new resource deployment standards have been introduced, and one of them requires engineers to add the environment as a suffix to their resource names.

In a hurry to ensure compliance, our engineer manually recreates the VNet with the required name, later updates the Terraform definition to match, and assumes all is well.

Our VNet resource now looks like this, with the name argument updated:

resource "azurerm_virtual_network" "example" {
  name                = "example-vnet-prod"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
}

When we next run terraform plan, it shows us what it intends to do:

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # azurerm_virtual_network.example will be created
  + resource "azurerm_virtual_network" "example" {
      + address_space       = [
          + "10.0.0.0/16",
        ]
      + dns_servers         = (known after apply)
      + guid                = (known after apply)
      + id                  = (known after apply)
      + location            = "uksouth"
      + name                = "example-vnet-prod"
      + resource_group_name = "example-rg"
      + subnet              = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.

That doesn't look right. The resource already exists with the same properties, so Terraform shouldn't need to create it again.

Typically this implies the resource is not managed by Terraform and is therefore not mapped in the state file. Another engineer sees this and decides to run terraform import, which brings existing infrastructure into the state and under Terraform management. Once import has run, we would expect Terraform to have no changes to plan. In this case, however, the import fails with the output below:

> terraform import 'azurerm_virtual_network.example' /subscriptions/0abc1def-2g34-5678-9h0i-1jk2345l6789/resourceGroups/example-rg/providers/Microsoft.Network/virtualNetworks/example-vnet-prod

Error: Resource already managed by Terraform

Terraform is already managing a remote object for azurerm_virtual_network.example.
To import to this address you must first remove the existing object from the
state.

This error occurs because the original resource, deployed under the same resource address, has been manually deleted from Azure but not from the Terraform state.

Resolve drift

Let's now have a look at what's recorded in the state with state list:

> terraform state list

azurerm_resource_group.example
azurerm_virtual_network.example

We can describe the resource to view its properties in detail with state show:

> terraform state show 'azurerm_virtual_network.example'

# azurerm_virtual_network.example:
resource "azurerm_virtual_network" "example" {
    address_space           = [
        "10.0.0.0/16",
    ]
    dns_servers             = []
    flow_timeout_in_minutes = 0
    guid                    = "ab12c345-6789-0123-4d5e-6f78ghi90123"
    id                      = "/subscriptions/0abc1def-2g34-5678-9h0i-1jk2345l6789/resourceGroups/example-rg/providers/Microsoft.Network/virtualNetworks/example-vnet"
    location                = "uksouth"
    name                    = "example-vnet"
    resource_group_name     = "example-rg"
    subnet                  = []
}

Upon inspection we can see that the old resource deployed with Terraform is still in the state: the name attribute and the id both hold the old value.

To resolve the drift, we remove the old resource from the state with state rm and then import the recreated resource into the state:

> terraform state rm 'azurerm_virtual_network.example'

Removed azurerm_virtual_network.example
Successfully removed 1 resource instance(s).

> terraform import 'azurerm_virtual_network.example' /subscriptions/0abc1def-2g34-5678-9h0i-1jk2345l6789/resourceGroups/example-rg/providers/Microsoft.Network/virtualNetworks/example-vnet-prod

azurerm_virtual_network.example: Importing from ID "/subscriptions/0abc1def-2g34-5678-9h0i-1jk2345l6789/resourceGroups/example-rg/providers/Microsoft.Network/virtualNetworks/example-vnet-prod"...
azurerm_virtual_network.example: Import prepared!
  Prepared azurerm_virtual_network for import
azurerm_virtual_network.example: Refreshing state... [/subscriptions/0abc1def-2g34-5678-9h0i-1jk2345l6789/resourceGroups/example-rg/providers/Microsoft.Network/virtualNetworks/example-vnet-prod]

Import successful!

The resources that were imported are shown above. These resources are now in
your Terraform state and will henceforth be managed by Terraform.

Now the next time our engineer runs plan they see:

> terraform plan

No changes. Your infrastructure matches the configuration.

Great! It took a few more steps and some scrutiny this time, but the drift is now resolved and the Terraform state is aligned with the remote infrastructure.
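
Since Terraform 1.5, the import step can also be written in configuration rather than run as a CLI command. This is a minimal sketch, assuming the stale entry has already been removed with state rm and reusing the placeholder resource ID from the import command in this example:

```hcl
# Declarative import (requires Terraform 1.5 or later).
import {
  # Address of the resource block that should manage the existing VNet.
  to = azurerm_virtual_network.example

  # Azure resource ID of the manually recreated VNet (placeholder subscription ID).
  id = "/subscriptions/0abc1def-2g34-5678-9h0i-1jk2345l6789/resourceGroups/example-rg/providers/Microsoft.Network/virtualNetworks/example-vnet-prod"
}
```

The import then appears in the output of terraform plan and is carried out by terraform apply, so it can go through the usual review process. The block can be deleted once the apply has completed.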

Bonus tip

In the virtual network example, the resource block declares the resource type azurerm_virtual_network and the local name example. If we update the local name to something more meaningful, such as vnet-prod, Terraform proposes the following changes:

> terraform plan

  # azurerm_virtual_network.example will be destroyed
  - resource "azurerm_virtual_network" "example" {
      - address_space           = [
          - "10.0.0.0/16",
        ] -> null
      - dns_servers             = [] -> null
      - flow_timeout_in_minutes = 0 -> null
      - guid                    = "ab12c345-6789-0123-4d5e-6f78ghi90123" -> null
      - id                      = "/subscriptions/0abc1def-2g34-5678-9h0i-1jk2345l6789/resourceGroups/example-rg/providers/Microsoft.Network/virtualNetworks/example-vnet-prod" -> null
      - location                = "uksouth" -> null
      - name                    = "example-vnet-prod" -> null
      - resource_group_name     = "example-rg" -> null
      - subnet                  = [] -> null
      - tags                    = {} -> null

      - timeouts {}
    }

  # azurerm_virtual_network.vnet-prod will be created
  + resource "azurerm_virtual_network" "vnet-prod" {
      + address_space       = [
          + "10.0.0.0/16",
        ]
      + dns_servers         = (known after apply)
      + guid                = (known after apply)
      + id                  = (known after apply)
      + location            = "uksouth"
      + name                = "example-vnet-prod"
      + resource_group_name = "example-rg"
      + subnet              = (known after apply)
    }

Plan: 1 to add, 0 to change, 1 to destroy.

Terraform plans to re-create the resource not because any arguments changed (they are identical) but because the resource address changed.

To resolve this, we can utilise the state mv command, which makes Terraform track the existing resource under the updated resource address.

> terraform state mv 'azurerm_virtual_network.example' 'azurerm_virtual_network.vnet-prod'

Move "azurerm_virtual_network.example" to "azurerm_virtual_network.vnet-prod"
Successfully moved 1 object(s).

> terraform plan

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.
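
As an alternative to state mv, Terraform 1.1 and later can record the same rename in configuration with a moved block, which avoids editing the state by hand. This is a minimal sketch using the two addresses from this example:

```hcl
# The resource block under its new local name.
resource "azurerm_virtual_network" "vnet-prod" {
  name                = "example-vnet-prod"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
}

# Tells Terraform that the object tracked at the old address
# should now be tracked at the new one (requires Terraform 1.1 or later).
moved {
  from = azurerm_virtual_network.example
  to   = azurerm_virtual_network.vnet-prod
}
```

With this in place, the next plan should report the move rather than a destroy and a create, and anyone else working from the same configuration picks up the rename on their next apply.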

Conclusion

We've looked at different examples of drift, how to detect it using terraform plan and how to remediate by updating configuration and manipulating the state file.

The commands we used were import, state list, state show, state rm and state mv. Take a backup of the state file before working on it in case you need to revert.

Sometimes it's OK to make manual changes for testing purposes, such as adding IP addresses to a security rule. However, if you intend to keep those changes, make sure to commit them to your IaC. For changes to immutable properties such as the name, the resource has to be recreated; when you do this, stick to the Terraform workflow and avoid manual changes.