Handling AzAPI Resource Tainting due to 504 Gateway Timeouts in Terraform

When working with Azure infrastructure through Terraform, it’s common to encounter edge cases where the state of a resource becomes unclear. This is especially true when using the azapi provider to manage resources not yet supported by the official azurerm provider. In this article, I’ll walk through a recent issue I encountered involving a tainted resource, the root cause behind it, and how to work around such situations effectively.

Background: When AzureRM Isn’t Enough

In my scenario, I needed to provision a resource under the Microsoft.CloudTest namespace, which isn’t currently supported by the AzureRM provider. The workaround was to use the more flexible azapi provider, which allows direct interaction with Azure REST APIs.

Here’s the resource definition I used:

resource "azapi_resource" "linux_build_agent2" {
  name      = "it-foo-linux-build-agent"
  type      = "Microsoft.CloudTest/images@2020-05-07"
  parent_id = azurerm_resource_group.example.id
  location  = "westus2"
  schema_validation_enabled = false

  # Additional configuration here...
}

I had to take advantage of AzAPI’s option to disable schema validation, since even ARM isn’t sure what the resource schema should be. Provisioning this resource took a considerable amount of time. Eventually, Terraform threw an error:

Error: Failed to create/update resource
...
RESPONSE 504: 504 Gateway Timeout
ERROR CODE: GatewayTimeout
{
  "error": {
    "code": "GatewayTimeout",
    "message": "The gateway did not receive a response from 'Microsoft.CloudTest' within the specified time period."
  }
}

Understanding the 504 Gateway Timeout

This error indicates that the Azure Management Gateway did not receive a timely response from the backend service — in this case, Microsoft.CloudTest. Importantly, this doesn’t mean the operation failed. It simply means Terraform didn’t get confirmation of success or failure, which puts the resource in an ambiguous state. Terraform’s response to this is to taint the resource. Tainting marks the resource for destruction and recreation on the next terraform apply, because Terraform cannot safely assume its current state aligns with reality.
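
To see which resources Terraform currently has flagged, one option is to filter the state listing. The helper below is only a sketch: the function name list_tainted is hypothetical, and it assumes the human-readable output of terraform show marks tainted resources with the text "(tainted)".

```shell
# Print the state entries that Terraform currently marks as tainted.
# Assumption: `terraform show` labels such resources with "(tainted)".
# Returns grep's exit status, so it is non-zero when nothing is tainted.
list_tainted() {
  terraform show -no-color | grep -F '(tainted)'
}

# Example call, run from the Terraform working directory:
#   list_tainted
```

Because the function passes grep's exit status through, it can also be used as a condition in a script, for example to stop a pipeline before an apply that would replace something.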

Dealing with the Tainted Resource

In my case, even though Azure did successfully provision the resource (as I later confirmed through the portal and the Azure CLI), Terraform remained unaware due to the timeout. As a result, it continued to treat the resource as tainted:

# azapi_resource.linux_build_agent2 is tainted, so must be replaced
-/+ resource "azapi_resource" "linux_build_agent2" {
    ~ id     = ".../Microsoft.CloudTest/images/it-foo-linux-build-agent" -> (known after apply)
    ~ output = {} -> (known after apply)
    # other attributes unchanged
}

Attempting to import the existing resource didn’t work either: the address was already in the state, marked as tainted, and Terraform will not import into an address it already manages.

Resolution: Manually Untainting the Resource

To fix this, I used the following command:

terraform untaint azapi_resource.linux_build_agent2

This explicitly told Terraform to stop treating the resource as needing replacement, allowing it to continue managing the resource as-is.
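
Untainting is only safe when the resource really exists, so it can help to check Azure first. The following is a minimal sketch, assuming the Azure CLI is logged in to the right subscription. The function name untaint_if_exists is hypothetical, and the caller must supply the full resource ID (the ID in the plan output above is truncated, so it is not reproduced here).

```shell
# Untaint ADDRESS only if RESOURCE_ID can be read back from Azure.
# Usage (hypothetical variable holding the full resource ID):
#   untaint_if_exists azapi_resource.linux_build_agent2 "$RESOURCE_ID"
untaint_if_exists() {
  local address="$1" resource_id="$2"

  # `az resource show` exits non-zero if the lookup fails for any reason
  # (resource missing, not logged in, and so on). Some resource types may
  # also need an explicit --api-version argument.
  if az resource show --ids "$resource_id" --output none; then
    terraform untaint "$address"
    # A follow-up plan should no longer propose a replacement.
    terraform plan
  else
    echo "Could not confirm $resource_id in Azure; leaving $address tainted." >&2
    return 1
  fi
}
```

If the lookup fails, the function leaves the state untouched, so the next apply still replaces the resource as Terraform originally intended.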

Mitigating Future Timeouts

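To reduce the chances of similar issues in the future, it’s worth configuring appropriate timeouts for long-running operations. Like many Terraform resources, azapi_resource accepts a timeouts block inside the resource definition:

resource "azapi_resource" "linux_build_agent2" {
  # ... arguments as shown earlier ...

  timeouts {
    create = "240m"
    read   = "15m"
  }
}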

However, in my case, even setting a generous 4-hour creation timeout didn’t help. The timeouts block only controls how long the provider waits for the operation; it has no effect on how long the Azure gateway waits for the backend service, which is where the 504 came from. The issue appeared to be a transient problem within Azure’s backend or its internal handling of the Microsoft.CloudTest API. Since Terraform does not automatically retry in the event of a 504 timeout, this becomes a potential point of fragility for long-running or less mature Azure services.
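
Because a 504 leaves the outcome unknown, blindly re-running apply risks replacing a healthy resource. One way to surface this in a pipeline is a wrapper that detects the error code and stops for a manual check. This is a sketch only: the function name and the exit code 2 are arbitrary choices, -auto-approve is used because the sketch targets non-interactive runs, and it assumes the error text contains GatewayTimeout as in the output shown earlier.

```shell
# Run `terraform apply` and return 2 if it failed with a gateway timeout,
# so the caller can pause for a manual check instead of re-applying.
apply_and_flag_504() {
  local output status=0

  # Capture stdout and stderr together so the error text can be inspected.
  output=$(terraform apply -input=false -auto-approve -no-color 2>&1) || status=$?
  printf '%s\n' "$output"

  if [ "$status" -ne 0 ] && printf '%s\n' "$output" | grep -q 'GatewayTimeout'; then
    echo "Apply ended with a 504; check Azure before re-applying or untainting." >&2
    return 2
  fi
  return "$status"
}
```

Any other failure is passed through with Terraform's own exit status, so existing error handling in the pipeline keeps working.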

Conclusion

When using the AzAPI provider to manage less common or experimental Azure services, you’re likely to encounter challenges not present in more mature provider resources. Timeout errors and tainted resources are a direct result of Terraform’s cautious approach to infrastructure state. While this behavior is rooted in safety, it can create confusion when the infrastructure has, in fact, been successfully deployed.

Understanding the implications of tainted resources, knowing when to manually untaint, and applying generous timeout values can help maintain workflow continuity. Still, it’s important to recognize that some errors — like transient 504s — may simply require manual intervention and a bit of patience.

Have you faced similar timeout issues with AzAPI or other Terraform providers?