What are the common things to watch out for with Terraform?

2020-10-04|By Kamil Szczygieł|Code

Delivering high-quality Terraform code is something we are proud of doing at sysdogs on a daily basis. Over the past years, we have gained a lot of knowledge and experience doing that for customers all over the world and across many industries, doing our best to support other teams with high-quality infrastructure. This article is intended as a comprehensive list of things that should be avoided. It is a completely blameless post, so treat it as a checklist of things to keep in mind while writing Terraform code.


  • Use stable releases. Always. And wait for fixes before upgrading.
  • Combine tools. Do not try to Terraform everything.
  • Do not treat examples as well-defined solutions.
  • Automate. Do not plan or apply manually.
  • Do not do things manually. Never.
  • Test both additive and fresh-start changes.
  • Separate modules wisely.
  • Use tools for static code analysis and for checking resource quality and security.
  • Use functions accordingly.
  • Do not overkill modules with variables.
  • Manage provider versions.

If you want to know why our engineering team looks like a bunch of magicians wearing hats while writing Terraform code, sign up for our newsletter, be an early-bird, and get notified about all the things that are gonna happen in November this year!

Use an early-bird release

Three years ago, we were building cloud infrastructures with Terraform 0.11. We waited literally years for Terraform 0.12, which brought for expressions, dynamic blocks, and an HCL revamp, but it did not deliver the promised iteration over modules, which only arrived with Terraform 0.13. After 0.13 was released, people faced a lot of instability and crashes. Never, ever upgrade to the first release of a new minor version. Always wait for stability and initial fixes.
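
One way to make the "wait for fixes" rule explicit is to constrain the Terraform version in the configuration itself. A minimal sketch, assuming the team has settled on a patched 0.13 release; the exact version numbers are illustrative, not a recommendation:

```hcl
terraform {
  # Accept only patch releases of 0.13 from 0.13.4 onwards (illustrative
  # numbers). The initial 0.13.0 release and any 0.14.x are rejected.
  required_version = ">= 0.13.4, < 0.14.0"
}
```

With a constraint like this, Terraform refuses to run on a binary outside the range, so an upgrade becomes a deliberate, reviewed change to the code.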

Not everything is Terraform

Sadly, engineers still try to Terraform things that should not be Terraformed. Very complex virtual machine configuration or bootstrap scripts are not what Terraform is designed for. It is important to marry Terraform with other Infrastructure as Code software that is designed for this task, like Ansible, Salt or Chef.

Example as production

Terraform documentation has pretty nice examples of resource usage in the Examples section. We have noticed that many engineers treat this section as production-ready, ready-to-go code. Unfortunately, those examples are mostly insecure and show only a very basic implementation which is, in most cases, not the best one, either technically or from the perspective of scale.

Example of the resource documentation

resource "aws_iam_role" "example" {
  name = "example"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "vpc-flow-logs.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

Example of Terraform resource reference from official docs (above) and the same role rewritten with the aws_iam_policy_document data source (below)

data "aws_iam_policy_document" "policy_assume_role" {
  statement {
    effect = "Allow"
    actions = [
      "sts:AssumeRole"
    ]
    principals {
      type = "Service"
      identifiers = [
        "vpc-flow-logs.amazonaws.com"
      ]
    }
  }
}

resource "aws_iam_role" "example" {
  name = "example"

  assume_role_policy = data.aws_iam_policy_document.policy_assume_role.json
}

Employ automation

Unfortunately, we still see operations and development teams that do not use automation at all. They run plan and apply from their local workstations, do not commit the code to the repository, and even... keep the state file with secrets in git repositories. This creates a lot of security risks and misunderstandings among team members. There is no audit log and no way to verify the history of changes. It often causes confusion when someone runs a plan and sees infrastructure changes that are not yet present in the common codebase.
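
Keeping the state file out of git usually means moving it to a remote backend that the pipeline can reach. A minimal sketch, assuming AWS and the S3 backend as available in the Terraform versions discussed here; the bucket, key and DynamoDB table names are hypothetical and the resources are assumed to exist already:

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"      # hypothetical bucket name
    key            = "production/terraform.tfstate" # hypothetical state path
    region         = "eu-central-1"
    encrypt        = true                           # server-side encryption of the state object
    dynamodb_table = "example-terraform-locks"      # hypothetical table used for state locking
  }
}
```

The pipeline, rather than a workstation, can then run plan and apply against this shared, locked state.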

Avoid doing things manually

Well. What else can I say? 🙂 We are fine with testing things manually and doing research and development on sandbox environments. But making a manual change in production in the twenty-first century is difficult to accept, if you'd like to keep high quality standards. Going even further, positioning yourself as "infrastructure as code adopter" while maintaining manual changes inside the infrastructure seems a little off.

Is it worth the time?

Source: https://imgs.xkcd.com/comics/is_it_worth_the_time.png

Additive and full-start

We have noticed that engineers mostly test only additive changes in Terraform. It is the easiest and most obvious approach: deliver exactly what is expected of us. But it is extremely important to test applies against a clean state too, as very often there are cyclic dependencies between states that prevent a fresh environment from being created. It is crucial to plan for disaster recovery procedures and worst-case scenarios.

No modules separation

One of the most common problems we've noticed while auditing our customers' code is a lack of responsibility enforcement and module separation. They duplicate resources between modules without consistent responsibility principles. Eliminating code repetition is only one of the many advantages of the modular approach. Another is the ability to enforce code standards and security compliance by, for example, forcing encryption on resources that support it.
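
As a sketch of that kind of enforcement, a bucket module can hard-code encryption instead of exposing it as an option. The module path and variable name below are hypothetical, and the inline server_side_encryption_configuration block assumes a 3.x AWS provider; later major versions of the provider move this setting to a separate resource.

```hcl
# modules/encrypted-bucket/main.tf (hypothetical module path)

variable "name" {
  type        = string
  description = "Name of the bucket."
}

resource "aws_s3_bucket" "this" {
  bucket = var.name

  # Encryption is deliberately not configurable: every bucket created
  # through this module is encrypted at rest with SSE-KMS.
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "aws:kms"
      }
    }
  }
}
```

Callers of such a module cannot create an unencrypted bucket by accident, because the option simply does not exist in its interface.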

Modules dependency hell

On the other hand, we've also noticed customers that, at some point, end up deep in dependency hell. They made their modules so small that upgrading a specific module across the whole environment's source code is a living nightmare, because tons of dependent modules have to be upgraded to make a small change to one resource. By the way, we solved the problem of verifying module versions across the whole Terraform state with tooling and made it open source for everyone to use: https://github.com/sysdogs/tfmodvercheck.

An example of dependency hell

Source: https://webwereld.nl/cmsdata/features/3771627/yu082e99pupei7v9_thumb800.jpg

Poor code quality

"Keep it simple, stupid" and "Do not repeat yourself" are principles we strongly believe in for every Infrastructure as Code toolset.

Make the code look good. Make it an art. Make it nice and clean. Make it consistent. Make it understandable at a first glance.

No tests

Automate all the things. Test all the things. While auditing our customers' code, we noticed the following problems: Terraform is still not treated as real code, and Terraform code is not tested the way real code should be. A lack of static code analysis, deep inspection analysis and unit tests, together with poor code quality, may cause a lot of problems with infrastructure development in the future, and it is extremely important to keep this in mind while writing Terraform code. With a proper continuous integration pipeline in place, you can save a lot of time by catching non-functional code or possible security vulnerabilities before you even apply them to the real environment.

Improper functions usage

We have noticed that engineers tend to use functions everywhere, even where they are not necessary. A good example is using element(list, count.index) instead of plain list[count.index] indexing when the "wrap around" mechanism is not required.
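
The difference can be sketched as follows. The variable names and CIDR ranges are hypothetical, and the two resources are alternatives shown side by side for comparison, so a real configuration would keep only one of them:

```hcl
variable "vpc_id" {
  type = string
}

variable "subnet_cidrs" {
  type    = list(string)
  default = ["10.0.1.0/24", "10.0.2.0/24"]
}

# Unnecessary function call: count equals the list length, so the index
# never runs past the end and the wrap-around of element() is never used.
resource "aws_subnet" "with_element" {
  count      = length(var.subnet_cidrs)
  vpc_id     = var.vpc_id
  cidr_block = element(var.subnet_cidrs, count.index)
}

# Plain indexing expresses the same thing more directly.
resource "aws_subnet" "with_index" {
  count      = length(var.subnet_cidrs)
  vpc_id     = var.vpc_id
  cidr_block = var.subnet_cidrs[count.index]
}
```

Plain indexing also fails loudly when the index is out of range, whereas element() silently wraps around to the start of the list, which can hide a mistake in the count.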

Variables overkill

Modules and variables are great abstraction layers, but a double-edged sword too. We've seen hundreds of modules that configure literally each resource argument with variables. In the end, you're stuck with hundreds of variables for the module that's only responsible for provisioning S3 bucket resources. Same applies to outputs. Output only things that you will leverage in a different place of the codebase.
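
A narrow module interface might look like the sketch below. The module path, variable names and the choice of which settings to fix are hypothetical; the point is only that the module decides most arguments itself and exposes what callers actually need:

```hcl
# modules/log-bucket/main.tf (hypothetical module path)

variable "name" {
  type        = string
  description = "Name of the bucket."
}

variable "tags" {
  type        = map(string)
  description = "Tags applied to the bucket."
  default     = {}
}

resource "aws_s3_bucket" "this" {
  bucket        = var.name
  force_destroy = false # decided by the module, not exposed as a variable
  tags          = var.tags
}

# Only the attribute that other parts of the codebase consume is exported.
output "arn" {
  description = "ARN of the bucket, for use in IAM policies elsewhere."
  value       = aws_s3_bucket.this.arn
}
```

If a new requirement appears, a variable or output can be added at that point, instead of every argument being mirrored up front.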

No provider management

Most of our customers, when they start working with Terraform, do not pin provider versions. This can lead to unexpected consequences if the provider introduces breaking changes and we miss that one important delete inside the Terraform plan.

Unpinned provider version

provider "aws" {
  region = "eu-central-1"
}

Pinned provider version

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}
