Delivering high-quality Terraform code is something we are proud of doing at sysdogs on a daily basis. Over the past years we have gained a lot of knowledge and experience doing that for a variety of customers all over the world, from a variety of industries, doing our best to support other teams with best-quality infrastructure. The intention of this article is to be a comprehensive list of things that should be avoided. This is a completely blameless post; treat it as a checklist of things you should not be doing while writing Terraform code.


TL;DR

  • Use stable releases. Always. And wait for fixes before upgrading.
  • Combine tools. Do not try to Terraform everything.
  • Do not treat examples as well-defined solutions.
  • Automate. Do not plan or apply manually.
  • Do not do things manually. Ever.
  • Test both additive and fresh-start changes.
  • Separate modules wisely.
  • Use tools to statically test code, resource quality and security.
  • Use functions accordingly.
  • Do not overload modules with variables.
  • Manage provider versions.

If you want to know why our engineering team looks like magicians wearing hats while writing Terraform code, sign up for our newsletter and be an early bird for the things coming in November this year!


Use an early-bird release

Three years ago we were building cloud infrastructures with Terraform 0.11.

We waited literally years for Terraform 0.12, which brought for expressions, dynamic blocks and an HCL revamp, but we did not get the promised iteration over modules, which was released with Terraform 0.13. After the 0.13 release, people faced a lot of instability and crashes. Never, ever upgrade to the first release of a new minor version; always wait for stability and the initial fixes.
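
A lightweight way to enforce this is to pin the Terraform core version the codebase has been tested against. The snippet below is only a sketch; the exact constraint is illustrative:

terraform {
  # Stay on a patch series that has already received bug fixes;
  # the constraint is an example, not a recommendation of a specific version.
  required_version = "~> 0.13.4"
}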


Not everything is Terraform

Sadly, engineers still try to Terraform things that should not be Terraformed. Complex virtual machine configuration and bootstrap scripts are not something Terraform is designed for. It is important to marry Terraform with other Infrastructure as Code software designed for that task, like Ansible (see On code quality with Ansible), Salt or Chef.
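
As a rough sketch of that split (the variable names and the bootstrap script are hypothetical), Terraform creates the machine and injects only a thin bootstrap, while the detailed configuration stays in a tool built for it:

resource "aws_instance" "app" {
  ami           = var.ami_id        # hypothetical variable
  instance_type = "t3.micro"

  # Keep user_data minimal: just enough to let Ansible/Salt/Chef take over.
  user_data = file("${path.module}/bootstrap.sh")

  tags = {
    Role = "app"   # configuration management can target instances by this tag
  }
}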


Example as production

The Terraform documentation has pretty nice examples of resource usage in the Examples sections. We have noticed engineers treating these sections as production-ready code. Unfortunately, the examples are mostly insecure and present the most basic implementation, which in most cases is not the best one from a technical and scaling perspective.

resource "aws_iam_role" "example" {
  name = "example"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "vpc-flow-logs.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}
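
The same trust policy is better expressed with the aws_iam_policy_document data source, so the JSON is generated by Terraform rather than pasted as a heredoc:
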
data "aws_iam_policy_document" "policy_assume_role" {
  statement {
    effect = "Allow"
    actions = [
      "sts:AssumeRole"
    ]
    principals {
      type = "Service"
      identifiers = [
        "vpc-flow-logs.amazonaws.com"
      ]
    }
  }
}

resource "aws_iam_role" "example" {
  name = "example"

  assume_role_policy = data.aws_iam_policy_document.policy_assume_role.json
}

No automation

Unfortunately, we still see operations and development teams that do not use automation at all. They run plans and applies from their local workstations, do not commit the code to the repository and even keep the state file, secrets included, in git repositories. There is no audit log and no way to verify the change history. It also causes confusion among team members when they run a plan and see infrastructure changes that are not yet present in the common codebase.
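
At the very least, keep the state out of git by using a remote backend with locking. A minimal sketch, assuming hypothetical bucket and table names:

terraform {
  backend "s3" {
    bucket         = "example-terraform-state"   # hypothetical bucket
    key            = "prod/terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "example-terraform-locks"   # hypothetical lock table
    encrypt        = true
  }
}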


Doing things manually

Well, what else can I say? 🙂 We are fine with testing things manually and doing research and development in sandbox environments. But making a manual change in production in the twenty-first century is difficult to accept if we want to keep high quality standards. Going even further, positioning yourself as an “infrastructure as code adopter” while maintaining manual changes inside the infrastructure seems a little off.

Figure. XKCD: Is It Worth the Time?
(Source: https://xkcd.com/1205/)

Additive and fresh-start

We noticed that engineers mostly test additive changes in Terraform. It is the easiest and most natural first approach: deliver what is expected of us. It is, however, extremely important to also test clean-state applies, as there are very often cyclic dependencies between states that will not allow you to create a fresh environment from scratch. It is important to plan disaster recovery procedures and worst-case scenarios.


No modules separation

One of the most common problems we noticed while auditing our customers’ code is a lack of module responsibility and separation. Resources are duplicated between modules without consistent responsibility principles. Eliminating code repetition is only one of the advantages of modules. Another is the ability to enforce code standards and security compliance by, for example, forcing encryption on resources that support it.
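
As a minimal sketch with hypothetical variable names, a shared module can make encryption non-negotiable for every resource created through it:

resource "aws_ebs_volume" "this" {
  availability_zone = var.availability_zone
  size              = var.size
  encrypted         = true            # enforced by the module, not configurable
  kms_key_id        = var.kms_key_id
}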


Modules dependency hell

On the other hand, we noticed customers that run into the dependency hell problem. They made their modules so small that upgrading a specific module across the whole environment source code becomes hell, because tons of dependent modules have to be upgraded to make a small change to one resource. By the way, we solved the problem of verifying module versions across the whole Terraform state with tooling, and it is open sourced: https://github.com/sysdogs/tfmodvercheck.

Figure. The runtime dependency graph of Mozilla Firefox
(Source: https://edolstra.github.io/pubs/phd-thesis.pdf)

Poor code quality

Keep it simple, stupid (KISS) and Don’t repeat yourself (DRY) are principles we strongly believe in for every Infrastructure as Code toolset.

Make the code look good. Make it an art. Make it nice and clean. Make it consistent. Make it understandable at first glance.


No tests

Automate all the things. Test all the things. While auditing our customers’ code, we noticed a recurring problem: Terraform is still not treated as real code, and Terraform code is not tested like real code.

A lack of static code analysis, deep inspection, unit tests and attention to code quality can cause a lot of problems with infrastructure development in the future, and it is extremely important to keep this in mind while writing Terraform code. With a proper continuous integration pipeline in place, you can save a lot of time by flagging non-functional code or possible security vulnerabilities before you even apply them to a real environment.


Improper functions usage

We have noticed that engineers use functions everywhere, even when they are not necessary. A good example is using `element(var.list, count.index)` instead of simply indexing with `var.list[count.index]` when we do not need the “wrap around” mechanism that `element` provides.
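
A minimal sketch, assuming a hypothetical var.subnet_ids list:

resource "aws_instance" "worker" {
  count = length(var.subnet_ids)

  ami           = var.ami_id          # hypothetical variable
  instance_type = "t3.micro"

  # Plain indexing is enough when the index never exceeds the list length:
  subnet_id = var.subnet_ids[count.index]

  # element() only makes sense when the index should wrap around:
  # subnet_id = element(var.subnet_ids, count.index)
}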


Variables overkill

Modules and variables are great abstraction layers, but they can be a double-edged sword. We have seen hundreds of modules that expose literally every resource argument as a variable. In the end we land with hundreds of variables for a module that is only responsible for provisioning S3 bucket resources. The same applies to outputs: output only the things that you will use elsewhere in the codebase.
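
A minimal sketch of a more restrained module interface (names are hypothetical): expose only what callers need, keep opinionated defaults inside, and output only what other code consumes.

variable "bucket_name" {
  type = string
}

variable "tags" {
  type    = map(string)
  default = {}
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
  acl    = "private"                # opinionated default, not a variable

  versioning {
    enabled = true                  # enforced by the module, not configurable
  }

  tags = var.tags
}

output "bucket_arn" {
  value = aws_s3_bucket.this.arn
}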


No provider management

Most of our customers, when they start working with Terraform, do not pin provider versions. This can lead to unexpected consequences if the provider we are using introduces breaking changes and we miss that one important delete in the Terraform plan.

provider "aws" {
  region = "eu-central-1"
}
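
Pinning the provider to a known-good version range makes upgrades an explicit, reviewable change:
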
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

You have read another long write-up from sysdogs! The sweety dog is something you really deserve. 🥃 Subscribe!

Updates

  • Update 2020-10-05: Add missing references.
