Toil

Automating Terraform Increased Productivity.

In 2019, when I joined Community.com as a Software Engineer, it was my first experience using Terraform to provision infrastructure.

Before that, I had some ad hoc experience with AWS on small projects, largely through the AWS dashboard. However, this was my first time on a team of more than fifteen engineers that ran a microservice architecture and continually provisioned or scaled infrastructure for its services.

The decision to use Terraform made sense. It gave us a declarative language for defining infrastructure as code, so every engineer could read what we had provisioned without navigating several dashboards and sub-dashboards.

Provisioning New Infrastructure

Adding or updating a resource on AWS was as simple as changing the codebase, opening a pull request, running terraform apply, and merging the pull request.

Yeah, it was that simple when it worked. Sadly, it did not work as often as it should have.

The thing is, Terraform is stateful. It keeps track of your infrastructure in a state file, which is named terraform.tfstate by default.

There were cases where developers ran terraform apply on their local machines and forgot to merge their PRs, so the real infrastructure drifted from what the merged code described.

To avoid the stress of resolving the state deviation, most developers would do one of the following:

  1. Pull in an infra team member to help them resolve the state deviation.
  2. Ask the infra team to run terraform apply on their behalf.
  3. Worse, apply only the specific resources they had changed, using terraform apply -target=resource_name.
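
The state deviation described above can be detected before anyone applies anything. The sketch below is only an illustration, not what the team ran: it relies on the -detailed-exitcode flag of terraform plan, which exits with 0 when there is nothing to change, 1 on error, and 2 when changes are pending. The function names and the workdir parameter are hypothetical.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_STATUS = {
    0: "in-sync",  # plan succeeded and found nothing to change
    1: "error",    # plan itself failed
    2: "drift",    # plan succeeded and found pending changes
}


def classify_plan_exit_code(code: int) -> str:
    """Translate a plan exit code into a label; unknown codes count as errors."""
    return PLAN_STATUS.get(code, "error")


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` (a hypothetical checkout of the merged code)
    and report whether the code and the provisioned resources agree."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

Run against the merged code, a "drift" result would suggest that someone applied changes that never made it into a merged PR.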

A separate issue affected developers in countries with slow or spotty internet connections: a terraform plan could sometimes take hours.

To get around this, many developers would ask an infra team member to run terraform apply on their behalf. That pulled the infra team away from their own work and, in some cases, significantly delayed core business features.

Which brings us to Toil

Google's SREs use the term “toil” to describe the manual, repetitive, automatable tasks that engineers do. Toil is not simply grunt work such as cleaning up monitoring dashboards or removing unused resources, nor is it overhead such as meetings, even when those are repetitive.

Toil Criteria

Google lists six characteristics of toil. A task does not have to meet all of them, but the more it meets, the more it qualifies as toil:

  1. Manual - requires human intervention, like running terraform apply.
  2. Repetitive - needs to be done frequently, like running terraform apply for every infra change or update.
  3. Automatable - can be automated, like running terraform apply in a CI/CD pipeline or as a GitHub Action.
  4. Tactical - interrupt-driven and reactive rather than planned. Our example might not seem to meet this, but pulling infra team members away from their work to run terraform apply for developers makes it tactical.
  5. No enduring value - this is subjective, because we do the work in order to add business value, but running terraform apply for developers adds no lasting value for the infra team.
  6. Scale - the work grows with the team. The more developers we have, the greater the chance that the state applied from their machines diverges.
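
As a rough way to apply the six criteria above, a task can be scored by how many of them it meets. This is a minimal sketch: the ToilAssessment class and its field names are made up for illustration, and the values for the local terraform apply workflow reflect the reasoning in the list rather than any measurement.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ToilAssessment:
    """One yes/no answer per toil criterion (hypothetical field names)."""
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    scales_with_growth: bool

    def score(self) -> int:
        """Count how many of the six criteria the task meets."""
        return sum(getattr(self, f.name) for f in fields(self))


# Running terraform apply by hand, as argued in the list above.
local_apply = ToilAssessment(
    manual=True,
    repetitive=True,
    automatable=True,
    tactical=True,
    no_enduring_value=True,
    scales_with_growth=True,
)
print(f"{local_apply.score()} of 6 criteria met")
```

A higher score does not prove that a task is toil, but it is a quick signal for which tasks deserve a closer look.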

Some good examples of toil in software engineering are:

  • On-call interrupts or incident response.
  • Manual deployments around big-bang releases.
  • Investigating and acknowledging repeated false positive alerts.
  • Repeated manual work for customer support, like bulk password resets or daily data imports.

Less Toil, Not No Toil

As software engineers, we deal with technical debt every day, and we understand that aiming to eliminate every form of it is unrealistic. If an engineering team dedicated all its time to eliminating technical debt, it would never ship any features. Our job isn’t to build perfect software by fixing every bug or working on every improvement. We are most valuable when we triage and prioritize the most important technical debt while shipping features that add value to the business.

The same applies to toil. We can’t eliminate all toil, but we can aim to fix the most impactful toil that affects our productivity and the business.

As with technical debt, our aim should be to prioritize the toil that leads to more toil.

In our example, the issues not only slowed down the product engineers and dampened their morale, they also hurt the focus and morale of the infra team. Ultimately, they hurt our ability to ship features and add value to the business.

How do we maintain this balance?

Google’s SRE team has a rule of thumb that no more than 50% of an SRE’s time should be spent on toil, and the rest on engineering work that adds value to the business.

SREs at Google are encouraged to track their toil and to work with their managers to stay within this limit over a meaningful period, such as a quarter or a year.
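
The arithmetic behind the 50% rule of thumb is simple enough to sketch. The function names below are hypothetical, and the hours in the usage lines are invented for illustration; they are not figures from Google or from our team.

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on toil over a tracking period."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def within_toil_budget(toil_hours: float, total_hours: float, cap: float = 0.5) -> bool:
    """True when toil stays at or below the cap (50% by default)."""
    return toil_fraction(toil_hours, total_hours) <= cap


# Hypothetical quarter: 12 weeks of 40 hours, 200 of them logged as toil.
quarter_hours = 12 * 40
print(f"{toil_fraction(200, quarter_hours):.0%} of the quarter spent on toil")
print("within budget:", within_toil_budget(200, quarter_hours))
```

Measuring over a quarter rather than a week smooths out the occasional bad on-call rotation.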

Our aim is to reduce toil as much as possible, but even addressing toil can impact morale and productivity.

Every engineer can relate to the feeling we get when we fix a bug that has been bothering us for weeks, or when we finally get to refactor a piece of code that has been slowing us down. It is a morale booster. However, when our backlog is consistently filled with fixing bugs or refactoring code without offering new features, it can become dull and demoralizing.

The same applies to toil. SREs can feel a sense of accomplishment when they automate a task that has been bothering them for weeks, but if all they do is automate tasks without working on new projects or improving the system, it can become dull and demoralizing.

Therefore, it is important to maintain the balance.

Conclusion

Toil is a part of building software systems in Agile environments, and it isn’t inherently bad. However, it is important to identify and address toil that impacts productivity and morale.

In our example, we solved this problem by using a GitHub Action that applied Terraform changes and merged approved PRs. Because plan and apply now ran in CI rather than on a developer's laptop, it eliminated both the spotty internet issue and the need for developers to run terraform apply on their local machines.

The ripple effect was that engineers felt more confident making infra changes, and the infra team could focus on more important tasks like improving the CI/CD pipeline or working on the next big infra project - things that added value to the business.
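
The order of operations in such an automation can be sketched as follows. This is not the actual GitHub Action, which was a workflow and not a Python script. It is a minimal illustration that assumes the terraform and gh command-line tools are installed and that PR approval is enforced elsewhere, for example by branch protection. The function names and the PR number in the usage note are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Run a command in the current directory and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def apply_and_merge(pr_number: int, run: Runner = run_command) -> bool:
    """Plan, apply, then merge; stop at the first failing step so that a PR
    is only merged after its changes were applied successfully."""
    steps = [
        ["terraform", "init", "-input=false"],
        ["terraform", "plan", "-input=false", "-out=tfplan"],
        ["terraform", "apply", "-input=false", "tfplan"],
        ["gh", "pr", "merge", str(pr_number), "--merge"],
    ]
    for cmd in steps:
        if run(cmd) != 0:
            return False
    return True
```

Calling apply_and_merge(123) from a CI job would run the four steps in order. Because the merge happens only after a successful apply, the merged code and the provisioned infrastructure stay in step, which is exactly what the local workflow failed to guarantee.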

As I learn more about SRE, I aim to build the skills to identify and address toil that impacts productivity and morale.