Hi there!

Migrating to K8s - 1

I joined our three-person infra team at work in October last year, and our core focus has been migrating our infrastructure from Mesos to Kubernetes, with a secondary yet important cost-saving goal, which we primarily wanted to achieve by right-sizing our resource provisioning.

Cost saving.

Naturally, to be safe, we tried to provision almost the same scale and capacity of resources in our K8s cluster as in our Mesos cluster: roughly equal (or greater) node capacity and sizes.

This post isn’t about the migration process (that’s coming soon), but about three main lessons I learnt through the migration.

Documentation and Infrastructure as Code (IaC).

When I joined in 2019, I had fairly okay AWS experience, and until then I’d mostly worked with smaller startups or as a freelancer, so I was accustomed to doing things directly in AWS.

So, for the first few months as a Software Engineer, having to go through Terraform to provision new infra felt like an unnecessary bottleneck.

Over time, I got accustomed to it and just accepted it as a part of my life.

It did feel like the right decision, and until eight months ago, I assumed I knew just how crucial it was.

We also have a terrific culture of documentation at work; this, coupled with IaC, made migrating resources far less chaotic.

You didn’t need to go down the AWS rabbit hole to figure out the security groups, network interfaces, or IAM permissions that our services or dependencies needed to function.

They were all declared in Terraform, and it was as simple as making code changes (😅).
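For a sense of what that looked like, here’s a minimal sketch of the kind of resource you could read straight from code - the names, VPC variable, and CIDR ranges are made up for illustration, not our actual setup:

```hcl
# Hypothetical security group for a service - names and ranges are illustrative.
resource "aws_security_group" "service_sg" {
  name        = "service-sg"
  description = "Allow internal HTTPS traffic to the service"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```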

The core benefit of IaC wasn’t just knowing our infra dependencies and the changes we needed to make to provision new resources, but the ease of rollbacks.

No migration is smooth sailing, so rollbacks tend to happen more often than you’d like. I can’t imagine how fun it would be to roll back or reprovision resources in AWS manually - I don’t wish it on my worst enemy.
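With everything in code, a rollback was roughly “revert the commit, re-apply” - a simplified sketch of the workflow (the exact repo layout and review steps are left out):

```sh
# Revert the offending change and re-apply the previous state of the infra.
git revert <bad-commit-sha>
terraform plan    # review exactly what will change back
terraform apply
```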

No matter how small a project is, I strongly recommend investing in IaC and in documenting as much as you can.

It’s invaluable to use your cloud provider’s managed services over ad-hoc in-house solutions.

Sometimes, self-hosting your system’s dependencies, like message queues, databases, and the like, can save you quite a lot of money. It can also give you the level of control you need over patches, upgrades, downgrades, etc., especially if you have expert hands internally.

However, people leave organizations. Experts leave, and that leads to knowledge gaps. The people left behind may not know where the bodies are buried, and during a migration, knowledge gaps can be your worst enemy.

Migrations are risky by nature; people anxiously want to maintain service quality and reliability for customers.

In our case, because most of our critical dependencies were services managed by our cloud provider, we had far fewer things to migrate, and there was sufficient documentation and OSS community material to figure out the internal changes we needed to keep using these services efficiently.

In a few cases, it made sense to move some of these managed resources into K8s (I know I just listed reasons why managed services are great) because of the cost benefits and well-maintained Helm charts, and the migration was hassle-free - in some cases, we even had the option of asking a technical support engineer if we were blocked.
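As a rough sketch of what a chart-based install looked like - the chart and values here are illustrative (a message queue from the Bitnami repo), not the exact services we moved:

```sh
# Install a self-hosted dependency from a community-maintained chart.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install queue bitnami/rabbitmq \
  --namespace messaging --create-namespace \
  -f values.yaml   # resource requests/limits, persistence, etc.
```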

Watch your metrics.

Metrics are an important part of our lives as developers. They help us assess our service health, usage, resources, etc.

They were incredibly useful for right-sizing because we wanted to make this decision based on actual usage estimates and not a “guesstimate.”

In most cases, we had about 90 days of historical data to help us get a good sense of average or maximum usage across the board.
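As an illustration of the kind of questions we asked of that data - assuming a Prometheus-style setup with the usual container metrics; the namespace label here is made up:

```promql
# Peak CPU usage (in cores) for a service over the last 90 days
max_over_time(
  sum(rate(container_cpu_usage_seconds_total{namespace="orders"}[5m]))[90d:1h]
)

# 95th percentile of memory usage over the same window
quantile_over_time(
  0.95,
  sum(container_memory_working_set_bytes{namespace="orders"})[90d:1h]
)
```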

I said “in most cases” because there were resources that had metrics agents set up but, for some weird reason, stopped polling data, either because the polling agents stopped working or because nobody ever confirmed that the metrics setup worked in the first place (most likely the former).

I don’t have a gotcha from this experience, but short of consistently checking your metrics, I think teams should set alerts on not receiving metrics they think are important after a period of time.

Often, we only alert when something exceeds, hits, or drops below a threshold. When you deploy a new service or resource, you might set much tighter thresholds for the first few days or weeks to ensure it’s working as expected. However, if the service or system isn’t essential or is very stable, we hardly check our dashboards or alerts to see whether we’re still getting data.

After a while, you never get paged, and you just assume all is well. It could be so for the lifespan of the service, but what if it isn’t, and you just don’t know because you weren’t receiving data and it hasn’t hit critical mass yet?

So, I think it makes sense to alert on not receiving any data after a period of time.
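For example, with Prometheus and Alertmanager (an assumption about the stack - the job label is hypothetical), a rule like this fires when a target has sent nothing for an hour:

```yaml
groups:
  - name: missing-metrics
    rules:
      - alert: MetricsMissing
        # absent_over_time only returns a value if the series had no samples in the window
        expr: absent_over_time(up{job="orders-exporter"}[1h])
        labels:
          severity: warning
        annotations:
          summary: "No metrics received from orders-exporter in the last hour"
```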

One could argue that metrics that aren’t being used should be removed, but I think it’s better to have them and not need them than to need them and not have them.