[Lesson Note] Introduction to SRE
What is SRE?
A comprehensive definition: SRE teams are responsible for how code is deployed, configured, and monitored, as well as for the availability, latency, change management, emergency response, and capacity management of services in production.
Essentially, I see the SRE as the person implementing DevOps principles in their “purest” form in an organisation - a mixture of software engineering and infrastructure work, with an ambition to automate and improve software delivery and reliability processes.
High level concepts
SLI vs SLO vs SLA
SLI, Service Level Indicator, is a quantitative metric that is used to measure the performance of a service. For example, successful requests per second, or average latency of a request. It could also be logs per second or specific error codes over a rolling window of time.
SLO, Service Level Objective, is a target value or range of values for a service level indicator. For example, 99.9% of requests should be successful, or 99% of requests should be served within 100ms. SLOs should be defined customer-centrically - with the customer’s experience in mind. For example, if a customer uses a service to find a product on an e-commerce website, the SLO should be defined on the time it takes for the customer to see a product after a search query is entered, not the time it takes for the search query to be processed on the database.
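As a rough illustration of a customer-centric latency SLO (a minimal sketch with made-up names and numbers, not a prescribed implementation), the search example could be checked like this:

```python
# Sketch: a customer-centric latency SLI/SLO for the search example.
# Latencies are hypothetical end-to-end timings (ms) as the customer sees them -
# from the search query being entered to results appearing - not just DB time.

SLO_TARGET = 0.99           # 99% of searches...
LATENCY_THRESHOLD_MS = 100  # ...should complete within 100 ms

def latency_sli(latencies_ms: list[float]) -> float:
    """Fraction of requests served within the latency threshold."""
    good = sum(1 for latency in latencies_ms if latency <= LATENCY_THRESHOLD_MS)
    return good / len(latencies_ms)

observed = [42.0, 87.5, 310.2, 95.1, 60.3]  # hypothetical samples
sli = latency_sli(observed)
print(f"SLI = {sli:.2%}, SLO met: {sli >= SLO_TARGET}")
```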
SLOs should be realistic targets to optimize for a good customer experience without sacrificing too much engineering effort that could have been spent on other features.
SLAs are the agreements between the business and its customers. They also define the consequences of not meeting the promised service levels, and how often those levels must be met. Generally, internal SLOs should be slightly stricter than the SLA targets, to give the engineering teams a buffer to respond to incidents without breaking the SLA.
These metrics should be defined by a quorum of stakeholders, including the engineering team, product team, and the business team. The engineering team should be able to provide the technical feasibility of the SLOs, and the business team should be able to provide the business impact of the SLOs. The product team should be able to provide the customer impact of the SLOs.
Error Budget
100% availability is impossible and should not be the goal. The goal should be to provide a good customer experience while still being able to ship features to customers.
Essentially, the error budget is the difference between the SLO and 100%. For example, if the SLO is that 99% of files should be uploaded successfully within 30 days, the error budget is the 1% of files that may fail to upload within those 30 days. Therefore, as long as at least 99% of files upload successfully, developers can deprioritize chasing the remaining 1% and focus on other features.
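A small sketch of that arithmetic, with hypothetical numbers:

```python
# Sketch: tracking the error budget for a "99% of files upload successfully
# within 30 days" SLO. All numbers are hypothetical.

SLO = 0.99
error_budget = 1 - SLO               # 1% of uploads may fail in the window

total_uploads = 120_000              # uploads attempted so far this window
failed_uploads = 650

allowed_failures = total_uploads * error_budget    # 1,200 failures allowed
budget_consumed = failed_uploads / allowed_failures

print(f"Error budget consumed: {budget_consumed:.0%}")
# ~54% consumed: there is still room to ship features; as this approaches
# 100%, reliability work should take priority over new features.
```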
The error budget is a good way to balance between reliability and velocity. It allows the engineering team to focus on shipping features while still maintaining a good customer experience.
It is important to see the error budget as how much “unreliability” can we tolerate for velocity. The more critical the service or path, the lower the error budget should be. For example, if the service is a payment service, the error budget should be lower than a service that is used to display a list of products.
Note 1: The error budget is not a license to ignore the 1%. The 1% should still be investigated and fixed, but it does not have to take priority over shipping features to customers while the budget holds.
Note 2: Depending on business requirements, certain stakeholders may have the right to override the error budget. We may have to prioritise feature delivery over reliability, but it should not be the norm. The SLOs and error budget should be reviewed periodically to ensure they are still relevant.
Error budgets are important. They ensure services do not degrade for the sake of shipping features, and that the engineering team is not overworked by responding to every incident.
Toil & Toil Budget
Toil is repetitive manual work that can be automated. For example, the manual process of setting up a new service - GitHub repo, code boilerplate, CI setup, etc. Toil is not operational work like responding to an incident, fixing a bug or writing an incident report. It is important to think about it as work that can be (partially) automated.
Depending on the scale of the engineering team, toil could even be a script that has to be run manually over and over.
Not all toil should be eliminated - automation has its own cost. Toil should be eliminated if it is repetitive and adds up to a significant amount of work. For example, if a task takes 5 minutes to complete and is done once a month, it is probably not worth automating. However, if a task takes 5 minutes to complete and is done 10 times a day, it should be automated.
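A quick, back-of-the-envelope way to apply that rule is to compare what the task costs per year against what automating it would cost. The numbers below are hypothetical:

```python
# Sketch: a rough break-even check for automating a manual task.
# Numbers mirror the examples above and are hypothetical.

def yearly_cost_minutes(minutes_per_run: float, runs_per_day: float) -> float:
    """Time the manual task costs per year, in minutes."""
    return minutes_per_run * runs_per_day * 365

monthly_task = yearly_cost_minutes(5, 1 / 30)   # ~61 minutes/year
daily_task = yearly_cost_minutes(5, 10)         # 18,250 minutes/year (~304 hours)

automation_cost_minutes = 2 * 8 * 60            # say, two engineer-days to automate

for name, cost in [("5 min, once a month", monthly_task),
                   ("5 min, 10 times a day", daily_task)]:
    print(f"{name}: {cost:.0f} min/year -> worth automating: {cost > automation_cost_minutes}")
```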
The level of toil in an organization can increase to the point where the organization won’t have the capacity needed to stop it, a condition known as
Blameless Postmortems: Learning from Failure
Software systems are complex, and they fail. It is crucial to learn from these failures to prevent recurrence. When an incident happens, teams have to come together to review the incident: what happened, how and why. They also have to review how the incident was handled, and how it was (or could have been) resolved. This process is called a postmortem.
A Blameless Postmortem takes the focus off finding a scapegoat, and instead focuses on learning from the incident to improve the process and the overall systems. Blameless does not mean cause-less, though: we still investigate the root cause of the issue, and if it involves human error, we investigate why that error occurred with the intention of preventing it from happening again.
It is a collaborative process that should involve all relevant stakeholders.
High level content of a postmortem:
- Summary of the incident
- Timeline of the incident
- Root cause analysis (what, why, how)
- Incident Impact
- Action Items
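One way to keep these sections consistent across incidents is to capture them in a small structured record. This is only a sketch - the field names are assumptions, not a standard template:

```python
# Sketch: a structured postmortem record mirroring the sections above.
# Field names are illustrative, not a standard template.
from dataclasses import dataclass, field

@dataclass
class Postmortem:
    summary: str                 # what happened, in a few sentences
    timeline: list[str]          # timestamped events, from detection to resolution
    root_cause: str              # what, why and how it happened
    impact: str                  # customers, requests or revenue affected
    action_items: list[str] = field(default_factory=list)  # owned, trackable follow-ups
```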
Why?
Removing blame from the process provides psychological safety for the team. It allows the team to focus on the incident, and not on the blame. It also allows the team to be more open and honest about the incident, and how it can be prevented in the future.
Also, it ensures we are not just fixing the symptoms of the problem, but its root cause - the most efficient way to prevent the incident from recurring.
Perfect example from Google:
Pointing fingers: "We need to rewrite the entire complicated backend system! It’s been breaking weekly for the last three quarters and I’m sure we’re all tired of fixing things onesy-twosy. Seriously, if I get paged one more time I’ll rewrite it myself…"
Blameless: "An action item to rewrite the entire backend system might actually prevent these annoying pages from continuing to happen, and the maintenance manual for this version is quite long and really difficult to be fully trained up on. I’m sure our future on-callers will thank us!"
SRE at a Startup
A startup might not be able to hire or invest in a dedicated SRE team; however, it can still adopt SRE principles. For example:
- It can adopt the error budget, and use it to balance between reliability and velocity.
- It can also adopt the blameless postmortem process to learn from incidents.
- Startups can also invest in regular experimentation through hackathons, where engineers can work on tools to reduce toil and improve the reliability of their systems.
- Startups can build a culture of documentation, and knowledge sharing. This will go a long way in ensuring that on-call engineers have the information they need to resolve incidents, and are not “gambling” with the system.
- Startups can invest in exploratory observability tools to help them understand their systems better.
Quick notes
- SLIs should be explicit and detailed. For example, the number of successful HTTP requests per second is a decent SLI to track, but it could have different values depending on where it is measured - at the Application Load Balancer or on the application itself. It is important to be explicit about where the SLI is measured.
- SLI is often calculated as a ratio of successful requests to total requests. For example, if 100 requests are made and 90 are successful, the SLI is 90%:
  SLI = (successful_requests / total_requests) * 100
- Every member of the team must derive SLI data from the same trustworthy source. Our monitoring and telemetry systems need to be reliable and consistent.
- It is also important to consider the tradeoffs when we choose the source of truth for our SLI. For example, some applications might care about the packets of data that never reach the server. In this case, the source of truth should be as close to the edge as possible - with enough metadata to make sense of the data.
- SLO recipe: the thing we want to track, the target proportion of that thing, and a specific time period. For example, 99% of requests should be successful within 30 days (see the SLO sketch at the end of these notes).
- SLO numbers can be derived from speaking with customers. For example, if a customer will use our product 80% of the time, we can set our SLO between 82-85%.
- SLO numbers can, and more often are, derived from industry standards. However, industry standards should be contextualized to our unique situation - product, team capacity, resources, customers, business goals, etc. Don’t go chasing Google’s SLOs - or maybe you should, and be MixPanel.
- SLO numbers can also be derived from existing metrics. However, we should consider that they might not truly represent the needs of our customers. They can either be too perfect or too poor.
- Shift from just monitoring to observability. Observability is the ability to ask questions about the system. It is all about collecting the data points that allow us to ask questions about the health of our service proactively, enabling us to check that things are okay rather than waiting for, and reacting to, an incident.
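To make the SLO recipe above concrete, here is a minimal sketch (names and numbers are hypothetical, not a prescribed implementation) of an SLO as "the thing we track, the target proportion, and a time window", evaluated from the ratio-based SLI:

```python
# Sketch: an SLO expressed as "metric + target proportion + time window",
# evaluated from the ratio-based SLI. Names and numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class Slo:
    description: str   # the thing we want to track
    target: float      # the target proportion, e.g. 0.99
    window_days: int   # the time period

availability_slo = Slo("successful HTTP requests at the load balancer", 0.99, 30)

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI as a ratio of successful requests to total requests."""
    return successful_requests / total_requests

sli = availability_sli(successful_requests=989_000, total_requests=1_000_000)
print(f"SLI = {sli:.2%} over {availability_slo.window_days} days, "
      f"SLO met: {sli >= availability_slo.target}")
```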
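Following the last note on observability, one common way to make a system easier to ask questions of is to emit one structured, wide event per request rather than only pre-aggregated counters. A rough sketch, with made-up field names:

```python
# Sketch: emitting a structured, wide event per request so we can ask ad-hoc
# questions later (e.g. "which customers saw slow checkouts from eu-west-1
# yesterday?"). Field names are illustrative, not a standard schema.
import json
import time

def emit_request_event(route: str, status: int, duration_ms: float, **context):
    event = {
        "timestamp": time.time(),
        "route": route,
        "status": status,
        "duration_ms": duration_ms,
        **context,  # customer_id, region, build_version, feature flags, ...
    }
    print(json.dumps(event))  # in practice, shipped to an observability backend

emit_request_event("/checkout", 200, 182.4, customer_id="c-123", region="eu-west-1")
```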