An Introduction to Site Reliability Engineering
A lot of software engineers forget why their software is successful (or why it's not).
The #1 reason is that your customers love your software. But making good software is hard: the stars must align, and tech, product, marketing, and sales all have to come together.
On the tech side, it isn't all about creating new features.
It's also about ensuring the quality, integrity, and security of our services. We want to make our delivery process as effective as possible. And one important part of delivery is collecting and learning from our data.
This is the role of the Site Reliability Engineer (SRE).
What is SRE?
Fundamentally, it’s what happens when you ask a software engineer to design an operations function. – Ben Treynor, in a Google interview
As technology advances, many new roles get created.
One of these roles is the site reliability engineer (SRE). Benjamin Treynor originated the term in 2003, when he was tasked with running a production service at Google with a team of seven engineers.
He had to ensure that the services would be available, reliable, and as serviceable as possible.
As he was a software engineer, he approached operational problems from a new lens. His team eventually became Google's present-day SRE team.
The reason this whole initiative started at Google was that there was a division between developers and operations.
They had different goals – the developer team wanted to launch new features and see how users reacted to them. The operations team, on the other hand, wanted to make sure that the services didn't break.
When each team has its own goals and its own ways of doing things, achieving business goals is hard.
SRE was Google's solution to this. They would be the bridge between developers and operations.
But how do they do that?
They use software tools to automate IT infrastructure tasks such as system management and application monitoring. SREs ensure that applications remain reliable despite frequent updates from development teams. All of this is done in a scalable way using software tools.
But now you might have the question that almost everyone asks...
What's the difference between SRE and DevOps?
SRE and DevOps are both approaches that aim to improve the collaboration between developers and operations, but they have different focuses and perspectives.
Here are the key differences between SRE and DevOps:
- Focus – SRE focuses on ensuring the reliability, availability, and performance of a software system. DevOps, on the other hand, is more about streamlining the collaboration between development and operations teams.
- Roles and Responsibilities – SRE is typically a specialized role within an organization (mostly large). DevOps, on the other hand, is more of a philosophy that encourages collaboration and shared responsibility across development, operations, and other relevant teams.
- Metrics and Objectives – SRE has a strong focus on service-level objectives (SLOs) and error budgets. DevOps, on the other hand, emphasizes metrics like deployment frequency, lead time, and mean time to recovery (MTTR). The goal is to achieve shorter development cycles, faster deployment, and quicker recovery from failures.
- Automation and Tooling – SREs heavily rely on automation to manage and operate systems at scale, leveraging software engineering practices to build and maintain reliable infrastructure. DevOps also emphasizes automation but focuses more on the automation of the software delivery pipeline, including continuous integration, testing, deployment, and monitoring.
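The error budget mentioned above is worth making concrete: an SLO implicitly defines how much unreliability you are allowed to "spend". A minimal sketch, where the 99.9% target and 30-day window are example numbers rather than a standard:

```python
# Illustrative error-budget math for an availability SLO.
# The SLO value and window below are example inputs, not prescribed values.

def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of allowed unavailability for a given SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

# A 99.9% SLO over a 30-day window leaves roughly 43.2 minutes of budget.
budget = error_budget_minutes(0.999, 30)
print(round(budget, 1))  # → 43.2
```

As long as the service stays within this budget, the team can keep shipping features; once the budget is spent, reliability work takes priority.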
Core Responsibilities of an SRE
Most DevOps and IT professionals focus solely on improving developer processes, forgetting about their systems in production.
But the vast majority of application and infrastructure costs happen after deployment. This is one of the main reasons why we should dedicate more time to our services in production.
To do that without disrupting velocity (time to deliver features), organizations form SRE teams dedicated to the continuous improvement of production services.
SRE roles may differ from company to company, but they usually cover the same core areas of responsibility:
Automation
One of the core responsibilities of an SRE is automating toil.
Toil is manual, repetitive operational work that could be automated but isn't – things such as deployment, monitoring, logging, incident management, and so on.
Remember, SRE fixes operational issues with the mindset of a software engineer, so that solutions can be reproduced across the whole organization.
Monitoring
The first job of an SRE is to gain some situational awareness. They would probably ask questions such as:
- How are our services performing?
- What's our average latency?
- Are nodes failing or are they all up?
The solution to this is to implement a monitoring solution. But it's important to know that we shouldn't just monitor everything. It's up to the SRE team to decide which metrics are important for the business.
The other important factor is that our monitoring solution should provide a holistic view of a system's health. Anyone in the organization should be able to look at a single source of truth to determine the overall performance and availability of the services they support.
But don't worry, it's not like every company's metrics are unique. Google has studied and come up with four key metrics that every company should track.
They are called the Four Golden Signals. (We will talk about them in depth later on in this article.)
Availability
Availability is defined as the amount of time a service, device, database, or other IT infrastructure is usable.
Being on doesn't always mean being usable. For example, an HTTP service that consistently returns 500 errors is not available. To address this, companies set service-level objectives, agreements, and indicators (SLOs, SLAs, and SLIs) for each service.
These terms may seem a bit vague to you, so let's define them:
- Service Level Indicators (SLI) – The metric used to define the performance or quality of a service. For example, an SLI for a web service might be the average response time in milliseconds.
- Service Level Objectives (SLO) – It is a target or goal defined for one or more SLIs. SLOs define the desired level of performance, reliability, or quality that a service provider aims to achieve. They specify the acceptable values or ranges for SLIs. SLOs are typically defined based on the requirements and expectations of customers or users. For example, an SLO might state that the average response time of a web service should be less than 200 milliseconds for 99% of the requests over a one-week period.
- Service Level Agreements (SLA) – It is a formal contract or agreement between a service provider and its customers or users. It outlines the agreed-upon levels of service, performance, and support that the provider will deliver. SLAs often include specific SLOs and define the consequences or remedies if the agreed-upon service levels are not met.
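The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is a simplified illustration: the 200 ms threshold and 99% target mirror the example SLO above, and the latency samples are made up:

```python
def latency_sli(latencies_ms, threshold_ms=200):
    """SLI: the fraction of requests served faster than the threshold."""
    good = sum(1 for t in latencies_ms if t < threshold_ms)
    return good / len(latencies_ms)

# Hypothetical response times for ten requests, in milliseconds.
samples = [120, 180, 90, 250, 110, 95, 140, 160, 130, 100]

sli = latency_sli(samples)       # 0.9 – 90% of requests were under 200 ms
slo_met = sli >= 0.99            # False – the 99% objective was missed
```

In a real system the SLI would be computed over millions of requests from monitoring data, but the logic is the same: measure the indicator, then compare it against the objective.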
SRE teams work to define the SLIs, SLOs, and SLAs for the organization. Once those values are defined, they work on improving them over time, adding greater business value.
As the saying goes,
You can't improve something if you don't measure it.
Performance
Once your team has clarity over its availability, it can focus on improving service performance metrics like latency, page load speed, and so on.
So you could ask questions such as:
- Which services are frequently failing?
- Are customers experiencing slow page loads or high latency?
SRE teams not only help developers fix bugs, but also help identify performance bottlenecks.
Preparation
In my opinion, SRE teams bring the organization more clarity.
Clarity about their services, their performance, availability, incidents, and so on. This makes the organization more prepared, enabling development teams to deploy new features quickly and respond to incidents faster.
Prepared organizations know the status of their systems and how to respond when there are issues. SRE teams aren't there to fix the issues themselves, but they provide the tools and the bridge for developers and operations to work together effectively to tackle issues.
Four Golden Signals
The golden signals help us define what it means for a system to be "healthy". They are meant to be simple and a starting block for any monitoring strategy.
Let's take a look at the four signals.
Latency
Latency is the time it takes to serve a request.
In the beginning, you can define a benchmark for how long a request should take – for example, anywhere between 50 and 100 ms is good.
You can then monitor the latency of successful requests against failed requests to keep track of the health of the service. Tracking the services across your organization can help identify services that are not performing well.
Traffic
Traffic, also called "throughput", refers to the load on the system. The more traffic, the more stress on the system.
Traffic is important because it helps you know how much load your service can handle. Your service might be fine with 1k requests, but what about 100k? Traffic can mean different things depending on the type of service: an API would count the number of requests, while for a website it can be the number of concurrent users.
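For a request-counting service, the traffic signal often boils down to requests per second. A small sketch with invented timestamps:

```python
from collections import Counter

def requests_per_second(timestamps):
    """Bucket request timestamps (epoch seconds) into per-second counts."""
    return Counter(int(t) for t in timestamps)

# Hypothetical arrival times of six requests.
ts = [10.1, 10.5, 10.9, 11.2, 11.3, 12.0]

rps = requests_per_second(ts)   # {10: 3, 11: 2, 12: 1}
peak = max(rps.values())        # the busiest second saw 3 requests
```

Tracking the peak rather than the average matters here: capacity problems show up at the busiest second, not the typical one.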
Errors
Error rate defines the rate of requests that are failing. SRE teams are tasked with monitoring the error rates of the entire system and the individual services.
They also need to categorize the errors, whether they are from invalid client requests or actual bugs in the code. Critical errors need to be identified as quickly as possible so that they can be fixed.
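The categorization step above can be sketched directly from HTTP status codes: 4xx responses point at invalid client requests, while 5xx responses usually mean bugs or failures on our side. The sample codes are made up:

```python
from collections import Counter

def categorize_errors(status_codes):
    """Split responses into client errors (4xx), server errors (5xx), and ok."""
    buckets = Counter()
    for code in status_codes:
        if 500 <= code < 600:
            buckets["server_error"] += 1
        elif 400 <= code < 500:
            buckets["client_error"] += 1
        else:
            buckets["ok"] += 1
    return buckets

# Hypothetical responses from a service.
codes = [200, 200, 404, 500, 200, 503, 401]

buckets = categorize_errors(codes)
server_error_rate = buckets["server_error"] / len(codes)  # 2 of 7 requests
```

Separating the two buckets keeps alerts honest: a spike in 4xx might just be a misbehaving client, while a spike in 5xx is a critical error that needs fixing quickly.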
Saturation
Saturation measures how "full" a service is – how close it is to its overall capacity.
It's a high-level overview of the resource utilization of the system. SRE teams would ask questions such as these to get the saturation of a specific service:
- What's the current utilization of the service?
- How much more capacity can it have?
- How much more can the service take before it starts to degrade?
- Does the current level of saturation satisfy the customer/business requirements?
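The questions above amount to a headroom calculation. A minimal sketch in Python, where the 80% degradation threshold and the queries-per-second figures are assumptions for illustration:

```python
def headroom(current_qps, capacity_qps, degrade_at=0.8):
    """Current utilization and remaining load before assumed degradation.

    degrade_at is a hypothetical utilization threshold at which the
    service is assumed to start degrading, not a standard value.
    """
    utilization = current_qps / capacity_qps
    remaining = max(0.0, degrade_at - utilization) * capacity_qps
    return utilization, remaining

# Hypothetical service handling 600 qps with a 1,000 qps capacity.
util, room = headroom(600, 1000)  # 60% utilized, 200 qps of headroom left
```

Watching headroom rather than raw utilization makes the capacity question actionable: it tells you how much more traffic you can absorb before customers start to notice.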
Taking Action
The four signals are meant to be a starting point for your organization's monitoring journey.
Tracking the latency, traffic, errors, and saturation will enable all teams to identify issues faster. It also allows teams to see the overall health of the system, not just the individual services. You can also have different views for different roles such as marketing, sales, business, etc...
Effective monitoring will give you the ability to see the health of your organization and see where you can work next. So getting this right is a crucial step.
Conclusion
In summary:
- SRE is the bridge between developers and operations.
- The SRE's four golden signals are latency, traffic, errors, and saturation.
- Effective monitoring will give you the ability to see the health of your organization and see where you can work next.
- The four golden signals help us define what it means for a system to be "healthy".
Thanks for reading, below I've compiled a list of resources that are useful if you want to learn more about Site Reliability Engineering.