rss-bridge 2023-01-25T17:04:00+00:00

Episode 548: Alex Hidalgo on Implementing Service Level Objectives

Alex Hidalgo, principal reliability advocate at Nobl9 and author of Implementing Service Level Objectives, joins SE Radio’s Robert Blumen for a discussion of service-level objectives (SLOs) and error budgets. The conversation covers the meaning of a service level; service levels and product ownership; the pervasive nature of imperfection; and why trying to be perfect is not cost-effective. They examine service-level indicators (SLIs) and SLOs and how to define each effectively. Hidalgo clarifies differences between SLOs and service-level agreements (SLAs), as well as whether traditional metrics such as CPU and memory are good SLOs. The episode examines how to define error budgets and policies to influence engineering work, how to tell if your project is under or over budget, and how to respond to being over budget, as well as how to derive value from using up excess error budget.

Show Notes

Nobl9 – Company

Implementing SLOs, from Google’s SRE Workbook (Google – Site Reliability Engineering), by Steven Thurgood, David Ferguson, Alex Hidalgo, and Betsy Beyer

Alex Hidalgo on Twitter: https://twitter.com/ahidalgosre

Implementing Service Level Objectives [book] by Alex Hidalgo

https://www.amazon.com/Implementing-Service-Level-Objectives-Practical/dp/1492076813/

From SE Radio

Episode 325: Tammy Butow on Chaos Engineering

Episode 453: Aaron Rinehart on Security Chaos Engineering

Episode 276: Björn Rabenstein on Site Reliability Engineering

Episode 455: Jamie Riedesel on Software Telemetry

Episode 429: Rob Skillington on High Cardinality Alerting and Monitoring

Alex in His Own Words

Intro to SLOs with Alex Hidalgo

Why Alex Believes in Breaking Things

Implementing SLOs

Transcript

Transcript brought to you by IEEE Software magazine.

This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Robert Blumen 00:00:17 For Software Engineering Radio, this is Robert Blumen. Today I have with me Alex Hidalgo. Alex is a site reliability advocate at Nobl9. Prior to his current role, he was director of SRE at Nobl9 and has spent time at Squarespace and Google. Alex is the author of the book Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets, published in 2020. And that will be the subject of our conversation today. Alex, welcome to Software Engineering Radio.

Alex Hidalgo 00:00:55 Thanks so much for having me. I’m excited to be here.

Robert Blumen 00:00:57 Alex, do you have anything else to say about your biography that I didn’t already cover?

Alex Hidalgo 00:01:03 One thing I do like to always talk about is the fact that I spent most of my twenties not in the technology industry. I didn’t join Google until I was 28, and I spent most of my twenties working in the service industry front of house and back of house in restaurants. So, server, line cook, bartender, I worked in warehouses, I worked at a furniture company. And the reason I like bringing that up is because, as we’ll get into, service level objectives are all about providing a certain level of service for people. And that’s exactly what you do in all those other industries. And I think that’s one of the reasons the whole approach really kind of stuck with me. And one of the reasons I got so excited about it is because it really spoke to all my experience before I moved into tech.

Robert Blumen 00:01:45 Cool. Well, we will be talking about service-level objectives. Before we dive into that, I want to frame this discussion. If an organization is thinking of adopting the approach that’s outlined in your book, what problem are they trying to solve when they’re doing that?

Alex Hidalgo 00:02:04 So service-level objectives, at their absolute most basic, are the acceptance that failure occurs, right? You are never going to be 100% reliable; you’re never going to hit 100% of any kind of target. Something at some point in time is going to break; something at some point in time is going to change. And service-level objectives at their most basic are just saying, okay, we understand this. So instead of trying to aim for perfection, let us try to aim for the right amount, right? Pick a reasonable target. SLOs are basically a codified version of ‘don’t let great be the enemy of the good.’ Because if you are attempting to hit 100% of anything, whether that be what I define reliability as or easier things to think about, like error rates and availability for your computer services, if you’re trying to be 100% perfect there, you’re just not going to hit it.

Alex Hidalgo 00:02:53 And if you try to, you’re going to spend way too much, both in your humans who will get burnt out as well as literally finances, right? The amount of money you have to spend to make systems redundant enough and highly available enough to even attempt to hit something like a 100%, it’s just going to cost you too much money. It’s going to cost you too much stress, you’re going to burn your employees out. So, use an SLO-based approach to help you think about what should we really be aiming for? What do our users actually need from us, and how can we keep them happy, the business happy, and our employees happy?

Robert Blumen 00:03:26 If an organization is thinking about adopting the approach outlined in your book, how are they probably doing this now that maybe is not working, to the point where they need to look at a different way of doing it?

Alex Hidalgo 00:03:38 So, very often there is a push from the top to be as good as possible, and I don’t think there’s anything wrong with potentially striving for excellence, right? SLO-based approaches are not about being lazy, they’re not about losing sight of trying to be the best you can be, but without explicitly setting targets, without explicitly saying something like, we want to be reliable. Or let me give you an example, right? You run a retail website of some sort, and users log in, and they add items to a shopping cart, and they are able to check out. And sometimes that’s not going to work. One of those steps is going to fail, right? Maybe a user can’t log in, maybe the shopping cart microservice is flaky and they can’t get that working, right. Or sometimes you check out and the vendor you rely upon for your credit card processing is having a problem.

Alex Hidalgo 00:04:33 And at some point in time that’s going to fail. And that’s totally fine. Humans are actually cool with that as long as you don’t fail too often, right? So, what you can do is you can use SLOs to say something like, all right, let’s aim to have 99.9% of all of our checkouts work. So only one in a thousand users will encounter some kind of error. Especially with the understanding the user can then generally just retry and it’ll very often work the second time around. It’s about being realistic about what’s actually possible while also realizing that humans are actually okay with some amount of failure. They can absorb a certain amount of failure. And let that happen instead of spending too much time and burning your employees out by trying to be too good.
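
Editor's illustration: the 99.9% checkout target described above can be turned into a simple error-budget calculation. The sketch below is not from the episode. The function names, the one-million-checkout volume in the usage note, and the choice of Python's Fraction type (used only to avoid floating-point rounding) are all assumptions made for the example.

```python
from fractions import Fraction

# Hypothetical target taken from the checkout example: 99.9% of checkouts succeed.
CHECKOUT_SLO = Fraction(999, 1000)


def allowed_failures(total_events: int, slo_target: Fraction) -> Fraction:
    """Return how many failed events the target tolerates (the error budget)."""
    if not 0 <= slo_target <= 1:
        raise ValueError("slo_target must be between 0 and 1")
    return total_events * (1 - slo_target)


def budget_remaining(total_events: int, failed_events: int,
                     slo_target: Fraction) -> Fraction:
    """Return the unspent share of the error budget.

    1 means nothing has been spent, 0 means the budget is exactly used up,
    and a negative value means the service is over budget.
    """
    budget = allowed_failures(total_events, slo_target)
    if budget == 0:
        # A 100% target (or zero traffic) leaves no budget to measure against.
        raise ValueError("no error budget available for these inputs")
    return 1 - failed_events / budget
```

Under these assumptions, one million checkouts at a 99.9% target tolerate 1,000 failures, which matches the "one in a thousand" figure. With 250 failed checkouts, three quarters of the budget is still unspent; with 1,500, the result is negative, meaning the service is over budget.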

[...]
