rss-bridge 2023-06-22T21:54:00+00:00

SE Radio 569: Vladyslav Ukis on Rolling out SRE in an Enterprise

Vladyslav Ukis, author of the book Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations, discusses how to roll out SRE in an enterprise. SE Radio host Brijesh Ammanath speaks with Vlad about the origins of SRE and how it complements ITIL (Information Technology Infrastructure Library). They examine how firms can establish foundations for rolling out SRE, as well as how to overcome challenges they might face in adopting. Vlad also recommends steps that organizations can take to sustain and advance their SRE transformation beyond the foundations.

Show Notes

Related Episodes

Episode 548: Alex Hidalgo on Implementing Service-Level Objectives

Episode 544: Ganesh Datta on DevOps vs Site Reliability Engineering

Episode 455: Jamie Riedesel on Software Telemetry

Episode 276: Björn Rabenstein on Site Reliability Engineering

Other References

Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations (Amazon link)

Transcript

Transcript brought to you by IEEE Software magazine.

This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Brijesh Ammanath 00:00:17 Welcome to Software Engineering Radio. I’m your host, Brijesh Ammanath. And today my guest is Vladyslav Ukis. Vlad is the head of R&D at Siemens Healthineers Teamplay digital health platform and reliability lead for all of Siemens Healthineers digital health products. Vlad is also the author of the book Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations. Vlad, welcome to Software Engineering Radio. Is there anything I missed in your bio that you would like to add?

Vladyslav Ukis 00:00:47 Thank you very much, Brijesh, for inviting me and for introducing me. I think you’ve covered everything. So looking forward to getting started with the episode.

Brijesh Ammanath 00:00:57 Great. We have covered SRE previously on SE Radio in episode 548 where Alex discussed implementing service level objectives, episode 544 where Ganesh discussed the differences between DevOps and SRE, episode 455 where Jamie talked about software telemetry, and episode 276 where Björn talked about site reliability engineering as a subject. In this episode, we will talk about the foundations of implementing SRE within an organization and I’ll also make sure that we link back to all those previous episodes in the show notes. To start off Vlad, can you give me a brief introduction on what SRE is and how it differs from traditional ops?

Vladyslav Ukis 00:01:39 Let me start by giving you a little bit of history of SRE. SRE is a methodology that’s called site reliability engineering, and it was conceived at Google because Google had a big problem many years ago, which was that Google was growing and the number of people required to operate Google was also growing, and the problem was that Google was growing so fast that it became impossible to hire operations engineers in line with the growth of Google. And they were looking for solutions to that problem: How can you grow a web property in such a way that you don’t require a linear growth of operations personnel in order to run the site? And that led to the birth of SRE approaches, which they then several years later wrote up in the well-known SRE books by Google, and this is where it’s coming from. So it’s got its origins in a way of setting up operations in such a way that you can grow the site, the web property, and at the same time you don’t need to grow linearly the personnel that’s required to run it.

Vladyslav Ukis 00:03:04 So it’s got a very business-oriented approach and digging deeper, it’s got its origins in software engineering. At Google, there is a saying that SRE is what happens when you task software engineers with designing the operations function of the enterprise. And it’s true. Once you dig into this, you see the software engineering approach inside SRE. How it’s different from the traditional way of operating software is that it’s got a set of primitives that enable you to create good alignment of the organization on operational concerns because it gives the participants in a software delivery organization clear roles to fulfill, and using that then the alignment can be brought about if an organization is serious about implementing SRE. And once that alignment is there, then it’s possible to do the alerting of the operations engineers, not just on the traditional IT parameters (like for example, CPU is too high or the memory is too low), but you actually are able to alert on the symptoms that are really experienced by the users. So you are alerting on the higher-level stuff, so to speak, that’s really felt by the user. And once you do this, then the alerts are also much more meaningful to the operations engineers running the site because then there is a clear connection between the alert and the user experience, and with that the motivation to fix the problem is high. And also you don’t get as many problems, you don’t get as many alerts as you would if you just alert on the IT parameters like CPU usage is too high and things like that.
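
The contrast Vlad draws here, alerting on IT parameters versus alerting on symptoms felt by users, can be sketched roughly as follows. This is a minimal illustration only: the `Sample` type, the function names, and the thresholds (90% CPU, 1% failed requests) are assumptions made up for the example and do not come from the episode or from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement window for a service (hypothetical shape)."""
    cpu_percent: float     # host-level resource metric
    requests: int          # requests served in the window
    failed_requests: int   # requests that users saw fail


def resource_alert(s: Sample, cpu_limit: float = 90.0) -> bool:
    """Traditional alert: fires on an IT parameter, whether or not users notice."""
    return s.cpu_percent > cpu_limit


def symptom_alert(s: Sample, max_error_ratio: float = 0.01) -> bool:
    """Symptom-based alert: fires only when users experience too many failures."""
    if s.requests == 0:
        return False  # no traffic, so no user-visible symptom in this window
    return s.failed_requests / s.requests > max_error_ratio


# High CPU, but users are fine: only the resource alert fires.
busy_but_healthy = Sample(cpu_percent=95.0, requests=10_000, failed_requests=3)

# Low CPU, but 8% of requests fail: only the symptom alert fires.
idle_but_broken = Sample(cpu_percent=20.0, requests=10_000, failed_requests=800)
```

In this sketch the first sample would page someone under the traditional rule even though users are unaffected, while the second sample stays silent under the traditional rule despite a clear user-facing problem.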

Brijesh Ammanath 00:05:01 I like the quote when you say SRE is what happens when you get software engineers to design operations and run it. And I believe that also implies that software engineers will implement software engineering design principles, like continuous integration and engineering principles around measurability?

Vladyslav Ukis 00:05:18 Yeah, so in terms of the software engineering approach in SRE, fundamentally what SRE brings to the table is this: imagine you’ve got a software engineering team and the software engineering team is ready to ship some digital service into production. And typically, they just do it and then they see what happens. With SRE, that’s not the approach that the team would take. With SRE, before doing the final deployment, the team will get together including the product owner and they will define the so-called service level objectives for the service, and these service level objectives would then quantify the reliability of the service, the reliability that they want the service to fulfill. And then once deployed to production, that reliability, which is quantified, will get monitored and then they will get alerts whenever they don’t fulfill the reliability as envisioned. So you see, it creates a very powerful feedback loop where you apply effectively the tried-and-true scientific method to software operations.

Vladyslav Ukis 00:06:32 So before you deploy to production, you define the SLOs which quantify the reliability that you want your service to provide. And then, once the service is in production, you get feedback from production that tells you whenever you don’t fulfill the reliability that you actually thought the service would provide. So, it provides that powerful additional feedback loop, which is actually pretty tight. And that means that you don’t just do continuous integration in the sense that you’ve got some stages that lead you through some testing towards production. But you also think about the operational aspects much more during the development because there is an ongoing conversation about the quantification of reliability.
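
The feedback loop described above (define an SLO before deployment, then compare what production actually delivers against it) might look roughly like this in code. It is a sketch under stated assumptions: the `SLO` class, its method names, the 99.9% target, and the request counts are hypothetical and chosen only to make the arithmetic visible; the episode does not prescribe any particular implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A service level objective agreed before deployment (hypothetical shape)."""
    name: str
    target: float  # fraction of requests that must succeed, e.g. 0.999

    def error_budget(self, total: int) -> float:
        """Number of failed requests the target tolerates out of `total`."""
        return (1.0 - self.target) * total

    def budget_remaining(self, good: int, total: int) -> float:
        """Tolerated failures minus observed failures; negative means overspent."""
        return self.error_budget(total) - (total - good)

    def is_met(self, good: int, total: int) -> bool:
        """True if the observed success ratio reaches the target."""
        if total == 0:
            return True  # nothing observed yet, so nothing violated
        return good / total >= self.target


# Defined by the team and the product owner before the final deployment.
availability = SLO(name="checkout availability", target=0.999)

# Feedback from production for two example windows of 100,000 requests.
healthy_window = availability.is_met(good=99_950, total=100_000)    # 99.95% succeeded
breaching_window = availability.is_met(good=99_800, total=100_000)  # 99.80% succeeded
```

An alert would be raised for the second window, because the service delivered less reliability than the team quantified up front; that comparison is the tight feedback loop from production back to development.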

[...]
